I would like to compute the Earth Mover Distance between two 2D arrays (these are not images).
Right now I go through two libraries: scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html) and pyemd (https://pypi.org/project/pyemd/).
#define a sampeling method
def sampeling2D(n, mu1, std1, mu2, std2):
#sample from N(0, 1) in the 2D hyperspace
x = np.random.randn(n, 2)
#scale N(0, 1) -> N(mu, std)
x[:,0] = (x[:,0]*std1) + mu1
x[:,1] = (x[:,1]*std2) + mu2
return x
#generate two sets
Y1 = sampeling2D(1000, 0, 1, 0, 1)
Y2 = sampeling2D(1000, -1, 1, -1, 1)
#compute the distance
distance = pyemd.emd_samples(Y1, Y2)
While the scipy version doesn't accept 2D arrays and it returns an error, the pyemd method returns a value. If you see from the documentation, it says that it accept only 1D arrays, so I think that the output is wrong. How can I calculate this distance in this case?
So if I understand you correctly, you're trying to transport the sampling distribution, i.e. calculate the distance for a setup where all clusters have weight 1. In general, you can treat the calculation of the EMD as an instance of minimum cost flow, and in your case, this boils down to the linear assignment problem: Your two arrays are the partitions in a bipartite graph, and the weights between two vertices are your distance of choice. Assuming that you want to use the Euclidean norm as your metric, the weights of the edges, i.e. the ground distances, may be obtained using
scipy.spatial.distance.cdist
, and in fact SciPy provides a solver for the linear sum assignment problem as well inscipy.optimize.linear_sum_assignment
(which recently saw huge performance improvements which are available in SciPy 1.4. This could be of interest to you, should you run into performance problems; the 1.3 implementation is a bit slow for 1000x1000 inputs).In other words, what you want to do boils down to
It is also possible to use
scipy.sparse.csgraph.min_weight_bipartite_full_matching
as a drop-in replacement forlinear_sum_assignment
; while made for sparse inputs (which yours certainly isn't), it might provide performance improvements in some situations.It might be instructive to verify that the result of this calculation matches what you would get from a minimum cost flow solver; one such solver is available in NetworkX, where we can construct the graph by hand:
At this point, we can verify that the approach above agrees with the minimum cost flow:
Similarly, it's instructive to see that the result agrees with
scipy.stats.wasserstein_distance
for 1-dimensional inputs: