I have two np.ndarrays, data with shape (8000, 500) and sample with shape (1, 500).
What I am trying to achieve is to compute various distance metrics between every row of data and sample.
With sklearn.metrics.pairwise.cosine_distances I was able to take advantage of numpy's broadcasting by executing the following line:
x = cosine_distances(data, sample)
But when I tried to use the same procedure with scipy.spatial.distance.cosine I got the error
ValueError: Input vector should be 1-D.
I guess this is a broadcasting issue and I'm trying to find a way to get around it.
My ultimate goal is to iterate over all of the distances available in scipy.spatial.distance that can accept two vectors and apply them to the data and the sample.
How can I replicate the broadcasting that happens automatically in sklearn in my scipy version of the code?
OK, looking at the docs, http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
With (8000, 500) and (1, 500) inputs, i.e. (samples, features), you should get back an (8000, 1) result, i.e. (samples1, samples2).

I wouldn't describe that as broadcasting. It's more like a dot product that performs some sort of calculation (a norm) over the features (the 500 axis), reducing that down to one value. It's more like np.dot(data, sample.T) in its handling of dimensions.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html is described as "Computes the Cosine distance between 1-D arrays", so matching the sklearn result is more like calling it in a loop over the rows of data and the rows of sample or, since sample has only one row, in a single loop over the rows of data.

In other words, the sklearn version does the pairwise iteration (maybe in compiled code), while the scipy.spatial one just evaluates the distance for one pair.

pairwise.cosine_similarity normalizes the rows of both inputs and then takes the dot product of the first with the transpose of the second. That's the dot-like behavior that I mentioned earlier, but with the normalization added.
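
The row-by-row loop and the normalized dot product described above can be sketched roughly as follows. This is a minimal illustration and not sklearn's actual implementation: the small random arrays stand in for the (8000, 500) and (1, 500) inputs, and the variable names are made up for the example. The cdist call at the end is my own suggestion, not something from the discussion above; it takes the metric by name, which may help when iterating over several scipy distances.

```python
import numpy as np
from scipy.spatial.distance import cosine, cdist

# Small random stand-ins for the (8000, 500) data and (1, 500) sample arrays.
rng = np.random.default_rng(0)
data = rng.random((80, 5))
sample = rng.random((1, 5))

# scipy's cosine() accepts two 1-D vectors, so loop over the rows of data
# and pass the single row of sample as a 1-D array.
looped = np.array([cosine(row, sample[0]) for row in data])  # shape (80,)

# Dot-like version: normalize each row to unit length, then take a matrix
# product, which gives the cosine similarity; 1 - similarity is the distance.
data_n = data / np.linalg.norm(data, axis=1, keepdims=True)
sample_n = sample / np.linalg.norm(sample, axis=1, keepdims=True)
dotted = 1.0 - data_n @ sample_n.T  # shape (80, 1)

# Assumed alternative: cdist runs the pairwise loop for a named metric.
paired = cdist(data, sample, metric="cosine")  # shape (80, 1)
```

The loop returns a 1-D array, while the dot-product and cdist versions return a (samples1, samples2) array like the sklearn function does.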