I have two np.ndarrays: data, with shape (8000, 500), and sample, with shape (1, 500).
What I am trying to achieve is to measure various types of metrics between every row in data and sample.
When using cosine_distances from sklearn.metrics.pairwise, I was able to take advantage of numpy's broadcasting by executing the following line:
x = cosine_distances(data, sample)
But when I tried to use the same procedure with scipy.spatial.distance.cosine, I got the error:
ValueError: Input vector should be 1-D.
I guess this is a broadcasting issue and I'm trying to find a way to get around it.
My ultimate goal is to iterate over all of the distances available in scipy.spatial.distance that can accept two vectors, and apply them to the data and the sample.
How can I replicate the broadcasting that happens automatically in sklearn in my scipy version of the code?
OK, looking at the docs, http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
With (8000, 500) and (1, 500) inputs ((samples, features)), you should get back an (8000, 1) result ((samples1, samples2)). I wouldn't describe that as broadcasting. It's more like a dot product, one that performs some sort of calculation (a norm) over the features (the dimension of size 500), reducing that down to one value. It's more like np.dot(data, sample.T)
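in its handling of dimensions. scipy.spatial.distance.cosine (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html), by contrast, "Computes the Cosine distance between 1-D arrays", so using it here is more like calling it in a loop over each pair of rows from data and sample or, since sample has only one row, over each row of data against sample[0].
In other words, the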
sklearn version does the pairwise iteration (maybe in compiled code), while the scipy.spatial one just evaluates the distance for one pair. pairwise.cosine_similarity normalizes the rows of both inputs and then takes their dot product. That's the dot-like behavior that I mentioned earlier, but with the normalization added.
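To make that concrete, here is a minimal sketch, assuming random arrays with the shapes from the question. It compares three routes to the same (8000, 1) result: an explicit loop over scipy.spatial.distance.cosine, a hand-written normalize-then-dot version, and scipy.spatial.distance.cdist. cdist is not mentioned above; it is my suggestion for getting the pairwise iteration on the scipy side, and the list of metric names at the end is only an illustrative subset.

```python
import numpy as np
from scipy.spatial import distance

# Assumed stand-in data with the shapes from the question.
rng = np.random.default_rng(0)
data = rng.random((8000, 500))
sample = rng.random((1, 500))

# 1. Explicit loop: cosine() wants two 1-D vectors, so pass one row at a time.
looped = np.array([distance.cosine(row, sample[0]) for row in data])

# 2. Dot-like form: normalize each row, take the dot product, convert to a distance.
data_n = data / np.linalg.norm(data, axis=1, keepdims=True)
sample_n = sample / np.linalg.norm(sample, axis=1, keepdims=True)
dotted = 1.0 - data_n @ sample_n.T          # shape (8000, 1)

# 3. cdist does the pairwise iteration for you and returns shape (8000, 1).
pairwise = distance.cdist(data, sample, metric="cosine")

# The same call works for other two-vector metrics (illustrative subset).
metric_names = ("cosine", "euclidean", "cityblock", "chebyshev")
results = {name: distance.cdist(data, sample, metric=name) for name in metric_names}
```

All three cosine results should agree up to floating-point error. The loop has shape (8000,) while the other two have shape (8000, 1), so compare against pairwise[:, 0] or reshape as needed.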