Auto broadcasting in Scipy

I have two np.ndarrays, data with shape (8000, 500) and sample with shape (1, 500).

What I am trying to achieve is measure various types of metrics between every row in data to sample.

When using sklearn.metrics.pairwise.cosine_distances, I was able to take advantage of numpy's broadcasting by executing the following line:

x = cosine_distances(data, sample)

But when I tried the same procedure with scipy.spatial.distance.cosine, I got the error:

ValueError: Input vector should be 1-D.

I guess this is a broadcasting issue and I'm trying to find a way to get around it.

My ultimate goal is to iterate over all of the distances available in scipy.spatial.distance that can accept two vectors and apply them to the data and the sample.

How can I replicate the broadcasting that sklearn does automatically in the scipy version of my code?

There is 1 answer

Answer by hpaulj (best answer)

OK, looking at the docs, http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html

With (8000,500) and (1,500) inputs, each laid out as (samples, features), you should get back an (8000,1) result, laid out as (samples1, samples2).

I wouldn't describe that as broadcasting. It's more like a dot product: it performs some sort of calculation (a norm) over the features axis (the 500 dimension), reducing it to one value. In its handling of dimensions it is closer to np.dot(data, sample.T).
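To make the distinction concrete, here is a minimal sketch of how the two kinds of operation treat the shapes from the question. The arrays are filled with ones purely as placeholders; only the shapes matter.

```python
import numpy as np

# Placeholder arrays with the shapes from the question.
data = np.ones((8000, 500))
sample = np.ones((1, 500))

# Broadcasting: sample is stretched along axis 0, and the features axis survives.
broadcast = data * sample
print(broadcast.shape)   # (8000, 500)

# Dot-like reduction: the features axis is summed away, one value per row pair.
reduced = np.dot(data, sample.T)
print(reduced.shape)     # (8000, 1)
```

The pairwise distance functions produce the second kind of shape, which is why "broadcasting" is not quite the right word for what they do.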

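scipy.spatial.distance.cosine (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html), on the other hand, is documented as "Computes the Cosine distance between 1-D arrays", so applying it to your inputs is more like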

from scipy.spatial.distance import cosine

for row in data:
    for s in sample:
        d = cosine(row, s)  # one scalar distance per (row, s) pair

or since sample has only one row

distances = np.array([cosine(row, sample[0]) for row in data])

In other words, the sklearn version does the pairwise iteration (maybe in compiled code), while the scipy.spatial function just evaluates the distance for one pair.
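If the goal is to get the pairwise iteration on the scipy side as well, one option (not mentioned above, so treat it as a suggestion rather than the only way) is scipy.spatial.distance.cdist, which takes two 2-D arrays and a metric name. The sketch below uses small random arrays as stand-ins for data and sample, and the list of metric names is just an illustrative subset.

```python
import numpy as np
from scipy.spatial.distance import cdist, cosine

rng = np.random.default_rng(0)
data = rng.random((8, 5))     # stand-in for the (8000, 500) array
sample = rng.random((1, 5))   # stand-in for the (1, 500) array

# Row-by-row evaluation with the 1-D function, as in the list comprehension.
looped = np.array([cosine(row, sample[0]) for row in data])

# cdist does the pairwise iteration itself and returns shape (n_data, n_sample).
pairwise = cdist(data, sample, metric="cosine")

# The same call works for other metric names, which helps when trying many distances.
results = {name: cdist(data, sample, metric=name)
           for name in ("cosine", "euclidean", "cityblock", "chebyshev")}
```

Here pairwise[:, 0] should match looped up to floating-point error, and every entry in results has the same (8, 1) layout as the sklearn output.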

pairwise.cosine_similarity does

 # K(X, Y) = <X, Y> / (||X||*||Y||)
 K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)

That's the dot-like behavior I mentioned earlier, with the normalization added.
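The same idea can be sketched in plain numpy. This is not sklearn's actual implementation, only an approximation of the formula in the comment: normalize each row, then take a dot product. It assumes no row has zero norm, since that would divide by zero.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((8, 5))     # stand-in for the (8000, 500) array
sample = rng.random((1, 5))   # stand-in for the (1, 500) array

# Scale every row to unit length (assumes no all-zero rows).
data_n = data / np.linalg.norm(data, axis=1, keepdims=True)
sample_n = sample / np.linalg.norm(sample, axis=1, keepdims=True)

# K(X, Y) = <X, Y> / (||X|| * ||Y||), computed for all row pairs at once.
similarity = data_n @ sample_n.T   # shape (8, 1)
distance = 1.0 - similarity        # cosine distance, same shape
```

The division uses ordinary broadcasting (keepdims=True keeps the norms as a column), while the matrix product is what collapses the features axis.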