# Numpy - Finding Nearest Neighbors of a Matrix Multiplication

I have a dataset of a thousand 128-dimensional features, e.g. of shape (1000,128).

I want to find the sorted nearest neighbors of a 128-dimensional feature of shape (128,1).

The distance is calculated via a matrix multiplication between the dataset (1000,128) and the feature (128,1), which gives an array of similarities of shape (1000,1):

## DATASET (1000,128) x FEATURE (128,1) = SIMILARITIES (1000,1)

This is done via:

``````# features.shape=(1000,128) ; feature.shape=(128,1) ; similarities.shape=(1000,1)
similarities = features.dot(feature)
``````

After calculating the distance (similarities), I'm finding the nearest neighbors using the code below:

``````import numpy as np

# Flatten (1000,1) -> (1000,) so the partition runs over the 1000 data points
similarities = similarities.ravel()

# The n Nearest Neighbors Indexes (But Not Sorted)
nearest_neighbours_indexes_unsorted = np.argpartition(similarities, kth=-n)[-n:]

# The n Nearest Neighbors (But Not Sorted)
nearest_neighbours_similarities_unsorted = similarities[nearest_neighbours_indexes_unsorted]

# The Indexes of n Nearest Neighbors, Sorted by Descending Similarity
nearest_neighbours_indexes_sorted = np.flip(nearest_neighbours_indexes_unsorted[np.argsort(nearest_neighbours_similarities_unsorted)], axis=0)
``````

This code runs very fast even for millions of data points (I'd welcome any tip to make it faster), but I want to be able to find the nearest neighbors of more than one feature in one go:

## DATASET (1000,128) x FEATURE (128,n) = SIMILARITIES (1000,n)

One way is to run the above code for each feature in a loop (which is slow); the other is to change the code to accommodate multidimensional indexing, and that's where I'm stuck: I don't know how to write the above code for features of shape (128,n) instead of (128,1).
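For reference, the slow loop-based approach looks like this (a minimal sketch; the function name is illustrative, and each query feature is assumed to be a column of `queries`):

```python
import numpy as np

def nearest_neighbours_loop(features, queries, n):
    """Top-n neighbour indices for each query column, one query at a time.

    features : (m, d) dataset, queries : (d, k) query columns.
    Returns an (n, k) array of indices, sorted by descending similarity.
    """
    results = []
    for j in range(queries.shape[1]):
        sims = features.dot(queries[:, j])                  # (m,) similarities
        unsorted_idx = np.argpartition(sims, kth=-n)[-n:]   # unsorted top n
        sorted_idx = unsorted_idx[np.argsort(sims[unsorted_idx])][::-1]
        results.append(sorted_idx)
    return np.stack(results, axis=1)                        # (n, k)
```

This repeats the partition-then-sort pipeline `k` times, which is exactly the per-query overhead a vectorized version avoids.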

### Helper functions to get largest, smallest n-indices, elements along an axis

Here's a helper function to select the indices of the `n` largest elements along a generic axis of a generic ndarray, making use of `np.argpartition` and `np.take_along_axis` -

``````import numpy as np

def take_largest_indices_along_axis(ar, n, axis):
    # Slicer that keeps the last n entries along `axis`
    s = ar.ndim * [slice(None, None, None)]
    s[axis] = slice(-n, None, None)
    # Unsorted top-n indices via a partial sort
    idx = np.argpartition(ar, kth=-n, axis=axis)[tuple(s)]
    # Order those n by their values, then flip for descending order
    sidx = np.take_along_axis(ar, idx, axis=axis).argsort(axis=axis)
    return np.flip(np.take_along_axis(idx, sidx, axis=axis), axis=axis)
``````

Extending this to get n-smallest indices -

``````def take_smallest_indices_along_axis(ar, n, axis):
    s = ar.ndim * [slice(None, None, None)]
    s[axis] = slice(None, n, None)
    idx = np.argpartition(ar, kth=n, axis=axis)[tuple(s)]
    sidx = np.take_along_axis(ar, idx, axis=axis).argsort(axis=axis)
    return np.take_along_axis(idx, sidx, axis=axis)
``````

And extending these to select the largest or smallest `n` elements themselves is a simple application of `np.take_along_axis`, as listed next -

``````def take_largest_along_axis(ar, n, axis):
    idx = take_largest_indices_along_axis(ar, n, axis)
    return np.take_along_axis(ar, idx, axis=axis)

def take_smallest_along_axis(ar, n, axis):
    idx = take_smallest_indices_along_axis(ar, n, axis)
    return np.take_along_axis(ar, idx, axis=axis)
``````

Sample runs

``````# Sample setup
In : np.random.seed(0)
...: ar = np.random.randint(0,99,(5,5))

In : ar
Out:
array([[44, 47, 64, 67, 67],
       [ 9, 83, 21, 36, 87],
       [70, 88, 88, 12, 58],
       [65, 39, 87, 46, 88],
       [81, 37, 25, 77, 72]])
``````

Take largest `n` indices, elements along axis -

``````In : take_largest_indices_along_axis(ar, n=2, axis=0)
Out:
array([[4, 2, 2, 4, 3],
       [2, 1, 3, 0, 1]])

In : take_largest_indices_along_axis(ar, n=2, axis=1)
Out:
array([[4, 3],
       [4, 1],
       [2, 1],
       [4, 2],
       [0, 3]])

In : take_largest_along_axis(ar, n=2, axis=0)
Out:
array([[81, 88, 88, 77, 88],
       [70, 83, 87, 67, 87]])

In : take_largest_along_axis(ar, n=2, axis=1)
Out:
array([[67, 67],
       [87, 83],
       [88, 88],
       [88, 87],
       [81, 77]])
``````

Take smallest `n` indices, elements along axis -

``````In : take_smallest_indices_along_axis(ar, n=2, axis=0)
Out:
array([[1, 4, 1, 2, 2],
       [0, 3, 4, 1, 0]])

In : take_smallest_indices_along_axis(ar, n=2, axis=1)
Out:
array([[0, 1],
       [0, 2],
       [3, 4],
       [1, 3],
       [2, 1]])

In : take_smallest_along_axis(ar, n=2, axis=0)
Out:
array([[ 9, 37, 21, 12, 58],
       [44, 39, 25, 36, 67]])

In : take_smallest_along_axis(ar, n=2, axis=1)
Out:
array([[44, 47],
       [ 9, 21],
       [12, 58],
       [39, 46],
       [25, 37]])
``````

### Solving our case here

For our case, let's assume the input `similarities` has shape `(1000,128)`, representing 1000 data points with 128 features each, and that we want the largest, say, `n=10` features for each of those data points. Then it would be -

``````take_largest_indices_along_axis(similarities, n=10, axis=1) # indices
take_largest_along_axis(similarities, n=10, axis=1) # elements
``````

The final indices/values array would be of shape `(1000, n)`.

Sample run with the given dataset shape -

``````In : np.random.seed(0)
...: similarities = np.random.randint(0,99,(1000,128))

In : take_largest_indices_along_axis(similarities, n=10, axis=1).shape
Out: (1000, 10)

In : take_largest_along_axis(similarities, n=10, axis=1).shape
Out: (1000, 10)
``````

If instead you were looking to get the `n` largest data points for each of the features, i.e. a final indices/values array of shape `(n, 128)`, then use `axis=0`.
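That `axis=0` case is exactly the multi-query setup from the question: stack the query features as columns, take one matrix product, and apply the helper once (a sketch; the array names and sizes are illustrative):

```python
import numpy as np

def take_largest_indices_along_axis(ar, n, axis):
    # Top-n indices along `axis`, sorted by descending value (as defined above)
    s = ar.ndim * [slice(None, None, None)]
    s[axis] = slice(-n, None, None)
    idx = np.argpartition(ar, kth=-n, axis=axis)[tuple(s)]
    sidx = np.take_along_axis(ar, idx, axis=axis).argsort(axis=axis)
    return np.flip(np.take_along_axis(idx, sidx, axis=axis), axis=axis)

np.random.seed(0)
features = np.random.rand(1000, 128)   # dataset
queries = np.random.rand(128, 5)       # 5 query features as columns

similarities = features.dot(queries)   # (1000, 5), one column per query
top10 = take_largest_indices_along_axis(similarities, n=10, axis=0)
print(top10.shape)                     # (10, 5): column j holds query j's neighbours
```

Each column of `top10` holds the sorted nearest-neighbour indices for the corresponding query, so the loop over queries disappears entirely.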