I have written code using NumPy that takes an array of size (m x n), where the m rows are individual observations, each comprising n features, and creates a square distance matrix of size (m x m). Each entry of this matrix is the distance between a pair of observations, e.g. row 0, column 9 is the distance between observation 0 and observation 9.
import numpy as np
#import cupy as np

def l1_distance(arr):
    return np.linalg.norm(arr, 1)

X = np.random.randint(low=0, high=255, size=(700,4096))
distance = np.empty((700,700))
for i in range(700):
    for j in range(700):
        distance[i,j] = l1_distance(X[i,:] - X[j,:])
I attempted this on a GPU using CuPy by uncommenting the second import statement, but the double for loop is obviously drastically inefficient. It takes NumPy approximately 6 seconds, but CuPy takes 26 seconds. I understand why, but it's not immediately clear to me how to parallelize this process.
I know I'm going to need to write a reduction kernel of some sort, but I can't think of how to construct one CuPy array from iterative operations on the elements of another array.
Using broadcasting, CuPy takes 0.10 seconds on an A100 GPU, compared to 6.6 seconds for NumPy.
Broadcasting vectorizes the inner loop, so the distances from one vector to all the others are computed in parallel.
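The answer does not show its code here, so the following is only a minimal sketch of what a broadcasting version could look like, not necessarily the exact code that was timed. It assumes the inner loop over j is replaced by subtracting row i from the whole array at once. The function name `l1_distance_matrix` is hypothetical, and the demo array is deliberately smaller than the one in the question so that it runs quickly.

```python
import numpy as np
# Assumption: as in the question, swapping in `import cupy as np`
# runs the same code on the GPU.

def l1_distance_matrix(X):
    """Return the (m x m) matrix of pairwise L1 distances between rows of X."""
    m = X.shape[0]
    distance = np.empty((m, m))
    for i in range(m):
        # X[i] has shape (n,) and broadcasts against X of shape (m, n),
        # so one iteration yields the distances from row i to every row.
        distance[i, :] = np.abs(X[i] - X).sum(axis=1)
    return distance

# Small demo input (the question uses size=(700, 4096)).
X = np.random.randint(low=0, high=255, size=(50, 64))
distance = l1_distance_matrix(X)
```

Only one Python-level loop remains, and each iteration allocates a temporary of shape (m, n). Broadcasting both axes at once with `X[:, None, :] - X[None, :, :]` would remove the loop entirely, but at the question's sizes the (m, m, n) intermediate would hold roughly two billion elements, which is why this sketch keeps the outer loop.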