Numpy matrix row comparison


This question is mainly about the performance of the calculation.

I have two matrices with the same number of columns but different numbers of rows. One matrix is the 'pattern', and each of its rows has to be compared separately against all rows of the other matrix, so that statistics can be extracted: the mean number of elements equal to the pattern, the standard deviation, etc. So I have the following matrices, and the computation is this:

numCols = 10
pattern = np.random.randint(0,2,size=(7,numCols))
matrix = np.random.randint(0,2,size=(5,numCols))

comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = np.mean(np.sum(pattern[i,:] == matrix, axis=1))

print(comp_mean) # Output example: [ 1.6  1.   1.6  2.2  2.   2.   1.6]

This is clear. The problem is that in practice both matrices have far more rows (~1,000,000), so this code runs very slowly. I tried a pure-NumPy formulation, since vectorizing sometimes surprises me with better runtimes. So I wrote the following code (it may look strange, but it works!):

comp_mean = np.mean( np.sum( (pattern[np.repeat(np.arange(pattern.shape[0]), matrix.shape[0])].ravel() == np.tile(matrix.ravel(),pattern.shape[0])).reshape(pattern.shape[0],matrix.shape[0],matrix.shape[1]), axis=2 ),axis=1)
print(comp_mean)
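For reference, this flattened repeat/tile expression can be checked against the straightforward loop on small random inputs (the sizes below are illustrative, not the real ~1,000,000-row case):

```python
import numpy as np

np.random.seed(0)  # small illustrative sizes, not the real data
numCols = 10
pattern = np.random.randint(0, 2, size=(7, numCols))
matrix = np.random.randint(0, 2, size=(5, numCols))

# Straightforward loop version.
loop_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    loop_mean[i] = np.mean(np.sum(pattern[i, :] == matrix, axis=1))

# Flattened repeat/tile version from the question.
flat_mean = np.mean(
    np.sum(
        (pattern[np.repeat(np.arange(pattern.shape[0]), matrix.shape[0])].ravel()
         == np.tile(matrix.ravel(), pattern.shape[0])
         ).reshape(pattern.shape[0], matrix.shape[0], matrix.shape[1]),
        axis=2),
    axis=1)

assert np.allclose(loop_mean, flat_mean)
```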

However, this code is even slower than the previous version with the `for` loop. So I would like to know if there is any way to speed up the calculation.

EDIT

I have checked the runtimes of the different approaches on the real matrices, and the results are the following:

  • Me - Approach 1: 18.04 seconds
  • Me - Approach 2: 303.10 seconds
  • Divakar - Approach 1: 18.79 seconds
  • Divakar - Approach 2: 65.11 seconds
  • Divakar - Approach 3.1: 137.78 seconds
  • Divakar - Approach 3.2: 59.59 seconds
  • Divakar - Approach 4: 6.06 seconds

EDIT(2)

The previous runs were performed on a laptop. I have now run the code on a desktop, dropping the worst-performing approaches, and the new runtimes are different:

  • Me - Approach 1: 6.25 seconds
  • Divakar - Approach 1: 4.01 seconds
  • Divakar - Approach 2: 3.66 seconds
  • Divakar - Approach 4: 3.12 seconds

There are 2 answers

Divakar (accepted answer)

A few approaches using broadcasting could be suggested here.

Approach #1

out = np.mean(np.sum(pattern[:,None,:] == matrix[None,:,:],2),1)
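As a quick sanity check (again with small illustrative sizes), the broadcasted comparison produces the same values as the original loop:

```python
import numpy as np

np.random.seed(1)
pattern = np.random.randint(0, 2, size=(7, 10))
matrix = np.random.randint(0, 2, size=(5, 10))

# pattern[:,None,:] has shape (7,1,10) and matrix[None,:,:] has shape (1,5,10);
# the comparison broadcasts to (7,5,10), then we sum over columns (axis 2)
# and take the mean over matrix rows (axis 1).
out = np.mean(np.sum(pattern[:, None, :] == matrix[None, :, :], 2), 1)

expected = np.array([np.mean(np.sum(p == matrix, axis=1)) for p in pattern])
assert np.allclose(out, expected)
```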

Approach #2

mrows = matrix.shape[0]
prows = pattern.shape[0]
out = (pattern[:,None,:] == matrix[None,:,:]).reshape(prows,-1).sum(1)/mrows

Approach #3

mrows = matrix.shape[0]
prows = pattern.shape[0]
out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:]).astype(int))/mrows
# OR out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:])+0)/mrows
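A note on the cast: `np.einsum` reduces in the array's own dtype, so summing the raw boolean array would accumulate in bool and saturate at `True`; hence the `.astype(int)` (or the `+0`) before the reduction. A small check that both spellings agree with a direct sum:

```python
import numpy as np

np.random.seed(2)
pattern = np.random.randint(0, 2, size=(7, 10))
matrix = np.random.randint(0, 2, size=(5, 10))
mrows = matrix.shape[0]

eq = pattern[:, None, :] == matrix[None, :, :]  # boolean, shape (7, 5, 10)

# Cast to int before the einsum reduction; both spellings are equivalent.
out_cast = np.einsum('ijk->i', eq.astype(int)) / mrows
out_plus = np.einsum('ijk->i', eq + 0) / mrows

expected = eq.sum(axis=(1, 2)) / mrows
assert np.allclose(out_cast, expected)
assert np.allclose(out_plus, expected)
```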

Approach #4

If the number of rows in matrix is huge, it could be better to stick with a for-loop to avoid the huge memory requirements of the broadcasting approaches, which might themselves lead to slow runtimes. Instead, we could optimize within each loop iteration. Here's one such possible optimization:

mrows = matrix.shape[0]
comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = (pattern[i,:] == matrix).sum()
comp_mean = comp_mean/mrows
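The saving inside each iteration comes from replacing `np.mean(np.sum(..., axis=1))` with a single `.sum()` over the whole boolean array, deferring one division to the very end: the total match count divided by the number of matrix rows equals the mean number of matches per row. A small equivalence check:

```python
import numpy as np

np.random.seed(3)
pattern = np.random.randint(0, 2, size=(7, 10))
matrix = np.random.randint(0, 2, size=(5, 10))
mrows = matrix.shape[0]

comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = (pattern[i, :] == matrix).sum()  # total matches for this row
comp_mean = comp_mean / mrows  # one division at the end

expected = np.array([np.mean(np.sum(p == matrix, axis=1)) for p in pattern])
assert np.allclose(comp_mean, expected)
```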
Thomas Baruchel

Could you try this:

import numpy as np
import scipy.ndimage  # histogram lives in scipy.ndimage.measurements in older SciPy

comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    # Label each matrix entry by the corresponding pattern value (broadcast
    # the pattern row to the matrix shape, since ndimage expects labels with
    # the same shape as the input), then histogram the matrix values
    # (2 bins: 0 and 1) separately for label 0 and label 1.
    labels = np.broadcast_to(pattern[i], matrix.shape)
    m = scipy.ndimage.histogram(matrix, 0, 1, 2, labels, [0, 1])
    # Matches: matrix==0 where pattern==0, plus matrix==1 where pattern==1.
    comp_mean[i] = m[0][0] + m[1][1]
comp_mean /= matrix.shape[0]

Regards.