Numpy matrix row comparison


This question is mainly about the performance of the calculation.

I have two matrices with the same number of columns but different numbers of rows. One matrix is the 'pattern', and each of its rows has to be compared separately against all rows of the other matrix, so that statistics can be extracted: the mean number of elements equal to the pattern, the standard deviation, etc. So I have the following matrices, and the computation is this:

numCols = 10
pattern = np.random.randint(0,2,size=(7,numCols))
matrix = np.random.randint(0,2,size=(5,numCols))

comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = np.mean(np.sum(pattern[i,:] == matrix, axis=1))

print(comp_mean) # Output example: [ 1.6  1.   1.6  2.2  2.   2.   1.6]

This is clear. The problem is that in practice both matrices have far more rows (~1,000,000), so this code runs very slowly. I tried a pure-NumPy formulation, since vectorizing sometimes surprises me with better runtimes. So I wrote the following code (it may look strange, but it works!):

comp_mean = np.mean( np.sum( (pattern[np.repeat(np.arange(pattern.shape[0]), matrix.shape[0])].ravel() == np.tile(matrix.ravel(),pattern.shape[0])).reshape(pattern.shape[0],matrix.shape[0],matrix.shape[1]), axis=2 ),axis=1)
print(comp_mean)
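For reference, this flattened repeat/tile expression can be checked against the straightforward loop on small random inputs (the sizes below are illustrative, not the real ~1,000,000-row case):

```python
import numpy as np

np.random.seed(0)  # small illustrative sizes, not the real data
numCols = 10
pattern = np.random.randint(0, 2, size=(7, numCols))
matrix = np.random.randint(0, 2, size=(5, numCols))

# Straightforward loop version.
loop_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    loop_mean[i] = np.mean(np.sum(pattern[i, :] == matrix, axis=1))

# Flattened repeat/tile version from the question.
flat_mean = np.mean(
    np.sum(
        (pattern[np.repeat(np.arange(pattern.shape[0]), matrix.shape[0])].ravel()
         == np.tile(matrix.ravel(), pattern.shape[0])
         ).reshape(pattern.shape[0], matrix.shape[0], matrix.shape[1]),
        axis=2),
    axis=1)

assert np.allclose(loop_mean, flat_mean)
```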

However, this code is even slower than the previous version with the `for` loop. So I would like to know if there is any way to speed up the calculation.

EDIT

I have checked the runtimes of the different approaches on the real matrices, and the results are the following:

  • Me - Approach 1: 18.04 seconds
  • Me - Approach 2: 303.10 seconds
  • Divakar - Approach 1: 18.79 seconds
  • Divakar - Approach 2: 65.11 seconds
  • Divakar - Approach 3.1: 137.78 seconds
  • Divakar - Approach 3.2: 59.59 seconds
  • Divakar - Approach 4: 6.06 seconds

EDIT(2)

The previous runs were performed on a laptop. I have now run the code on a desktop, dropping the worst-performing approaches, and the new runtimes are different:

  • Me - Approach 1: 6.25 seconds
  • Divakar - Approach 1: 4.01 seconds
  • Divakar - Approach 2: 3.66 seconds
  • Divakar - Approach 4: 3.12 seconds

There are 2 answers

Divakar (accepted answer)

A few approaches using broadcasting could be suggested here.

Approach #1

out = np.mean(np.sum(pattern[:,None,:] == matrix[None,:,:],2),1)
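As a quick sanity check (again with small illustrative sizes), the broadcasted comparison produces the same values as the original loop:

```python
import numpy as np

np.random.seed(1)
pattern = np.random.randint(0, 2, size=(7, 10))
matrix = np.random.randint(0, 2, size=(5, 10))

# pattern[:,None,:] has shape (7,1,10) and matrix[None,:,:] has shape (1,5,10);
# the comparison broadcasts to (7,5,10), then we sum over columns (axis 2)
# and take the mean over matrix rows (axis 1).
out = np.mean(np.sum(pattern[:, None, :] == matrix[None, :, :], 2), 1)

expected = np.array([np.mean(np.sum(p == matrix, axis=1)) for p in pattern])
assert np.allclose(out, expected)
```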

Approach #2

mrows = matrix.shape[0]
prows = pattern.shape[0]
out = (pattern[:,None,:] == matrix[None,:,:]).reshape(prows,-1).sum(1)/mrows

Approach #3

mrows = matrix.shape[0]
prows = pattern.shape[0]
out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:]).astype(int))/mrows
# OR out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:])+0)/mrows
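A note on the cast: `np.einsum` reduces in the array's own dtype, so summing the raw boolean array would accumulate in bool and saturate at `True`; hence the `.astype(int)` (or the `+0`) before the reduction. A small check that both spellings agree with a direct sum:

```python
import numpy as np

np.random.seed(2)
pattern = np.random.randint(0, 2, size=(7, 10))
matrix = np.random.randint(0, 2, size=(5, 10))
mrows = matrix.shape[0]

eq = pattern[:, None, :] == matrix[None, :, :]  # boolean, shape (7, 5, 10)

# Cast to int before the einsum reduction; both spellings are equivalent.
out_cast = np.einsum('ijk->i', eq.astype(int)) / mrows
out_plus = np.einsum('ijk->i', eq + 0) / mrows

expected = eq.sum(axis=(1, 2)) / mrows
assert np.allclose(out_cast, expected)
assert np.allclose(out_plus, expected)
```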

Approach #4

If the number of rows in matrix is huge, it could be better to stick with a for-loop to avoid the huge memory requirements of the broadcasting approaches, which might themselves lead to slow runtimes. Instead, we could optimize within each loop iteration. Here's one such possible optimization:

mrows = matrix.shape[0]
comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = (pattern[i,:] == matrix).sum()
comp_mean = comp_mean/mrows
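The saving inside each iteration comes from replacing `np.mean(np.sum(..., axis=1))` with a single `.sum()` over the whole boolean array, deferring one division to the very end: the total match count divided by the number of matrix rows equals the mean number of matches per row. A small equivalence check:

```python
import numpy as np

np.random.seed(3)
pattern = np.random.randint(0, 2, size=(7, 10))
matrix = np.random.randint(0, 2, size=(5, 10))
mrows = matrix.shape[0]

comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = (pattern[i, :] == matrix).sum()  # total matches for this row
comp_mean = comp_mean / mrows  # one division at the end

expected = np.array([np.mean(np.sum(p == matrix, axis=1)) for p in pattern])
assert np.allclose(comp_mean, expected)
```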
Thomas Baruchel

Could you try this:

import numpy as np
import scipy.ndimage  # histogram lives in scipy.ndimage.measurements in older SciPy

comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    # Label each matrix entry by the corresponding pattern value (broadcast
    # the pattern row to the matrix shape, since ndimage expects labels with
    # the same shape as the input), then histogram the matrix values
    # (2 bins: 0 and 1) separately for label 0 and label 1.
    labels = np.broadcast_to(pattern[i], matrix.shape)
    m = scipy.ndimage.histogram(matrix, 0, 1, 2, labels, [0, 1])
    # Matches: matrix==0 where pattern==0, plus matrix==1 where pattern==1.
    comp_mean[i] = m[0][0] + m[1][1]
comp_mean /= matrix.shape[0]

Regards.