Multiply high order matrices with numpy

1.2k views Asked by At

I created this toy problem that reflects my much bigger problem:

import numpy as np

ind = np.ones((3,2,4)) # shape=(3L, 2L, 4L)
dist = np.array([[0.1,0.3],[1,2],[0,1]]) # shape=(3L, 2L)

ans = np.array([np.dot(dist[i],ind[i]) for i in xrange(dist.shape[0])]) # shape=(3L, 4L)
print ans

""" prints:
   [[ 0.4  0.4  0.4  0.4]
    [ 3.   3.   3.   3. ]
    [ 1.   1.   1.   1. ]]
"""

I want to do it as fast as possible, so using numpy's functions to calculate ans should be the best approach, since this operation is heavy and my matrices are quite big.

I saw this post, but the shapes are different and I cannot understand which axes I should use for this problem. However, I'm certain that tensordot should have the answer. Any suggestions?

EDIT: I accepted @ajcr's answer, but please read my own answer as well, it may help others...

2

There are 2 answers

0
Alex Riley On BEST ANSWER

You could use np.einsum to do the operation since it allows for very careful control over which axes are multiplied and which are summed:

>>> np.einsum('ijk,ij->ik', ind, dist)
array([[ 0.4,  0.4,  0.4,  0.4],
       [ 3. ,  3. ,  3. ,  3. ],
       [ 1. ,  1. ,  1. ,  1. ]])

The function multiplies the entries in the first axis of ind with the entries in the first axis of dist (subscript 'i'). Ditto for the second axis of each array (subscript 'j'). Instead of returning a 3D array, we tell einsum to sum the values along axis 'j' by omitting it from the output subscripts, thereby returning a 2D array.


np.tensordot is more difficult to apply to this problem. It automatically sums the products of axes. However, we want two sets of products but to sum only one of them.

Writing np.tensordot(ind, dist, axes=[1, 1]) (as in the answer you linked to) computes the correct values for you, but returns a 3D array with shape (3, 4, 3). If you can afford the memory cost of a larger array, you could use:

np.tensordot(ind, dist, axes=[1, 1])[0].T

This gives you the correct result, but because tensordot creates a much larger-than-necessary array first, einsum seems to be a better option.

2
AvidLearner On

Following @ajcr's great answer, I wanted to determine which method is the fastest, so I used timeit :

import timeit

setup_code = """
import numpy as np
i,j,k = (300,200,400)
ind = np.ones((i,j,k)) #shape=(3L, 2L, 4L)
dist = np.random.rand(i,j) #shape=(3L, 2L)
"""

basic ="np.array([np.dot(dist[l],ind[l]) for l in xrange(dist.shape[0])])"
einsum = "np.einsum('ijk,ij->ik', ind, dist)"
tensor= "np.tensordot(ind, dist, axes=[1, 1])[0].T"

print "tensor - total time:", min(timeit.repeat(stmt=tensor,setup=setup_code,number=10,repeat=3))
print "basic - total time:", min(timeit.repeat(stmt=basic,setup=setup_code,number=10,repeat=3))
print "einsum - total time:", min(timeit.repeat(stmt=einsum,setup=setup_code,number=10,repeat=3))

The surprising results were:

tensor - total time: 6.59519493952
basic - total time: 0.159871203461
einsum - total time: 0.263569731028

So obviously using tensordot was the wrong way to do it (not to mention memory error in bigger examples, just as @ajcr stated).

Since this example was small, I changed the matrices size to be i,j,k = (3000,200,400), flipped the order just to be sure it has not effect and set up another test with higher numbers of repetitions:

print "einsum - total time:", min(timeit.repeat(stmt=einsum,setup=setup_code,number=50,repeat=3))
print "basic - total time:", min(timeit.repeat(stmt=basic,setup=setup_code,number=50,repeat=3))

The results were consistent with the first run:

einsum - total time: 13.3184077671
basic - total time: 8.44810031351

However, testing another type of size growth - i,j,k = (30000,20,40) led the following results:

einsum - total time: 0.325594117768
basic - total time: 0.926416766397

See the comments for explanations for these results.

The moral is, when looking for the fastest solution for a specific problem, try to generate data that is as similar to the original data as possible, in term of types and shapes. In my case i is much smaller than j,k and so I stayed with the ugly version, which is also the fastest in this case.