I am trying to figure out how to speed up the following Python code.
Basically, the code builds the matrix of outter products of a matrix C
and stores it as block diagonal sparse matrix.
I use numpy.repeat()
to build indices into the block diagonal.
Profiling the code revealed that calls to numpy.repeat()
take about 50 % of the execution time.
import numpy as np
import scipy.sparse as spspar
L = 1000
K = 100
C = np.random.randn(L,K)
# From the matrix of outter products of C and store in block_diagonal
# sparse matrix
CCt = np.einsum('ij...,i...->ij...',C,C)
# create indices into the block diagonal sparse coo matrix
i = np.tile(np.tile(np.arange(K),K),L) + K*np.repeat(np.arange(L),K*K)
j = np.tile(np.repeat(np.arange(K),K),L) + K*np.repeat(np.arange(L),K*K)
# store as block diagonal sparse coo matrix
BlckCCt = spspar.coo_matrix((CCt.flatten(),(j,i)),shape=(K*K*L,K*K*L))
Initially, I was building the sparse matrix as follows
BlckCCt = spspar.block_diag(CCt,"coo")
This is waay too slow and memory intensive.
Thanks for any input.
Edit: I compared @hjpaul's suggestions using ipython timeit. Here's what I can report
timeit K*np.repeat(np.arange(L),K*K)
10 loops, best of 3: 82.1 ms per loop
timeit (np.zeros((K*K,),int)+np.arange(L)[:,None]).flatten()*K
10 loops, best of 3: 89.9 ms per loop
timeit np.tile(np.arange(L)*K,K*K).reshape(K*K,L).T.flatten()
10 loops, best of 3: 85.5 ms per loop
So it looks like they all take about the same amount (I'm new to ipython profiling so perhaps I'm not comparing them the right way).
Just for reference, your
is the same as
producing a
(L,K,K)
array. For my smaller test casenp.einsum
is 2x faster.sparse.block_diag
converts each submatrix tocoo
, and passes them tosparse.bmat
.bmat
collects therows
,cols
,data
of all the sub matrices into a big arrays similar to yourj, i
, and callscoo_matrix
with those.Doing
ipython
timeit
on various pieces, I agree thatK*np.repeat(np.arange(L),K*K)
is the slowest block of code. Much slower, for example, than thetile
piece.Since you are doing the same
repeat
fori
andj
, can't you do it just once, and use that variable twice?I'll look at that piece some more, but that's a start.
Here's a slight improvement (20%) on the
repeat
:even better (>2x)
I moved
*K
to the smallerarange(L)
, and use fastertile
..T.flatten
takes care of changing the order.As per comment, the reshape should be
(K*K,L)
. I was testing with values where that didn't matter. And relative speeds of these alternatives vary with the relative sizes ofK
andL
.The tiling for the first part of
i
andj
is optional, ifkk
(the 2nd part) is shape (L,K,K) (likeCCt
). Whether it saves time is unclear. Striding is more complicated than with the fully tiled version(0,4,0)
v.(4,)
.)we could do the same with
kk
np.arange(K)[None,None,:] + k1
is (L,1,K), so we need to tile itAnother way to generate these arrays would be to use
np.ix_
to reshape the ranges, and then just sum values.(add
.flatten
as needed). I've tested this on small dimensions and it looks right. I don't know about speed.