I'm facing a problem when trying to shuffle a multi-dimensional array with numpy. The problem can be reproduced with the following code:
import numpy as np
s=(300000, 3000)
n=s[0]
print ("Allocate")
A=np.zeros(s)
B=np.zeros(s)
print ("Index")
idx = np.arange(n)
print ("Shuffle")
idx = np.random.shuffle(idx)
print ("Arrange")
B[:,:] = A[idx,:] # THIS REQUIRES A LARGE AMOUNT OF MEMORY
When I run this code (Python 2.7 as well as Python 3.6, with numpy 1.13.1 on 64-bit Windows 7), the last line requires a large amount of memory (~10 GB), which seems strange to me.
I expected the data simply to be copied from one array to the other, both being pre-allocated, so I can understand that the copy takes time, but not why it requires memory.
I guess I am doing something wrong but can't find what. Maybe someone can help me?
 
                        
From the numpy documentation under 'Index arrays': indexing with an index array always returns a copy of the original data, not a view as one gets for slices.
In other words, your assumption that your line B[:,:] = A[idx,:] (after correcting the line pointed out by @MSeifert) only induces copying of elements from A to B is not correct. Instead, numpy first creates a new array from the indexed A before copying its elements into B.
Why the memory usage changes so much is beyond me. However, looking at your original array shape, s=(300000,3000), this would, for 64-bit numbers, amount to roughly 6.7 GiB (300000 × 3000 × 8 bytes ≈ 7.2 GB), if I didn't calculate wrong. Given that additional array, the extra memory usage actually seems plausible.
EDIT:
Reacting to the OP's comments, I did a few tests concerning the performance of different ways to assign the shuffled rows of A to B. First off, a small test confirmed that B=A[idx,:] indeed creates a new ndarray, not just a view of A: assigning new values to b leaves a unchanged.
Then I ran a few timing tests (min, max, and mean over 7 runs) to find the fastest way of shuffling the rows of A and getting them into B.
In the end, a simple for-loop does not perform too badly, especially if you only want to assign part of the rows, not the entire array. Surprisingly, numba does not seem to enhance performance.
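
For illustration, the copy-versus-view check and the row-wise loop described above could look roughly like the sketch below. This is not the code I timed. The small array shape, the fixed seed, and the variable names are assumptions chosen so that it runs quickly. The last step, np.take with an out array, is an additional option I did not time here. According to the numpy documentation, out is always buffered with the default mode='raise', so the sketch passes mode='clip'.

```python
import numpy as np

# Small shape so the sketch runs quickly; the question used (300000, 3000).
rows, cols = 1000, 30
rng = np.random.RandomState(0)  # fixed seed, chosen only for reproducibility
A = rng.rand(rows, cols)
B = np.zeros_like(A)
C = np.zeros_like(A)

# shuffle() works in place and returns None, so its result is not reassigned.
idx = np.arange(rows)
rng.shuffle(idx)

# 1) Indexing with an index array yields a copy, not a view.
b = A[idx, :]
is_view = np.shares_memory(A, b)
a_before = A.copy()
b[:] = -1.0  # overwrite the indexed result ...
a_unchanged = np.array_equal(A, a_before)  # ... and A keeps its values

# 2) Row-by-row copy: each assignment moves one row between the two
#    pre-allocated arrays, so no temporary of the full array size is needed.
for dst, src in enumerate(idx):
    B[dst, :] = A[src, :]
loop_matches = np.array_equal(B, A[idx, :])

# 3) Untimed alternative: let np.take write straight into a pre-allocated
#    array. mode='clip' is used because out is buffered with mode='raise'.
np.take(A, idx, axis=0, out=C, mode='clip')
take_matches = np.array_equal(C, B)

print(is_view, a_unchanged, loop_matches, take_matches)
```

Whether the loop or np.take is faster for the full-size arrays would have to be measured on the target machine. The point of the sketch is only that neither needs a second full-size temporary.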