Why is copying from one ndarray to another ndarray so memory consuming?


I'm facing a problem when trying to shuffle a multi-dimensional array with numpy. The problem can be reproduced with the following code:

import numpy as np
s = (300000, 3000)
n = s[0]
print("Allocate")
A = np.zeros(s)
B = np.zeros(s)
print("Index")
idx = np.arange(n)
print("Shuffle")
np.random.shuffle(idx)  # shuffles in place and returns None, so don't assign the result
print("Arrange")
B[:,:] = A[idx,:]  # THIS REQUIRES A LARGE AMOUNT OF MEMORY

When running this code (Python 2.7 as well as Python 3.6 with NumPy 1.13.1 on Windows 7 64-bit), the execution of the last line requires a large amount of memory (~10 GB), which sounds strange to me.

Actually, I expect the data to be copied from one array to the other, both being pre-allocated, so I can understand that the copy consumes time, but I don't understand why it requires memory.

I guess I'm doing something wrong but can't find what... maybe someone can help me?


2 Answers

Thomas Kühn (accepted answer):

From the numpy documentation under 'Index arrays':

NumPy arrays may be indexed with other arrays (or any other sequence-like object that can be converted to an array, such as lists, with the exception of tuples; see the end of this document for why this is). The use of index arrays ranges from simple, straightforward cases to complex, hard-to-understand cases. For all cases of index arrays, what is returned is a copy of the original data, not a view as one gets for slices.

In other words, your assumption that your line B[:,:] = A[idx,:] (with the shuffle line corrected as pointed out by @MSeifert: np.random.shuffle works in place and returns None, so assigning its result would leave idx as None) only induces copying of elements from A to B is not correct. Instead, numpy first creates a new array from the indexed A before copying its elements into B.

Why the memory usage changes so much puzzled me at first. However, looking at your original array shape, s=(300000, 3000), a single array of 64-bit floats amounts to roughly 6.7 GB, if I didn't calculate wrong. Since indexing creates exactly such an additional array, the extra memory usage actually seems plausible.
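To double-check that figure without allocating anything large, here is a quick sketch of the same computation via the dtype's itemsize (my addition, not part of the original answer):

>>> import numpy as np
>>> 300000 * 3000 * np.dtype(np.float64).itemsize / 1024.**3
6.705522537231445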

EDIT:

Reacting to the OP's comments, I did a few tests concerning the performance of different ways to assign the shuffled rows of A to B. First off, here is a small test showing that b = a[idx,:] indeed creates a new ndarray, not just a view of a:

>>> import numpy as np
>>> a = np.arange(9).reshape(3,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> b = a[[2,0,1],:]
>>> b
array([[6, 7, 8],
       [0, 1, 2],
       [3, 4, 5]])
>>> b[0]=-5
>>> b
array([[-5, -5, -5],
       [ 0,  1,  2],
       [ 3,  4,  5]])
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
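As a side check (not part of the original session), np.shares_memory makes the copy-vs-view distinction explicit: indexing with an index array copies, while plain slicing returns a view:

>>> np.shares_memory(a, a[[2, 0, 1], :])  # index array -> copy
False
>>> np.shares_memory(a, a[0:2, :])        # slice -> view
True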

So indeed, assigning new values to b leaves a unchanged. Then I did a few timing tests concerning the fastest way to shuffle the rows of A and get them into B:

import numpy as np
import timeit
import numba as nb

s=(300000, 3000)
A = np.arange(s[0]*s[1]).reshape(s)
idx = np.arange(s[0])

#directly keep the indexed array
def test1(x,idx):  
    return x[idx,:]

#the method of the OP
def test2(x, y, idx):
    y[:,:]=x[idx,:]
    return y

#using a simple for loop, e.g. if only part of the rows should be assigned
def test3(x,y,idx):
    for i in range(len(idx)):
        y[i,:] = x[idx[i],:]
    return y

#like test3, but numba-compiled
@nb.jit(nopython=True)
def test4(x,y,idx):
    for i in range(len(idx)):
        y[i,:] = x[idx[i],:]
    return y

B = np.zeros(s)
res = timeit.Timer(
    'test1(A,idx)',
    setup = 'from __main__ import test1, A, idx'
    ).repeat(7,1)

print('test 1:', np.min(res), np.max(res), np.mean(res))

B = np.zeros(s)
res = timeit.Timer(
    'test2(A,B,idx)',
    setup = 'from __main__ import test2, A, B, idx'
    ).repeat(7,1)

print('test 2:', np.min(res), np.max(res), np.mean(res))


B = np.zeros(s)
res = timeit.Timer(
    'test3(A,B,idx)',
    setup = 'from __main__ import test3, A, B, idx'
    ).repeat(7,1)

print('test 3:', np.min(res), np.max(res), np.mean(res))


B = np.zeros(s)
res = timeit.Timer(
    'test4(A,B,idx)',
    setup = 'from __main__ import test4, A, B, idx'
    ).repeat(7,1)

print('test 4:', np.min(res), np.max(res), np.mean(res))

The results (min, max, mean, in seconds) of 7 runs are:

test 1: 19.880664938 21.354912988 20.2604536371
test 2: 73.419507756 139.534279557 122.949712777
test 3: 40.030043285 78.001182537 64.7852914216
test 4: 40.001512514 73.397133578 62.0058947516

In the end, a simple for loop does not perform too badly, especially if only part of the rows should be assigned rather than the entire array. Surprisingly, numba does not seem to enhance performance here.
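If memory rather than speed is the main concern, a chunked variant of the loop bounds the size of the temporary that fancy indexing creates. This is my own sketch, not from the original answer; the helper name and chunk size are arbitrary:

import numpy as np

def assign_rows_chunked(src, dst, idx, chunk=10000):
    # Copy src[idx] into dst in blocks of `chunk` rows, so the
    # temporary created by fancy indexing never exceeds `chunk` rows.
    for start in range(0, len(idx), chunk):
        stop = start + chunk
        dst[start:stop, :] = src[idx[start:stop], :]
    return dst

For the shapes above, chunk=10000 keeps the temporary at roughly 230 MB instead of the full 6.7 GB.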

MSeifert:

The problem isn't the copying; the problem is that your arrays are huge:

>>> 300000 * 3000 * 8 / 1024 / 1024 / 1024  # 8 byte floats, 300000 * 3000 elements converted to GB
6.705522537231445

So each of the arrays is almost 7 GB. Then how come the memory usage only spikes at the assignment line B[:,:] = A[idx,:]?

That's because np.zeros doesn't actually commit physical memory for the array until you use it (the operating system hands out zeroed pages lazily). And you don't use it until you index it (in the case of A: A[idx, :]) or assign to it (in the case of B: B[:,:] =).
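One way to watch this happen is to track the process's resident set size. Here is a sketch using the third-party psutil package (my addition; the exact numbers are OS- and allocator-dependent):

import os
import numpy as np
import psutil

proc = psutil.Process(os.getpid())

def rss_gb():
    # resident (physical) memory of this process, in GB
    return proc.memory_info().rss / 1024.**3

print("start:       %.2f GB" % rss_gb())
A = np.zeros((300000, 3000))              # reserves address space only
print("after zeros: %.2f GB" % rss_gb())  # typically barely changes
A[:] = 1.0                                # touching the pages commits them
print("after touch: %.2f GB" % rss_gb())  # jumps by roughly 6.7 GB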

So there's nothing strange going on; it's just the amount of memory you actually need for A and B.
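One last side note (my addition, not from either answer): if the original row order is not needed afterwards, np.random.shuffle also accepts a 2-D array and shuffles its rows in place along the first axis, so the second ~7 GB array can be avoided entirely:

import numpy as np

A = np.zeros((300000, 3000))
np.random.shuffle(A)  # permutes the rows of A in place, no second array needed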