Issue:
I am trying to repack a subset of rows and fields from a large numpy structured array.
When I use a slice, I am able to use repack_fields
, but when I use a range
I am not. Calling
range
before repack_fields
appears to be allocating all of the memory needed by the original array.
Example:
Below is an example where I limit the available memory in order to introduce the error I am observing my use case.
import numpy as np
from numpy.lib.recfunctions import repack_fields
import resource
resource.setrlimit(resource.RLIMIT_AS, (int(1.5e9), int(1.5e9)))
H = np.zeros(100, dtype=[('f1', int), ('f2', int), ('large', float, 1000000)])
print('Using slicing: ')
repack_fields(H[['f1', 'f2']][0:50])
print('Using range: ')
repack_fields(H[['f1', 'f2']][range(0, 50)])
produces the output:
Using slicing:
Using range:
Traceback (most recent call last):
File "test.py", line 12, in <module>
repack_fields(H[['f1', 'f2']][range(0, 50)])
MemoryError: Unable to allocate 381. MiB for an array with shape (50,) and data type {'names':['f1','f2'], 'formats':['<i8','<i8'], 'offsets':[0,8], 'itemsize':8000016}
Questions:
Why is the behavior of
range(0, 50)
different than0:50
? (A list also doesn't work.) I know in the above example, one could repack the fields first, and then reference the rows. (That is,repack_fields(H[['f1', 'f2']])[range(0, 50)]
works.) But I don't want to have to know whether it is better to get rows first or fields first.What is the correct way to take a subset of rows/fields from a large numpy structured array? (even when the rows are not consecutive)?
Both
[['f1', 'f2']]
and[0:50]
produce aview
, one because it's a multifield index, and the other because it's a slice (basic indexing). So that doesn't require new memory.repack_fields
makes a new array with space for just those 2 fields, and 50 records, and copies values from theview
.Again the fields index is a
view
, referencing the whole structure, including thelarge
field.[range...]
is advanced indexing, making a copy that includes 50 records of 'large'.Look at the error:
There's the 381 MB that it's trying to allocate. The error occurs in the
[range(50)]
indexing. It hasn't gotten to the repack yet.So you have to understand two things.
Indexing with a slice makes a
view
that does not consume added memory. Indexing withrange
or list (or array) is 'advanced indexing' and makes a copy.multifield indexing makes a
view
. Even if the fields are a subset of the source, the itemsize is still the original size. The purpose ofrepack
is to make a new array with new dtype and new itemsize, containing just the values for the selected fields.