Numpy repack_fields with range or a list allocates memory



I am trying to repack a subset of rows and fields from a large numpy structured array.

When I select rows with a slice, repack_fields works, but when I select them with a range it does not. Indexing with a range before calling repack_fields appears to allocate memory for the full record size of the original array.


Below is an example where I limit the available memory in order to reproduce the error I am observing in my use case.

import numpy as np
from numpy.lib.recfunctions import repack_fields
import resource

resource.setrlimit(resource.RLIMIT_AS, (int(1.5e9), int(1.5e9)))

H = np.zeros(100, dtype=[('f1', int), ('f2', int), ('large', float, 1000000)])

print('Using slicing: ')
repack_fields(H[['f1', 'f2']][0:50])
print('Using range: ')
repack_fields(H[['f1', 'f2']][range(0, 50)])

produces the output:

Using slicing: 
Using range: 
Traceback (most recent call last):
  File "", line 12, in <module>
    repack_fields(H[['f1', 'f2']][range(0, 50)])
MemoryError: Unable to allocate 381. MiB for an array with shape (50,) and data type {'names':['f1','f2'], 'formats':['<i8','<i8'], 'offsets':[0,8], 'itemsize':8000016}


  1. Why is the behavior of range(0, 50) different than 0:50? (A list also doesn't work.) I know in the above example, one could repack the fields first, and then reference the rows. (That is, repack_fields(H[['f1', 'f2']])[range(0, 50)] works.) But I don't want to have to know whether it is better to get rows first or fields first.

  2. What is the correct way to take a subset of rows/fields from a large numpy structured array, even when the rows are not consecutive?


There is 1 answer

repack_fields(H[['f1', 'f2']][0:50])

Both [['f1', 'f2']] and [0:50] produce a view, one because it's a multifield index, and the other because it's a slice (basic indexing). So that doesn't require new memory. repack_fields makes a new array with space for just those 2 fields, and 50 records, and copies values from the view.
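A quick way to check this is sketched below. It is an illustration under assumptions, not part of the original example: the 'large' field is shrunk to 1000 floats so it runs in little memory, and np.shares_memory is used to tell a view from a copy.

```python
import numpy as np
from numpy.lib.recfunctions import repack_fields

# Small stand-in for the array in the question ('large' shrunk to 1000 floats).
H = np.zeros(100, dtype=[('f1', int), ('f2', int), ('large', float, 1000)])

# Multifield index followed by a slice: both steps are views on H.
view = H[['f1', 'f2']][0:50]
print(np.shares_memory(view, H))                    # view still points into H's buffer
print(view.dtype.itemsize == H.dtype.itemsize)      # itemsize is still the full record size

# repack_fields copies the values into a new, tightly packed array.
packed = repack_fields(view)
print(packed.dtype.itemsize, np.shares_memory(packed, H))
```

The packed result only holds the two integer fields per record, so its size no longer depends on 'large'.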

repack_fields(H[['f1', 'f2']][range(0, 50)])

Again the fields index is a view, referencing the whole structure, including the large field. [range...] is advanced indexing, making a copy that includes 50 records of 'large'.

Look at the error:

 Unable to allocate 381. MiB for an array with shape (50,) and data type
 {'names':['f1','f2'], 'formats':['<i8','<i8'], 'offsets':[0,8], 'itemsize':8000016}

 In [336]: 50*8000016/1e6
 Out[336]: 400.0008

That is 400 MB, or about 381 MiB (400000800 / 2**20), which is the amount it's trying to allocate. The error occurs in the [range(0, 50)] indexing. It hasn't gotten to the repack yet.
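The same effect can be seen without hitting a memory limit. The sketch below again assumes a shrunken 'large' field (1000 floats) purely for illustration; it shows that the copy made by the range index keeps the full record size.

```python
import numpy as np

# Small stand-in for the array in the question ('large' shrunk to 1000 floats).
H = np.zeros(100, dtype=[('f1', int), ('f2', int), ('large', float, 1000)])

fields = H[['f1', 'f2']]          # view: only two names, but padded to the full itemsize
copied = fields[range(0, 50)]     # advanced indexing: allocates a new array

print(np.shares_memory(copied, H))                 # the result no longer points into H
print(copied.dtype.itemsize == H.dtype.itemsize)   # each copied record is still full-size
print(copied.nbytes == 50 * H.dtype.itemsize)      # so the copy costs 50 full records
```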

So you have to understand two things.

  • Indexing with a slice makes a view that does not consume added memory. Indexing with range or list (or array) is 'advanced indexing' and makes a copy.

  • Multifield indexing makes a view. Even if the fields are a subset of the source, the itemsize is still the original size. The purpose of repack_fields is to make a new array with a new dtype and a new itemsize, containing just the values for the selected fields.
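Given those two points, one possible way to take arbitrary rows and fields without ever copying the large field is to allocate the packed result first and fill it one field at a time. This is only a sketch: take_rows_fields is a hypothetical helper name, not a NumPy function, and it assumes rows is a sequence of integer indices.

```python
import numpy as np
from numpy.lib.recfunctions import repack_fields

def take_rows_fields(arr, rows, names):
    """Hypothetical helper: copy the given rows and fields into a packed array."""
    rows = np.asarray(rows, dtype=np.intp)
    # arr[names] is a view, so reading its dtype copies no data;
    # repack_fields on a dtype returns the packed dtype without padding.
    packed_dtype = repack_fields(arr[names].dtype)
    out = np.empty(rows.shape, dtype=packed_dtype)
    for name in names:
        # arr[name] is a single-field view; indexing it with rows
        # copies only that field's values for the selected rows.
        out[name] = arr[name][rows]
    return out

# Small stand-in for the array in the question ('large' shrunk to 1000 floats).
H = np.zeros(100, dtype=[('f1', int), ('f2', int), ('large', float, 1000)])
H['f1'] = np.arange(100)
H['f2'] = np.arange(100) * 2

sub = take_rows_fields(H, [3, 10, 42], ['f1', 'f2'])
print(sub)
```

The order of row and field selection no longer matters to the caller, because each temporary copy is the size of one field for the selected rows.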