Converting numpy structured array subset to numpy array without copy

784 views Asked by At

Suppose I have the following numpy structured array:

In [250]: x
Out[250]: 
array([(22, 2, -1000000000, 2000), (22, 2, 400, 2000),
       (22, 2, 804846, 2000), (44, 2, 800, 4000), (55, 5, 900, 5000),
       (55, 5, 1000, 5000), (55, 5, 8900, 5000), (55, 5, 11400, 5000),
       (33, 3, 14500, 3000), (33, 3, 40550, 3000), (33, 3, 40990, 3000),
       (33, 3, 44400, 3000)], 
       dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])

I am trying to modify a subset of the above array to a regular numpy array. It is essential for my application that no copies are created (only views).

Fields are retrieved from the above structured array by using the following function:

def fields_view(array, fields):
    return array.getfield(numpy.dtype(
        {name: array.dtype.fields[name] for name in fields}
    ))

If I am interested in fields 'f2' and 'f3', I would do the following:

In [251]: y=fields_view(x,['f2','f3'])
In [252]: y
Out [252]:
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0),
       (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0),
       (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)], 
       dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})

There is a way to directly get an ndarray from the 'f2' and 'f3' fields of the original structured array. However, for my application, it is necessary to build this intermediary structured array as this data subset is an attribute of a class.

I can't convert the intermediary structured array to a regular numpy array without doing a copy.

In [253]: y.view(('<f4', len(y.dtype.names)))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-f8fc3a40fd1b> in <module>()
----> 1 y.view(('<f4', len(y.dtype.names)))

ValueError: new type not compatible with array.

This function can also be used to convert a record array to an ndarray:

def recarr_to_ndarr(x,typ):

    fields = x.dtype.names
    shape = x.shape + (len(fields),)
    offsets = [x.dtype.fields[name][1] for name in fields]
    assert not any(np.diff(offsets, n=2))
    strides = x.strides + (offsets[1] - offsets[0],)
    y = np.ndarray(shape=shape, dtype=typ, buffer=x,
               offset=offsets[0], strides=strides)
    return y

However, I get the following error:

In [254]: recarr_to_ndarr(y,'<f4')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-65-2ebda2a39e9f> in <module>()
----> 1 recarr_to_ndarr(y,'<f4')

<ipython-input-62-8a9eea8e7512> in recarr_to_ndarr(x, typ)
      8     strides = x.strides + (offsets[1] - offsets[0],)
      9     y = np.ndarray(shape=shape, dtype=typ, buffer=x,
---> 10                offset=offsets[0], strides=strides)
     11     return y
     12 

TypeError: expected a single-segment buffer object

The function works fine if I create a copy:

In [255]: recarr_to_ndarr(np.array(y),'<f4')
Out[255]: 
array([[  2.00000000e+00,  -1.00000000e+09],
       [  2.00000000e+00,   4.00000000e+02],
       [  2.00000000e+00,   8.04846000e+05],
       [  2.00000000e+00,   8.00000000e+02],
       [  5.00000000e+00,   9.00000000e+02],
       [  5.00000000e+00,   1.00000000e+03],
       [  5.00000000e+00,   8.90000000e+03],
       [  5.00000000e+00,   1.14000000e+04],
       [  3.00000000e+00,   1.45000000e+04],
       [  3.00000000e+00,   4.05500000e+04],
       [  3.00000000e+00,   4.09900000e+04],
       [  3.00000000e+00,   4.44000000e+04]], dtype=float32)

There seems to be no difference between the two arrays:

In [66]: y
Out[66]: 
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0),
       (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0),
       (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)], 
      dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})

In [67]: np.array(y)
Out[67]: 
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0),
       (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0),
       (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)], 
      dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})
2

There are 2 answers

0
snowleopard On

hpaulj was right in saying that the problem is that the subset of the structured array is not contiguous. Interestingly, I figured out a way to make the array subset contiguous with the following function:

  def view_fields(a, fields):
        """
        `a` must be a numpy structured array.
        `names` is the collection of field names to keep.

        Returns a view of the array `a` (not a copy).
        """
        dt = a.dtype
        formats = [dt.fields[name][0] for name in fields]
        offsets = [dt.fields[name][1] for name in fields]
        itemsize = a.dtype.itemsize
        newdt = np.dtype(dict(names=fields,
                              formats=formats,
                              offsets=offsets,
                              itemsize=itemsize))
        b = a.view(newdt)
        return b

In [5]: view_fields(x,['f2','f3']).flags
Out[5]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

The old function:

In [10]: fields_view(x,['f2','f3']).flags
Out[10]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
1
hpaulj On

This answer is a bit long and rambling. I started with what I knew from previous work on taking array views, and then tried to relate that to your functions.

================

In your case, all fields are 4 bytes long, both floats and ints. I can then view it as all ints or all floats:

In [1431]: x
Out[1431]: 
array([(22, 2.0, -1000000000.0, 2000), (22, 2.0, 400.0, 2000),
       (22, 2.0, 804846.0, 2000), (44, 2.0, 800.0, 4000),
       (55, 5.0, 900.0, 5000), (55, 5.0, 1000.0, 5000),
       (55, 5.0, 8900.0, 5000), (55, 5.0, 11400.0, 5000),
       (33, 3.0, 14500.0, 3000), (33, 3.0, 40550.0, 3000),
       (33, 3.0, 40990.0, 3000), (33, 3.0, 44400.0, 3000)], 
      dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
In [1432]: x.view('i4')
Out[1432]: 
array([        22, 1073741824, -831624408,       2000,         22,
       1073741824, 1137180672,       2000,         22, 1073741824,
       1229225696,       2000,         44, 1073741824, 1145569280,
      ....     3000])
In [1433]: x.view('f4')
Out[1433]: 
array([  3.08285662e-44,   2.00000000e+00,  -1.00000000e+09,
         2.80259693e-42,   3.08285662e-44,   2.00000000e+00,
  ....   4.20389539e-42], dtype=float32)

This view is 1d. I can reshape and slice the 2 float columns

In [1434]: x.shape
Out[1434]: (12,)
In [1435]: x.view('f4').reshape(12,-1)
Out[1435]: 
array([[  3.08285662e-44,   2.00000000e+00,  -1.00000000e+09,
          2.80259693e-42],
       [  3.08285662e-44,   2.00000000e+00,   4.00000000e+02,
          2.80259693e-42],
         ...
       [  4.62428493e-44,   3.00000000e+00,   4.44000000e+04,
          4.20389539e-42]], dtype=float32)

In [1437]: x.view('f4').reshape(12,-1)[:,1:3]
Out[1437]: 
array([[  2.00000000e+00,  -1.00000000e+09],
       [  2.00000000e+00,   4.00000000e+02],
       [  2.00000000e+00,   8.04846000e+05],
       [  2.00000000e+00,   8.00000000e+02],
       ...
       [  3.00000000e+00,   4.44000000e+04]], dtype=float32)

That this is a view can be verified by doing a bit of inplace math, and seeing the results in x:

In [1439]: y=x.view('f4').reshape(12,-1)[:,1:3]
In [1440]: y[:,0] += .5
In [1441]: y
Out[1441]: 
array([[  2.50000000e+00,  -1.00000000e+09],
       [  2.50000000e+00,   4.00000000e+02],
       ...
       [  3.50000000e+00,   4.44000000e+04]], dtype=float32)
In [1442]: x
Out[1442]: 
array([(22, 2.5, -1000000000.0, 2000), (22, 2.5, 400.0, 2000),
       (22, 2.5, 804846.0, 2000), (44, 2.5, 800.0, 4000),
       (55, 5.5, 900.0, 5000), (55, 5.5, 1000.0, 5000),
       (55, 5.5, 8900.0, 5000), (55, 5.5, 11400.0, 5000),
       (33, 3.5, 14500.0, 3000), (33, 3.5, 40550.0, 3000),
       (33, 3.5, 40990.0, 3000), (33, 3.5, 44400.0, 3000)], 
      dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])

If the field sizes differed this might be impossible. For example if the floats were 8 bytes. The key is picturing how the structured data is stored, and imagining whether that can be viewed as a simple dtype of multiple columns. And field choice has to be equivalent to a basic slice. Working with ['f1','f4'] would be equivalent to advanced indexing with [:,[0,3], which has to be a copy.

==========

The 'direct' field indexing is:

z = x[['f2','f3']].view('f4').reshape(12,-1)
z -= .5

modifies z but with a futurewarning. Also it does not modify x; z has become a copy. I can also see this by looking at z.__array_interface__['data'], the data buffer location (and comparing with that of x and y).

=================

Your fields_view does create a structured view:

In [1480]: w=fields_view(x,['f2','f3'])
In [1481]: w.__array_interface__['data']
Out[1481]: (151950184, False)
In [1482]: x.__array_interface__['data']
Out[1482]: (151950184, False)

which can be used to modify x, w['f2'] -= .5. So it is more versatile than the 'direct' x[['f2','f3']].

The w dtype is

dtype({'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})

Adding print(shape, typ, offsets, strides) to your recarr_to_ndarr, I get (py3)

In [1499]: recarr_to_ndarr(w,'<f4')
(12, 2) <f4 [4, 8] (16, 4)
....
ValueError: ndarray is not contiguous

In [1500]: np.ndarray(shape=(12,2), dtype='<f4', buffer=w.data, offset=4, strides=(16,4))
...
BufferError: memoryview: underlying buffer is not contiguous

That contiguous problem must be refering to the values shown in w.flags:

In [1502]: w.flags
Out[1502]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  ....

It's interesting that w.dtype.descr converts the 'offsets' into a unnamed field:

In [1506]: w.__array_interface__
Out[1506]: 
{'data': (151950184, False),
 'descr': [('', '|V4'), ('f2', '<f4'), ('f3', '<f4')],
 'shape': (12,),
 'strides': (16,),
 'typestr': '|V12',
 'version': 3}

One way or other, w has a non-contiguous data buffer, which can't be used to create a new array. Flattened, the data buffer looks something like

xoox|xoox|xoox|...   
# x 4 bytes we want to skip
# o 4 bytes we want to use
# | invisible bdry between records in x

The y I constructed above has:

In [1511]: y.__array_interface__
Out[1511]: 
{'data': (151950188, False),
 'descr': [('', '<f4')],
 'shape': (12, 2),
 'strides': (16, 4),
 'typestr': '<f4',
 'version': 3}

So it accesses the o bytes with a 4 byte offset, and then (16,4) strides, and (12,2) shape.

If I modify your ndarray call to use the original x.data, it works:

In [1514]: xx=np.ndarray(shape=(12,2), dtype='<f4', buffer=x.data, offset=4, strides=(16,4))
In [1515]: xx
Out[1515]: 
array([[  2.00000000e+00,  -1.00000000e+09],
       [  2.00000000e+00,   4.00000000e+02],
           ....
       [  3.00000000e+00,   4.44000000e+04]], dtype=float32)

with the same array_interface as my y:

In [1516]: xx.__array_interface__
Out[1516]: 
{'data': (151950188, False),
 'descr': [('', '<f4')],
 'shape': (12, 2),
 'strides': (16, 4),
 'typestr': '<f4',
 'version': 3}