Numpy structured array creation not working as intended when using another numpy array

I am trying to create a numpy structured array from other arrays in Python. However, this does not work as I would expect:

# this does what I want
In [3]: x = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i8'), ('bar', 'f4')])

In [4]: x['foo']
Out[4]: array([1, 3])

In [5]: x['foo'].shape
Out[5]: (2,)

# when creating the array from another array, the structure is different
In [6]:  y = np.array( np.array([(1, 2), (3, 4)]), dtype=[('foo', 'i8'), ('bar', 'f4')])

In [7]: y['foo'].shape
Out[7]: (2, 2)

# unpacking and packing into a list does not work either
In [8]:   z = np.array([zz for zz in np.array([(1, 2), (3, 4)])], dtype=[('foo', 'i8'), ('bar', 'f4')])

In [9]: z['foo'].shape
Out[9]: (2, 2)

So the structured array x does what I expect and want. But when I build it from another numpy array, which I need for my application, the structure is different: indexing a field returns a 2D array instead of the 1D array I get from x.

Unpacking the rows and packing them back into a list does not work either.

Unfortunately the documentation is not clear enough (at least for me) on how to do this. Cheers

There are 2 answers

chrslg On BEST ANSWER

There might be a cleverer way. (I have never been fond of ragged or structured arrays. I use numpy for good old monolithic arrays of uniformly typed data. When I need different types or fields, I fall back to other things like pandas. So, again, there are probably better ways.) But here, I am just translating your attempt into a working one:

import numpy as np
arr = np.array([(1, 2), (3, 4)])
y = np.array([tuple(zz) for zz in arr], dtype=[('foo', 'i8'), ('bar', 'f4')])

The idea is quite rudimentary: "if it works with tuples, let them have tuples" :D

But again, maybe there are better ideas

For example

y=np.empty((len(arr),), dtype=[('foo', 'i8'), ('bar', 'f4')])
for i,k in enumerate(y.dtype.names):
    y[k]=arr[:,i]

This also works, and is probably faster. There is still a pure Python loop, but it runs only over the fields, whereas the previous one runs over the rows. And usually you have way more rows than fields.

As for why it doesn't work from arrays: your wanted result is not a 2×2 2D array like your input array. It is a 1D array of 2 elements (shape (2,)), with each element being a structure. That is why it is still quite efficient: numpy can iterate through each field, with shape and strides, as efficiently as in any other array.

So your first line starts from a 1D list of tuples, from which you build a 1D array of "structures".

Your other attempts start from 2D arrays.
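To make the shape difference concrete, here is a small sketch reusing the dtype from the question. The variable names are only for illustration.

```python
import numpy as np

dt = np.dtype([('foo', 'i8'), ('bar', 'f4')])

# A list of tuples: each tuple becomes one record, so the result is 1D.
x = np.array([(1, 2), (3, 4)], dtype=dt)
print(x.shape)         # (2,)
print(x['foo'].shape)  # (2,)

# A 2D numeric array: each scalar becomes its own record, so the result stays 2D.
arr = np.array([(1, 2), (3, 4)])
y = np.array(arr, dtype=dt)
print(y.shape)         # (2, 2)

# Converting each row to a tuple restores the one-record-per-row reading.
z = np.array([tuple(row) for row in arr], dtype=dt)
print(z.shape)         # (2,)
```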

Timing

Edit: in the meantime I've tested some timings, and they confirm my first opinion: my second code is faster, under the hypothesis I made, that is, way more rows than fields. Starting from a 10000×2 array, the first code runs in 8 ms, while the second runs in 29 μs.
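A comparison along these lines can be reproduced roughly as follows. This is only a sketch: the function names `from_tuples` and `by_field` and the `np.arange` test input are my own choices for illustration, and the absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

dt = np.dtype([('foo', 'i8'), ('bar', 'f4')])
arr = np.arange(20000).reshape(10000, 2)  # assumed 10000x2 test input

def from_tuples(a):
    # Python-level loop over the rows.
    return np.array([tuple(row) for row in a], dtype=dt)

def by_field(a):
    # Python-level loop over the fields only.
    out = np.empty(len(a), dtype=dt)
    for i, name in enumerate(dt.names):
        out[name] = a[:, i]
    return out

# Both approaches should produce the same records.
same = bool((from_tuples(arr) == by_field(arr)).all())
print("same result:", same)

for fn in (from_tuples, by_field):
    seconds = timeit.timeit(lambda: fn(arr), number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e6:.1f} µs per call")
```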

Edit after hpaulj's answer

So, as I was supposing, there is indeed a smarter way. My Python wouldn't let me run his code directly, though, because it needs a proper dtype object, not the list [('foo', 'i8'), ('bar', 'f4')]. But that is easily solved:

import numpy.lib.recfunctions as rf
rf.unstructured_to_structured(arr, dtype=np.dtype([('foo', 'i8'), ('bar', 'f4')]))

Nevertheless, timing-wise, though almost as fast, that method seems (strangely) slower than my second one. In the same example, it takes 39 μs instead of the 29 μs of the "empty, then loop over fields" version.

Still, that is the same order of magnitude (compared to the 8 ms of building a list of tuples), and it might be better to use standard functions rather than reinvented wheels.

hpaulj On

There is a big structured array documentation page which you should be familiar with. I'll skip finding the link for you now.

Normally data to a structured array is provided as a list of tuples:

In [44]: x = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i8'), ('bar', 'f4')])

In [45]: x
Out[45]: array([(1, 2.), (3, 4.)], dtype=[('foo', '<i8'), ('bar', '<f4')])

In [46]: x.tolist()
Out[46]: [(1, 2.0), (3, 4.0)]

This input echoes the display, and the tolist format. Regular arrays display as lists of lists. The tuple notation has been chosen to clearly mark the records of a structured array.

When you try to make y, each of the elements of the array is 'replicated' to match the dtype:

In [47]: y = np.array( np.array([(1, 2), (3, 4)]), dtype=[('foo', 'i8'), ('bar', 'f4')])

In [48]: y
Out[48]: 
array([[(1, 1.), (2, 2.)],
       [(3, 3.), (4, 4.)]], dtype=[('foo', '<i8'), ('bar', '<f4')])

In [49]: y.shape
Out[49]: (2, 2)

Usually that's not what we want. Just to reiterate, the tolist of that inner array is not a list of tuples:

In [50]: np.array([(1, 2), (3, 4)]).tolist()
Out[50]: [[1, 2], [3, 4]]

view and astype don't work any better.
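A short sketch of what that looks like for this dtype. The input is forced to 'i8' here so the byte sizes are the same on every platform; that choice is mine, not part of the original question.

```python
import numpy as np

dt = np.dtype([('foo', 'i8'), ('bar', 'f4')])
arr = np.array([(1, 2), (3, 4)], dtype='i8')

# astype casts every scalar into its own record, just like np.array(arr, dtype=dt).
a = arr.astype(dt)
print(a.shape)  # (2, 2)

# view reinterprets raw bytes: each row is 16 bytes, but one record needs 12.
try:
    arr.view(dt)
except ValueError as exc:
    print("view failed:", exc)
```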

However, the structured array docs discuss a library of recfunctions. Most were written when recarrays were more common (now pandas has replaced a lot of that time series work).

But with some recent reworking of how structured arrays are 'sliced', especially for multiple fields, a couple of utility functions have been added:

In [52]: import numpy.lib.recfunctions as rf

In [53]: y = rf.unstructured_to_structured( np.array([(1, 2), (3, 4)]), dtype=[('foo', 'i8'), ('bar', 'f4')])

In [54]: y
Out[54]: array([(1, 2.), (3, 4.)], dtype=[('foo', '<i8'), ('bar', '<f4')])

Generally the rf functions create a target array with the desired compound dtype, and copy data to it, field by field. Normally the number of records is much larger than the number of fields, so this iterative copy is relatively efficient.
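The field-by-field idea can be sketched by hand as below. `to_structured` is a hypothetical helper written for illustration, not the actual recfunctions implementation; it assumes a 2D input with one column per field.

```python
import numpy as np
import numpy.lib.recfunctions as rf

def to_structured(arr, dtype):
    """Hypothetical helper: copy column i of a 2D array into field i."""
    dtype = np.dtype(dtype)
    out = np.empty(arr.shape[0], dtype=dtype)
    for i, name in enumerate(dtype.names):
        out[name] = arr[:, i]
    return out

dt = np.dtype([('foo', 'i8'), ('bar', 'f4')])
arr = np.array([(1, 2), (3, 4)])

manual = to_structured(arr, dt)
library = rf.unstructured_to_structured(arr, dtype=dt)
print(manual)
print((manual == library).all())
```

For real code the library function is probably the safer choice; the hand-written loop is only meant to show what kind of copy is involved.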

In any case, take the list-of-tuples specification seriously when working with structured arrays.