Converting from awkward arrays to torch tensors


Note: I am using awkward version 1.10.3.

So, the general overview is that I have a set of data that is in awkward arrays, and I want to be able to pass this data to a simple feedforward pytorch model. I believe that pytorch doesn't natively handle awkward arrays, so I am planning on converting the data to either torch tensors or numpy arrays before passing it to the model. I should also note that whilst the data is stored in awkward arrays, at this point the data is not jagged.

Here is an example of the input data and of what I am looking for:

import awkward as ak
import numpy as np
import torch


arr = ak.Array({"MET_pt": [0.0, 100.0, 20.0, 30.0, 4.0],
                "MET_phi": [0, 0.1, 0.2, 0.3, 0.4],
                "class": [0, 1, 0, 1, 0]})

# These are my input features
x = arr[['MET_pt', 'MET_phi']]
# These are my class labels
y = arr['class']
#
## Here would be the code converting to torch tensors 
# 
x_torch = torch.tensor([[0, 0], [100, 0.1], [20, 0.2], [30, 0.3], [4, 0.4]])

y_torch = torch.tensor([0, 1, 0, 1, 0])

However, I cannot find an easy way to convert x from the awkward array to a torch tensor. I can easily convert y to a torch tensor by simply doing:

torch.tensor(y)
> tensor([0, 1, 0, 1, 0])

But I am unable to do this for the x array:

torch.tensor(x)
> TypeError: object of type 'Record' has no len()

This led me to the idea of converting to a numpy array first:

torch.tensor(ak.to_numpy(x))
> TypeError: can't convert np.ndarray of type numpy.void. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool

But as you can see this doesn't work either.

I think the problem lies in the fact that the ak.to_numpy() function converts the x array to:

ak.to_numpy(x)
> array([(  0., 0. ), (100., 0.1), ( 20., 0.2), ( 30., 0.3), (  4., 0.4)],
      dtype=[('MET_pt', '<f8'), ('MET_phi', '<f8')])

whereas I want it to convert like this:

ak.to_numpy(x)

> [[0, 0], [100, 0.1], [20, 0.2], [30, 0.3], [4, 0.4]]

Is there any way of converting an N-dimensional non-jagged awkward array such as x into the format shown immediately above? Or is there a smarter way to convert directly to torch tensors?

Sorry if this is a stupid question! Thanks!


There are 2 answers

Answer by Muhammed Yunus:

One approach is to convert it to a list of dictionaries using to_list(), and then read out the numerical values. Converting it directly using to_numpy() seems to result in the keys being tied up in the dtypes, which is why I opted for to_list().

#Read out the values from each dictionary entry in arr.to_list()
arr_dicts = arr.to_list()
arr_dict_values = [list(arr_dict.values()) for arr_dict in arr_dicts]

#To numpy
arr_np = np.array(arr_dict_values)

#To float32 tensor. Could supply "arr_np" or "arr_dict_values" here.
arr_t = torch.tensor(arr_dict_values).float()

#Slice out X and y tensors
x_t = arr_t[:, 0:2]
y_t = arr_t[:, 2]
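
A quick sanity check of the result (a sketch assuming the arr from the question, not part of the original answer). Note that y_t comes out as float32 here; for a classification loss such as torch.nn.CrossEntropyLoss you would cast the labels back to integers.

#Shapes: a (5, 2) feature tensor and a length-5 label tensor
print(x_t.shape, y_t.shape)   # torch.Size([5, 2]) torch.Size([5])

#Integer class labels, e.g. for a classification loss
y_t = y_t.long()
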
Answer by Jim Pivarski:

You've already converted y; the problem with x is that it's not a purely numerical array. You could convert x to a NumPy structured array,

>>> ak.to_numpy(x)
array([(  0., 0. ), (100., 0.1), ( 20., 0.2), ( 30., 0.3), (  4., 0.4)],
      dtype=[('MET_pt', '<f8'), ('MET_phi', '<f8')])

which can then be viewed and reshaped to get the array that you want:

>>> ak.to_numpy(x).view("<f8").reshape(-1, 2)
array([[  0. ,   0. ],
       [100. ,   0.1],
       [ 20. ,   0.2],
       [ 30. ,   0.3],
       [  4. ,   0.4]])

but this relies strongly on the fact that all of the fields are the same type, "<f8" (doubles). If you had a mix of floating-point numbers and integers (charge?), or numbers of different bit-widths, then this wouldn't work.
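
For instance, with a hypothetical record that mixes doubles and integers (the charge field below is illustrative, not part of the original question), the structured dtype is no longer uniform, so a single "<f8" view can no longer reinterpret the buffer correctly:

>>> mixed = ak.Array({"MET_pt": [0.0, 100.0], "charge": [1, -1]})
>>> ak.to_numpy(mixed).dtype
dtype([('MET_pt', '<f8'), ('charge', '<i8')])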

Here's a better method: break up x (or the original arr) into its two fields, first.

>>> x["MET_pt"]
<Array [0, 100, 20, 30, 4] type='5 * float64'>
>>> x["MET_phi"]
<Array [0, 0.1, 0.2, 0.3, 0.4] type='5 * float64'>

What you want to do is interleave these so that you get one value from "MET_pt", followed by one value from "MET_phi", then the next value from "MET_pt", and so on. If you first put the values in length-1 lists, which is a reshaping (can be done in Awkward or NumPy, with the same syntax),

>>> x["MET_pt", :, np.newaxis]
<Array [[0], [100], [20], [30], [4]] type='5 * 1 * float64'>
>>> x["MET_phi", :, np.newaxis]
<Array [[0], [0.1], [0.2], [0.3], [0.4]] type='5 * 1 * float64'>

then what you want is to concatenate each of these length-1 lists from the first array with the corresponding length-1 list from the second array. That is, you want to concatenate them not at axis=0, but at axis=1, the first level deep of lists (see ak.concatenate or np.concatenate).

>>> np.concatenate((x["MET_pt", :, np.newaxis], x["MET_phi", :, np.newaxis]), axis=1)
<Array [[0, 0], [100, 0.1], ..., [30, ...], [4, 0.4]] type='5 * 2 * float64'>

Now you can pass it to Torch.

>>> torch.tensor(np.concatenate((
...     x["MET_pt", :, np.newaxis], x["MET_phi", :, np.newaxis]
... ), axis=1))
tensor([[  0.0000,   0.0000],
        [100.0000,   0.1000],
        [ 20.0000,   0.2000],
        [ 30.0000,   0.3000],
        [  4.0000,   0.4000]], dtype=torch.float64)
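
One follow-up worth noting (an addition, assuming the model's parameters are in pytorch's default float32): since the awkward fields are float64, the resulting tensor is float64, so you may want to cast it before passing it to the model.

>>> torch.tensor(np.concatenate((
...     x["MET_pt", :, np.newaxis], x["MET_phi", :, np.newaxis]
... ), axis=1)).float().dtype
torch.float32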