Increase Dimensionality of an xarray from coordinates


Say I have the following 2d-array

>>> import numpy as np
>>> budgets = np.array([
       [np.nan, 450.],
       [500.  , 100.],
       [np.nan, 900.],
    ])

whose values are positioned like so

>>> coords = [
        ('name' , ['Jack_teen' , 'John_adult', 'John_teen']), # over rows
        ('hobby', ['books', 'bicyle']),                       # over columns
    ]

Using xarray I can create a 2d labeled array, doing

>>> import xarray as xr
>>> x = xr.DataArray(budgets, coords=coords)

So when John was a teenager he did not like books, as his budget at that time shows:

>>> x.sel(name='John_teen', hobby='books')
<xarray.DataArray ()>
array(nan)
Coordinates:
    name     |S10 'John_teen'
    hobby    |S6 'books'

This changed with age:

>>> x.sel(name='John_adult', hobby='books')
<xarray.DataArray ()>
array(500.0)
Coordinates:
    name     |S10 'John_adult'
    hobby    |S6 'books'


My question:

How would you turn this 2D labeled array into a 3D labeled array with a new dimension called age (whose coordinates would thus be ['adult','teen']), while simplifying the coordinates of the dimension name?

Note that the coordinates of name always follow the pattern NAME_AGE, with a separating underscore. The starting point is, of course, the object x.

Is there a built-in xarray way to do this? Or, failing that, what is the fastest/cheapest approach?


There are 2 answers

Answer by Michael Delgado (best answer)

Since we eventually want a dimension 'name', I'll rename the current 'name' to 'name_age':

In [5]: x = x.rename({'name': 'name_age'})

With pandas imported as pd, we can construct a MultiIndex directly from the coordinate values and assign it as a stacked DataArray coordinate:

In [6]: x.coords['name_age'] = pd.MultiIndex.from_tuples(
   ...:     [tuple(s.split('_')) for s in x.coords['name_age'].values],
   ...:     names=['name', 'age'])

In [7]: x
Out[7]:
<xarray.DataArray (name_age: 3, hobby: 2)>
array([[  nan,  450.],
       [ 500.,  100.],
       [  nan,  900.]])
Coordinates:
  * name_age  (name_age) MultiIndex
  - name      (name_age) object 'Jack' 'John' 'John'
  - age       (name_age) object 'teen' 'adult' 'teen'
  * hobby     (hobby) |S6 'books' 'bicyle'

If you then unstack 'name_age', you'll get the 3-D DataArray you want:

In [8]: x.unstack('name_age')
Out[8]:
<xarray.DataArray (hobby: 2, name: 2, age: 2)>
array([[[  nan,   nan],
        [ 500.,   nan]],

       [[  nan,  450.],
        [ 100.,  900.]]])
Coordinates:
  * hobby    (hobby) |S6 'books' 'bicyle'
  * name     (name) object 'Jack' 'John'
  * age      (age) object 'adult' 'teen'
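
For reference, the same rename / split / unstack idea can also be written with set_index instead of assigning a pandas MultiIndex to the coordinate directly. This is only a sketch under the assumptions of the question (every label is exactly NAME_AGE with a single underscore); the variable names stacked and cube are mine, not part of the original answer.

```python
import numpy as np
import xarray as xr

# Data from the question
budgets = np.array([
    [np.nan, 450.],
    [500., 100.],
    [np.nan, 900.],
])
x = xr.DataArray(budgets, coords=[
    ('name', ['Jack_teen', 'John_adult', 'John_teen']),
    ('hobby', ['books', 'bicyle']),
])

# Free the dimension name 'name' for the new, simplified coordinate
stacked = x.rename({'name': 'name_age'})

# Split every 'NAME_AGE' label into its two parts
names, ages = zip(*(label.split('_') for label in stacked['name_age'].values))

# Attach both parts as coordinates along 'name_age', then let xarray
# combine them into a MultiIndex on that dimension
stacked = stacked.assign_coords(
    name=('name_age', list(names)),
    age=('name_age', list(ages)),
).set_index(name_age=['name', 'age'])

# Unstacking the MultiIndex yields one dimension per level
cube = stacked.unstack('name_age')
```

Combinations that were absent from the original labels (here Jack as an adult) are filled with NaN, just as in the output above.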
Answer by keepAlive

This quick-and-dirty approach is what I am going to use for now, although it surely cannot be the best solution.

First, let us turn this 2D labeled array into a dict keyed by tuples.

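dict_ = {}
for hobby in x['hobby'].data:
    for name_age in x['name'].data:
        # Split 'NAME_AGE' into its two parts
        name, age = name_age.split('_')
        # .item() stores a plain Python float rather than a 0-d array
        dict_[(hobby, name, age)] = x.sel(name=name_age, hobby=hobby).item()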

The space in which these values live is spanned by the following list of dimensions: ['hobby', 'name', 'age']. Let us assign it to a variable

>>> space = ['hobby', 'name', 'age']

Then, one can use the from_tuples method of pandas' MultiIndex to build the index that locates our data

>>> import pandas as pd
>>> index = pd.MultiIndex.from_tuples(list(dict_), names=space)

And finally,

>>> hyper_x = pd.Series(dict_, index=index).to_xarray()

Thus

>>> hyper_x.sel(name='John', age='teen', hobby='books')
<xarray.DataArray ()>
array(nan)
Coordinates:
    hobby    |S5 'books'
    name     |S4 'John'
    age      |S4 'teen'
>>> hyper_x.sel(name='John', age='adult', hobby='books')
<xarray.DataArray ()>
array(500.0)
Coordinates:
    hobby    |S5 'books'
    name     |S4 'John'
    age      |S5 'adult'


The advantage of this approach is that it generalizes easily to any number of dimensions, for x as well as for hyper_x, and it can also be used to decrease the dimensionality.
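
As an illustration of going the other way (decreasing the dimensionality), here is one possible sketch that uses xarray's stack rather than the dict approach above. It rebuilds the 3D array by hand so that it runs on its own; the names cube, labels and flat are hypothetical, and dropping the all-NaN combinations at the end is my own choice, not something the question requires.

```python
import numpy as np
import xarray as xr

# 3D array equivalent to hyper_x, dimensions (hobby, name, age)
cube = xr.DataArray(
    np.array([
        [[np.nan, np.nan], [500., np.nan]],   # books
        [[np.nan, 450.], [100., 900.]],       # bicyle
    ]),
    coords=[
        ('hobby', ['books', 'bicyle']),
        ('name', ['Jack', 'John']),
        ('age', ['adult', 'teen']),
    ],
)

# Collapse 'name' and 'age' into a single MultiIndexed dimension
flat = cube.stack(name_age=('name', 'age'))

# Build 'NAME_AGE' labels from the (name, age) tuples of the MultiIndex
labels = ['_'.join(pair) for pair in flat.indexes['name_age']]

# Replace the MultiIndex with the plain string labels
flat = (
    flat.reset_index('name_age', drop=True)
        .assign_coords(name_age=labels)
        .rename({'name_age': 'name'})
)

# Optional: discard combinations that hold no data at all (Jack_adult)
flat = flat.dropna('name', how='all')
```

Note that dropna with how='all' cannot distinguish a combination that never existed from one whose budgets are all genuinely missing, so skip that last step if the difference matters.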