xarray with masked arrays while preserving integer dtypes

2.2k views Asked by At

Currently, my code heavily uses structured masked arrays with multidimensional dtypes, with dozens of fields and item sizes of many kilobytes. It appears that xarray could be a great alternative, but when I try to pass it a masked array, it changes its dtype to float:

In [137]: x = arange(30, dtype="i1").reshape(3, 10)

In [138]: xr.Dataset({"count": (["x", "y"], ma.masked_where(x%5>3, x))}, coords={"x": range(3), "y":
     ...: range(10)})
Out[138]:
<xarray.Dataset>
Dimensions:  (x: 3, y: 10)
Coordinates:
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9
  * x        (x) int64 0 1 2
Data variables:
    count    (x, y) float64 0.0 1.0 2.0 3.0 nan 5.0 6.0 7.0 8.0 nan 10.0 ...

This is undesirable for me, because (1) the memory consumption of my dataset will explode (it is already large), and (2) many of my integer-dtypes are bit fields which must not be represented as floats. Although an int32 bitfield can be losslessly represented as a float64, it's ugly and error-prone to go back and forth.

Is it possible to use xarray.Dataset with masked arrays while preserving integer dtypes?


Edit: It appears the problem occurs in _maybe_promote. See also github issue.

1

There are 1 answers

0
shoyer On BEST ANSWER

Unfortunately, xarray does not support masked arrays or any form of integer dtypes with missing values. The reasons for this choice are the same as those why pandas does not (currently) support integer NAs, as described the pandas docs under Cavaets and Gotchas. We would need an integer dtype that supports missing values for NumPy arrays, which unfortunately does not exist.

I agree that this is not a very satisfactory solution for images with missing values, though in many cases I have found it suffices to either work with non-masked integer data, converting to float (and masking missing values) only when necessary for arithmetic (e.g., making use of .fillna()).

With regards to memory usage, I recommend trying xarray with dask, which allows for performing most array operations in a streaming or distributed fashion.