I have a bunch of NetCDF (`.nc`) files (ERA5 dataset) that I'm reading in Python through `xarray` and `rioxarray`. They end up as arrays of `float32` (4 bytes) in memory. However, on disk they are stored as `short` (2 bytes):
```
$ ncdump -h file.nc
...
        short u100(time, latitude, longitude) ;
                u100:scale_factor = 0.000895262699529722 ;
                u100:add_offset = 2.29252111865024 ;
                u100:_FillValue = -32767s ;
                u100:missing_value = -32767s ;
...
```
Apparently `xarray` automatically applies the offset and scale factor to convert these integers back into floats while reading the NetCDF file.
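For example (a quick check against one of the files above):

```python
import xarray as xr

ds = xr.open_dataset("file.nc")  # mask_and_scale=True is the default

print(ds["u100"].dtype)     # float32: the shorts were decoded on read
print(ds["u100"].encoding)  # the original dtype, scale_factor and add_offset live here
```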
Now I'm rechunking these and storing them as zarr, so I can efficiently access an entire time series at a single geographical location. However, the zarr store ends up at almost twice the size of the original NetCDFs, because the data remain stored as floats. Since the dataset is about a terabyte in its original form, bandwidth and storage considerations matter, and the extra size buys us nothing: the incoming data only had 16 bits of precision to begin with.
I know I could just manually convert the data back to shorts on the way into zarr and back to floats on the way out, but that's tedious and error-prone, even if I wrap it in helpers so the conversion happens automatically.
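A sketch of what that manual roundtrip would look like (the helper names are mine; the scale, offset and fill values are copied from the header above):

```python
import numpy as np

SCALE = 0.000895262699529722
OFFSET = 2.29252111865024
FILL = -32767

def pack(values: np.ndarray) -> np.ndarray:
    """float -> int16, mapping NaN to the fill value."""
    packed = np.round((values - OFFSET) / SCALE)
    return np.where(np.isnan(values), FILL, packed).astype(np.int16)

def unpack(packed: np.ndarray) -> np.ndarray:
    """int16 -> float32, mapping the fill value back to NaN."""
    values = packed * SCALE + OFFSET
    return np.where(packed == FILL, np.nan, values).astype(np.float32)
```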
Is there a way to do this transparently, the way it seems to happen with NetCDF?
I had been writing with the `zarr` package directly, which doesn't seem to support this. But `xarray` does, through its `encoding` argument! The zarr on disk ends up with the right format, scaling and attributes:
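Something like this (a sketch; the chunk sizes are placeholders I picked for time-series access, and the encoding values come from the ncdump header above):

```python
import xarray as xr

ds = xr.open_dataset("file.nc")

# Rechunk for time-series access: the whole time axis in one chunk,
# small spatial tiles (tile size is a placeholder).
ds = ds.chunk({"time": -1, "latitude": 16, "longitude": 16})

# Ask xarray to pack the floats back into shorts on the way to disk.
encoding = {
    "u100": {
        "dtype": "int16",
        "scale_factor": 0.000895262699529722,
        "add_offset": 2.29252111865024,
        "_FillValue": -32767,
    }
}
ds.to_zarr("file.zarr", encoding=encoding)
```

Inspecting the store directly confirms the on-disk format:

```python
import zarr

store = zarr.open("file.zarr")
print(store["u100"].dtype)        # int16
print(dict(store["u100"].attrs))  # scale_factor, add_offset, _FillValue
```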
When opening the dataset, we have to use `xarray` as well, and pass `mask_and_scale=True` to apply the scaling:
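For example (`mask_and_scale=True` is in fact the default, so it's spelled out here only for clarity):

```python
import xarray as xr

ds = xr.open_zarr("file.zarr", mask_and_scale=True)

# The shorts are unpacked back into floats on read,
# and fill values come back as NaN.
print(ds["u100"].dtype)  # float32
```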