I have a database of Zarr groups on GCS, and every time I try to open one of them with xarray it takes over 5 seconds, regardless of size.
I have tried saving the files with different chunking parameters and compression libraries, but the open time does not improve. This is the encoding I currently apply before writing:
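For context, COMPRESSOR and ds are defined elsewhere in my code; roughly equivalent placeholder definitions (not my real compressor settings or data) would be:

import numpy as np
import xarray as xr
from numcodecs import Blosc

# Placeholder stand-ins only: the real COMPRESSOR and dataset differ in detail.
COMPRESSOR = Blosc(cname="zstd", clevel=3, shuffle=Blosc.SHUFFLE)
ds = xr.Dataset(
    {"temperature": (("time", "y", "x"), np.random.rand(10, 100, 100))},
    coords={"time": np.arange(10)},
)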
def _apply_zarr_encoding(ds):
    """Build a per-variable zarr encoding and clear any stale encoding on ds."""
    encoding = {}
    dtype = {v: "int32" for v in ds.data_vars}
    for data_var in ds.data_vars:
        encoding[data_var] = {
            "compressor": COMPRESSOR,
            "_FillValue": -32767,
        }
        # Store each variable as a scaled int32.
        encoding[data_var].update(
            {"dtype": dtype.get(data_var, "int32"), "scale_factor": 1e-3}
        )
        # Drop any encoding inherited from a previous read so it does not
        # conflict with the explicit encoding passed to to_zarr.
        ds[data_var].encoding = {}
    return encoding
encoding = _apply_zarr_encoding(ds)
ds.to_zarr('gs://data.zarr', consolidated=True, encoding=encoding, mode="w", zarr_version=2)
%timeit xr.open_zarr('gs://data.zarr', chunks="auto", overwrite_encoded_chunks=True)
5.63 s ± 546 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
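In case the gs:// URL handling inside open_zarr matters, the equivalent open through an explicit fsspec mapper with consolidated=True is something I can time as well (a sketch only, assuming gcsfs is installed):

import fsspec
import xarray as xr

# Same store, opened via an explicit mapper so the consolidated metadata
# is requested up front rather than relying on URL-based store creation.
mapper = fsspec.get_mapper("gs://data.zarr")
ds_remote = xr.open_zarr(mapper, consolidated=True, chunks="auto")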
When the same store is read from local disk, opening takes almost no time, e.g.
%timeit xr.open_zarr('data.zarr', chunks="auto", overwrite_encoded_chunks=True)
9.05 ms ± 662 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
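To check whether the time is dominated by metadata reads rather than data, the consolidated metadata can also be opened directly with zarr, bypassing xarray (sketch only, zarr v2 API assumed):

import fsspec
import zarr

# Time just the consolidated-metadata read to see whether the ~5 s is
# spent on .zmetadata / store access or somewhere inside open_zarr.
store = fsspec.get_mapper("gs://data.zarr")
%timeit zarr.open_consolidated(store)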
Is there a way to speed this up through compression, encoding, or consolidation that I'm not doing correctly?