Python: How to optimize loading Zarr files/groups from GCS using xarray?


I have a database of Zarr groups on GCS, and every time I try to open one with xarray it takes over 5 seconds, regardless of its size.

I have tried saving the files with different chunking parameters and compression libraries:

# COMPRESSOR is a numcodecs compressor instance defined elsewhere (e.g. Blosc);
# ds is the dataset being written.
def _apply_zarr_encoding(ds):
    encoding = {}
    dtype = {v: "int32" for v in ds.data_vars}

    for data_var in ds.data_vars:
        # Pack each variable as scaled int32 with an explicit fill value.
        encoding[data_var] = {
            "compressor": COMPRESSOR,
            "_FillValue": -32767,
            "dtype": dtype.get(data_var, "int32"),
            "scale_factor": 1e-3,
        }
        # Clear any encoding inherited from the source so only the explicit
        # encoding above is applied on write.
        ds[data_var].encoding = {}

    return encoding

encoding = _apply_zarr_encoding(ds)
ds.to_zarr('gs://data.zarr', consolidated=True, encoding=encoding, mode="w", zarr_version=2)
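
One thing I checked along the way (a minimal sketch, assuming gcsfs is installed and the bucket path matches the store above) is whether the consolidated .zmetadata object was actually written; if it is missing, open_zarr has to list and fetch each array's metadata over the network:

import gcsfs

fs = gcsfs.GCSFileSystem()
# Should print True if consolidated metadata was written alongside the store.
print(fs.exists("gs://data.zarr/.zmetadata"))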


%timeit xr.open_zarr('gs://data.zarr', chunks="auto", overwrite_encoded_chunks=True)
5.63 s ± 546 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

When the same files are read from local disk, opening them takes almost no time, e.g.:

%timeit xr.open_zarr('data.zarr', chunks="auto", overwrite_encoded_chunks=True)
9.05 ms ± 662 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there a way to optimize this through compression, encoding, or consolidation, or is there something I'm not doing correctly?
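
For reference, here is a minimal sketch of how I understand opening is supposed to work (assuming the same gs://data.zarr store): going through an explicit gcsfs mapper with consolidated=True so xarray reads the single .zmetadata object instead of probing every group and array over the network, and reusing one filesystem instance so each open does not re-authenticate:

import gcsfs
import xarray as xr

# One authenticated filesystem/session, reused across opens.
fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("gs://data.zarr")

# consolidated=True makes xarray read the single .zmetadata object
# rather than issuing one request per array/group.
ds = xr.open_zarr(store, consolidated=True, chunks="auto")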
