Zarr: Inefficient Bounding-Box Selection on a Dataset with Only a TIME Dimension


I have the following dataset, which is a trajectory-type dataset. LATITUDE and LONGITUDE are defined as coordinates in xarray, BUT they aren't dimensions! All of the related questions I've come across on Stack Overflow and other forums have LATITUDE and LONGITUDE as dimensions.

Here's a brief overview of the dataset.

Dimensions:
    TIME: 120385119

Coordinates:
    DEPTH      (TIME) float64 dask.array<chunksize=(10000,), meta=np.ndarray>
    LATITUDE   (TIME) float64 dask.array<chunksize=(10000,), meta=np.ndarray>
    LONGITUDE  (TIME) float64 dask.array<chunksize=(10000,), meta=np.ndarray>
    TIME       (TIME) datetime64[ns] 2008-11-25T00:54:26.443568128 ...

Data variables: (61)

Indexes: (1)

Attributes: (43)

My issue is that this dataset, currently stored on S3 as a Zarr store, is horrendously slow when selecting data by a geographical bounding box. Selecting by TIME is fine, since TIME is a dimension.

In a perfect world, I'd like to achieve this:

%%time
import fsspec
import xarray as xr

# Remote Zarr dataset (public bucket, anonymous access)
url = 's3://imos-data-lab-optimised/zarr/target.zarr'

# Open lazily with dask chunks along TIME so the whole dataset is not pulled into RAM
ds = xr.open_zarr(fsspec.get_mapper(url, anon=True), consolidated=True, chunks={'TIME': 10000})

# Keep only the variables of interest
subset = ds[['PSAL', 'TEMP', 'DEPTH']]

# Bounding-box mask; LATITUDE and LONGITUDE are non-dimension coordinates along TIME
in_bbox = (ds.LONGITUDE >= 151) & (ds.LONGITUDE <= 152) & (ds.LATITUDE >= -34) & (ds.LATITUDE <= -33)
subset = subset.where(in_bbox, drop=True)
subset

Unfortunately, the current process is time-consuming because it has to download every chunk of LATITUDE and LONGITUDE just to locate the matching data, which defeats the purpose of a chunked store. I'm looking for something more efficient, similar to creating an index on a column in PostgreSQL, for instance.
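To make the "index on a column" idea concrete, here is a rough sketch of what I have in mind: a small sidecar table holding the latitude/longitude bounding box of each TIME chunk, so that a query only has to touch chunks whose box overlaps the requested one. This is NumPy only and uses made-up toy data; the function names `build_chunk_bbox_index` and `candidate_chunks` are hypothetical, not part of xarray or Zarr, and it assumes consecutive samples are spatially close (as in a trajectory), otherwise every chunk would match.

```python
import numpy as np


def build_chunk_bbox_index(lat, lon, chunk_size):
    """Return an (n_chunks, 4) array of [lat_min, lat_max, lon_min, lon_max].

    Hypothetical helper: one row per chunk along TIME.
    """
    n = lat.shape[0]
    starts = np.arange(0, n, chunk_size)
    index = np.empty((starts.size, 4), dtype=np.float64)
    for i, start in enumerate(starts):
        stop = min(start + chunk_size, n)
        index[i] = (
            np.nanmin(lat[start:stop]),
            np.nanmax(lat[start:stop]),
            np.nanmin(lon[start:stop]),
            np.nanmax(lon[start:stop]),
        )
    return index


def candidate_chunks(index, lat_min, lat_max, lon_min, lon_max):
    """Return the chunk numbers whose bounding box overlaps the query box."""
    overlaps = (
        (index[:, 1] >= lat_min)
        & (index[:, 0] <= lat_max)
        & (index[:, 3] >= lon_min)
        & (index[:, 2] <= lon_max)
    )
    return np.flatnonzero(overlaps)


# Toy data: 3 chunks of 4 samples each (the real dataset uses chunks of 10000)
lat = np.array([-33.5, -33.5, -33.5, -33.5,
                -20.0, -20.0, -20.0, -20.0,
                -33.2, -33.8, -40.0, -41.0])
lon = np.array([151.5, 151.5, 151.5, 151.5,
                140.0, 140.0, 140.0, 140.0,
                151.1, 151.9, 150.0, 150.0])

index = build_chunk_bbox_index(lat, lon, chunk_size=4)
hits = candidate_chunks(index, lat_min=-34, lat_max=-33, lon_min=151, lon_max=152)
print(hits)  # chunks 0 and 2 overlap the query box; chunk 1 is skipped
```

With such a table, only the candidate chunks would need to be fetched and masked exactly; a chunk that overlaps the box can still contain points outside it, so the fine-grained filter is still required on those chunks. What I don't know is whether something like this already exists in the xarray/Zarr ecosystem, or how best to keep it up to date as the dataset grows.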

I've tried several approaches, including indexing, using a pandas multiindex, and attempting regridding (though it seems impractical due to the array's size). However, I find myself going in circles, and I can't believe no one else has encountered this issue.

I'm open to suggestions, including rewriting this Zarr dataset, bearing in mind that the dataset is not static and grows with new trajectories and timestamps over time.
