With xarray.Dataset.to_zarr it is possible to write an xarray Dataset to a .zarr
file and append new data along a dimension using the append_dim
parameter.
However, if the new data contains a coordinate value that already exists along this dimension, the existing data is not replaced. Instead, the same coordinate appears twice in the resulting dataset.
Example using the data from here:
Here I write two Datasets to the same .zarr file. The datasets are appended along the space
dimension. Both datasets contain the same space coordinate "IL".
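import numpy as np
import pandas as pd
import xarray as xr

# First dataset: stations "IA" and "IL"
ds_A = xr.DataArray(
    np.random.rand(4, 2),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL"]),
    ],
).to_dataset(name="measurements")
# Second dataset: stations "IL" (again) and "NY"
ds_B = xr.DataArray(
    np.random.rand(4, 2),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IL", "NY"]),
    ],
).to_dataset(name="measurements")
# Write both datasets to the same store, appending along "space"
ds_A.to_zarr("weather.zarr", append_dim="space")
ds_B.to_zarr("weather.zarr", append_dim="space")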
When reading the file, the second dataset didn't overwrite the data for the "IL"
coordinate, but created a new one:
xr.open_zarr("weather.zarr")
<xarray.Dataset>
Dimensions:       (space: 4, time: 4)
Coordinates:
  * space         (space) <U2 'IA' 'IL' 'IL' 'NY'
  * time          (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04
Data variables:
    measurements  (time, space) float64 dask.array<chunksize=(4, 2), meta=np.ndarray>
This would be the desired result:
<xarray.Dataset>
Dimensions:       (space: 3, time: 4)
Coordinates:
  * space         (space) <U2 'IA' 'IL' 'NY'
  * time          (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04
Data variables:
    measurements  (time, space) float64 dask.array<chunksize=(3, 2), meta=np.ndarray>
Does anybody know if it is possible to replace the data if the coordinate already exists?
I don't think there's an out-of-the-box way to do this. Appending always adds the full dataset to the end, regardless of which coordinates already exist.
However, version 0.16.2 of xarray introduced the keyword region to to_zarr, which lets you write to a limited region of a zarr file. You can use it to overwrite the existing data: