to_zarr: region not recognised as dataset dimensions #6069
Comments
Hi @Boorhin,

import xarray as xr
from datetime import datetime, timedelta
import numpy as np

dt = datetime.now()
times = np.arange(dt, dt + timedelta(days=6), timedelta(hours=1))
nodesx, nodesy, layers = np.arange(10, 50), np.arange(10, 50) + 15, np.arange(10)
ds = xr.Dataset()
ds.coords['time'] = ('time', times)
ds.coords['node_x'] = ('node', nodesx)
ds.coords['node_y'] = ('node', nodesy)
ds.coords['layer'] = ('layer', layers)
outfile = 'my_zarr'
varnames = ['potato', 'banana', 'apple']
for var in varnames:
    ds[var] = (('time', 'layer', 'node'), np.zeros((len(times), len(layers), len(nodesx))))
ds.to_zarr(outfile, mode='a')
for t in range(len(times)):
    for var in varnames:
        ds[var].isel(time=slice(t)).values += np.random.random((len(layers), len(nodesx)))
    ds.isel(time=slice(t)).to_zarr(outfile, region={"time": slice(t)})

This, however, leads to another issue:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-52-bb3d2c1adc12> in <module>
18 for var in varnames:
19 ds[var].isel(time=slice(t)).values += np.random.random((len(layers),len(nodesx)))
---> 20 ds.isel(time=slice(t)).to_zarr(outfile, region={"time": slice(t)})
~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks)
2029 encoding = {}
2030
-> 2031 return to_zarr(
2032 self,
2033 store=store,
~/.local/lib/python3.8/site-packages/xarray/backends/api.py in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks)
1359
1360 if region is not None:
-> 1361 _validate_region(dataset, region)
1362 if append_dim is not None and append_dim in region:
1363 raise ValueError(
~/.local/lib/python3.8/site-packages/xarray/backends/api.py in _validate_region(ds, region)
1272 ]
1273 if non_matching_vars:
-> 1274 raise ValueError(
1275 f"when setting `region` explicitly in to_zarr(), all "
1276 f"variables in the dataset to write must have at least "
ValueError: when setting `region` explicitly in to_zarr(), all variables in the dataset to write must have at least one dimension in common with the region's dimensions ['time'], but that is not the case for some variables here. To drop these variables from this dataset before exporting to zarr, write: .drop(['node_x', 'node_y', 'layer'])

Here, however, the solution is provided in the error message. Following its instructions, the snippet below finally works (as far as I can tell):

import xarray as xr
from datetime import datetime, timedelta
import numpy as np

dt = datetime.now()
times = np.arange(dt, dt + timedelta(days=6), timedelta(hours=1))
nodesx, nodesy, layers = np.arange(10, 50), np.arange(10, 50) + 15, np.arange(10)
ds = xr.Dataset()
ds.coords['time'] = ('time', times)
# ds.coords['node_x'] = ('node', nodesx)
# ds.coords['node_y'] = ('node', nodesy)
# ds.coords['layer'] = ('layer', layers)
outfile = 'my_zarr'
varnames = ['potato', 'banana', 'apple']
for var in varnames:
    ds[var] = (('time', 'layer', 'node'), np.zeros((len(times), len(layers), len(nodesx))))
ds.to_zarr(outfile, mode='a')
for t in range(len(times)):
    for var in varnames:
        ds[var].isel(time=slice(t)).values += np.random.random((len(layers), len(nodesx)))
    ds.isel(time=slice(t)).to_zarr(outfile, region={"time": slice(t)})

Maybe one would like to generalise this. Cheers |
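An alternative to commenting the coordinates out (a sketch, not from the comment above) is to keep node_x, node_y and layer in the dataset and in the initial write, and drop them only from each region write, as the error message suggests. Writing one timestep at a time with slice(t, t + 1), rather than the growing slice(t) used above, is an assumption about the intended behaviour:

# assumes the dataset, outfile and initial ds.to_zarr(outfile, mode='a') from the first snippet
for t in range(len(times)):
    ds.isel(time=slice(t, t + 1)).drop_vars(['node_x', 'node_y', 'layer']).to_zarr(
        outfile, region={"time": slice(t, t + 1)})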
I don't get the second crash. It is not true that these variables have nothing in common: they are the coordinates of each of the variables, and they are all constructed the same way. This is a typical example of saving an unstructured grid. Meanwhile, I found an alternative solution which is also better for memory management. I think the documentation example doesn't actually work. |
You are right, the coordinates should not be dropped. I think the function _validate_region has a bug: currently it requires every variable in the dataset, including the coordinate variables, to share a dimension with the region, rather than only the data variables. Changing the function to

def _validate_region(ds, region):
    if not isinstance(region, dict):
        raise TypeError(f"``region`` must be a dict, got {type(region)}")

    for k, v in region.items():
        if k not in ds.dims:
            raise ValueError(
                f"all keys in ``region`` are not in Dataset dimensions, got "
                f"{list(region)} and {list(ds.dims)}"
            )
        if not isinstance(v, slice):
            raise TypeError(
                "all values in ``region`` must be slice objects, got "
                f"region={region}"
            )
        if v.step not in {1, None}:
            raise ValueError(
                "step on all slices in ``region`` must be 1 or None, got "
                f"region={region}"
            )

    non_matching_vars = [
        k for k, v in ds.data_vars.items() if not set(region).intersection(v.dims)
    ]
    if non_matching_vars:
        raise ValueError(
            f"when setting `region` explicitly in to_zarr(), all "
            f"variables in the dataset to write must have at least "
            f"one dimension in common with the region's dimensions "
            f"{list(region.keys())}, but that is not "
            f"the case for some variables here. To drop these variables "
            f"from this dataset before exporting to zarr, write: "
            f".drop({non_matching_vars!r})"
        )

seems to work. |
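As a rough illustration of what the changed check evaluates, using the dataset from the first snippet in this thread (this is not code from the comment above):

region = {"time": slice(0, 1)}
# only data variables are inspected; coordinates such as node_x, node_y and
# layer no longer trigger the error
non_matching_vars = [
    k for k, v in ds.data_vars.items() if not set(region).intersection(v.dims)
]
# every data variable ('potato', 'banana', 'apple') has a 'time' dimension,
# so non_matching_vars == [] and no ValueError is raised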
The reason this isn't allowed is that it's ambiguous what to do with the other variables that are not restricted to the region (['cell', 'face', 'layer', 'max_cell_node', 'max_face_nodes', 'node', 'siglay'] in this case). I can imagine quite a few different ways this behavior could be implemented:
I believe your proposal here (removing these checks from _validate_region) is one of them. (4) seems like perhaps the most user-friendly option, but, as I found when experimenting with adding it, checking existing variables can add significant overhead. The current solution is not to do any of these, but to force the user to make an explicit choice: drop the extra variables, or write them in a separate call to to_zarr(). |
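As a concrete illustration of the "separate call" route with the names from this thread (the use of drop_dims here is my assumption, not quoted from the comment):

# a separate, one-off call that writes only the variables without a 'time' dimension
ds.drop_dims('time').to_zarr(outfile, mode='a')
# the per-timestep region writes can then drop those same variables,
# as in the sketch earlier in this thread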
If xarray/zarr is to replace netCDF, appending by time step is really an important feature. With a buffer system, I create a new dataset for each buffer with the right data at the right place, meaning only the time interval concerned, and I write it. At the end I write all the parameters before closing the main dataset. To my knowledge, that's the only method which works. |
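The comment above doesn't include code; a minimal sketch of such a buffered workflow, reusing the names from the first snippet in this thread (times, varnames, layers, nodesx, nodesy, outfile) and assuming append_dim writes and a buffer size of one day, might look like:

import numpy as np
import xarray as xr

buffer_size = 24  # e.g. one day of hourly steps (assumption; the comment gives no number)
for start in range(0, len(times), buffer_size):
    stop = min(start + buffer_size, len(times))
    # a small dataset holding only this time interval, with freshly computed values
    buf = xr.Dataset(
        {var: (('time', 'layer', 'node'),
               np.random.random((stop - start, len(layers), len(nodesx))))
         for var in varnames},
        coords={'time': ('time', times[start:stop])},
    )
    if start == 0:
        buf.to_zarr(outfile, mode='w')            # the first buffer creates the store
    else:
        buf.to_zarr(outfile, append_dim='time')   # later buffers are appended

# at the end, write the static parameters once
xr.Dataset(coords={'node_x': ('node', nodesx),
                   'node_y': ('node', nodesy),
                   'layer': ('layer', layers)}).to_zarr(outfile, mode='a')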
I'm trying to picture some usage scenarios based on incrementally adding timesteps to data on a store; I hope these might help to answer the questions from above. I'll use the following dataset for demonstration code:

ds = xr.Dataset({
    "T": (("time", "x"), [[1., 2., 3.], [11., 12., 13.]]),
}, coords={
    "time": (("time",), [21., 22.]),
    "x": (("x",), [100., 200., 300.])
}).chunk({"time": 1})
|
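For instance, incrementally adding one more timestep to this demonstration dataset on a store might look like the following sketch (the in-memory store and the use of append_dim are assumptions, not part of the comment above):

import xarray as xr

store = {}
ds.to_zarr(store)                                   # initial write of both timesteps

# a later timestep arrives and is appended along "time"
new_step = xr.Dataset(
    {"T": (("time", "x"), [[21., 22., 23.]])},
    coords={"time": (("time",), [23.]), "x": (("x",), [100., 200., 300.])},
).chunk({"time": 1})
new_step.to_zarr(store, append_dim="time")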
While testing a bit further, I found another case which might potentially be dangerous:

# ds is the same as above, but chunksize is {"time": 1, "x": 1}
# once on the coordinator
ds.to_zarr("test.zarr", compute=False, encoding={"time": {"chunks": [1]}, "x": {"chunks": [1]}})
# in parallel
ds.isel(time=slice(0,1), x=slice(0,1)).to_zarr("test.zarr", mode="r+", region={"time": slice(0,1), "x": slice(0,1)})
ds.isel(time=slice(0,1), x=slice(1,2)).to_zarr("test.zarr", mode="r+", region={"time": slice(0,1), "x": slice(1,2)})
ds.isel(time=slice(0,1), x=slice(2,3)).to_zarr("test.zarr", mode="r+", region={"time": slice(0,1), "x": slice(2,3)})
ds.isel(time=slice(1,2), x=slice(0,1)).to_zarr("test.zarr", mode="r+", region={"time": slice(1,2), "x": slice(0,1)})
ds.isel(time=slice(1,2), x=slice(1,2)).to_zarr("test.zarr", mode="r+", region={"time": slice(1,2), "x": slice(1,2)})
ds.isel(time=slice(1,2), x=slice(2,3)).to_zarr("test.zarr", mode="r+", region={"time": slice(1,2), "x": slice(2,3)})

This example doesn't produce any error, but the |
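One way to guard parallel region writes like the above is to check that each region starts and ends on the chunk boundaries already on disk before writing; a minimal sketch with a hypothetical helper (not part of xarray):

import zarr

def region_is_chunk_aligned(store, var, axis, sl):
    """Return True if slice `sl` along `axis` of on-disk array `var`
    starts and ends on chunk boundaries (hypothetical helper)."""
    arr = zarr.open(store)[var]
    chunk = arr.chunks[axis]
    ends_ok = sl.stop % chunk == 0 or sl.stop == arr.shape[axis]
    return sl.start % chunk == 0 and ends_ok

# e.g. before one of the parallel writes above:
# region_is_chunk_aligned("test.zarr", "T", 0, slice(0, 1))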
I have looked at these examples and I still don't manage to make it work in the real world. |
I did make
|
I don't yet know a proper answer, but there are three observations I have:
|
The _FillValue is always the same (np.nan) and specified when I reproject with rioxarray. |
🤷 can't help any further without a minimal reproducible example here... |
OK, that's not exactly the same error message; I could not even start the appending. But that's basically one example that could be tested. A model would want to compute each of these variables step by step and variable by variable, and save them at each single iteration. There is no need for concurrent writing, as most of the resources are focused on the modelling.

import xarray as xr
from rasterio.enums import Resampling
import numpy as np

ds = xr.tutorial.open_dataset('air_temperature').isel(time=0)
ds = ds.rio.write_crs('EPSG:4326')
dst = ds.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
dst.to_zarr('test.zarr')

Returns
|
This error is unrelated to region or append writes. The dataset has been reprojected (it now contains NaNs), but still carries encoding information from the original source file (e.g. an integer on-disk dtype). The encoding gets picked up by to_zarr, which is what produces the error. |
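A minimal way around that conflict, which the reproduction in the next comment also uses, is to clear the stale encoding on the reprojected variable before writing (mode='w' here is an assumption):

# drop the encoding inherited from the source file so to_zarr derives a fresh one
dst.air.encoding = {}
dst.to_zarr('test.zarr', mode='w')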
Ok, I believe I've now reproduced your error:

import xarray as xr
from rasterio.enums import Resampling
import numpy as np

ds = xr.tutorial.open_dataset('air_temperature').isel(time=0)
ds = ds.rio.write_crs('EPSG:4326')
dst = ds.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
dst.air.encoding = {}
dst = dst.assign(air=dst.air.expand_dims("time"), time=dst.time.expand_dims("time"))
m = {}
dst.to_zarr(m)
dst.to_zarr(m, append_dim="time")

raises:

This seems to be due to the handling of CF conventions, which might go wrong in the append case: the |
In my case I specify the _FillValue in the reprojection, so I would not think it is an issue to overwrite it. |
btw, as a work-around, it works when removing the _FillValue attribute before appending:

del dst.air.attrs["_FillValue"]
dst.to_zarr(m, append_dim="time")

But still, this might call for another issue to solve. |
Effectively, I have unstable results, sometimes with errors where timesteps refuse to write:

/opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py:2050: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  safe_chunks=safe_chunks,

The crashes are related to the time dimension itself, but time is always of size 1, so it is hard to understand:

/tmp/ipykernel_1629/1269180709.py in aggregate_with_time(farm_name, resolution_M, canvas, W, H, master_raster_coordinates)
39 raster.drop(
40 ['x','y']).to_zarr(
---> 41 uri, mode='a', append_dim='time')
42 #except:
43 #print('something went wrong')
/opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py in to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
2048 append_dim=append_dim,
2049 region=region,
-> 2050 safe_chunks=safe_chunks,
2051 )
2052
/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
1406 _validate_datatypes_for_zarr_append(dataset)
1407 if append_dim is not None:
-> 1408 existing_dims = zstore.get_dimensions()
1409 if append_dim not in existing_dims:
1410 raise ValueError(
/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in get_dimensions(self)
450 if d in dimensions and dimensions[d] != s:
451 raise ValueError(
--> 452 f"found conflicting lengths for dimension {d} "
453 f"({s} != {dimensions[d]})"
454 )
ValueError: found conflicting lengths for dimension time (2 != 1) |
I have tried to specify the chunks before writing the dataset, and I have had some really strange behaviour, with data written into the same chunks; the time dimension never went over 5, growing and shrinking throughout the processing... |
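One pattern that keeps the on-disk chunking fixed for the whole run is to write a full-size template once with compute=False and then only use region writes afterwards; this is a sketch under assumed dimension sizes and names, not code from the comment above:

import numpy as np
import xarray as xr

n_time, n_layer, n_node = 144, 10, 40          # assumed sizes
template = xr.Dataset(
    {"potato": (("time", "layer", "node"),
                np.zeros((n_time, n_layer, n_node)))},
    coords={"time": ("time", np.arange(n_time))},
).chunk({"time": 1})                            # requires dask

template.to_zarr("my_zarr", mode="w", compute=False)   # writes metadata and coordinates only

# each timestep then lands in its own fixed chunk
template.isel(time=slice(0, 1)).to_zarr("my_zarr", mode="r+", region={"time": slice(0, 1)})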
I've made a simpler example of the _FillValue append problem:

import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", [3.], {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m)
ds.to_zarr(m, append_dim="x")

raises

I'd expect this to just work (effectively concatenating the dataset to itself). The workaround:

m = {}
ds.to_zarr(m)
del ds.a.attrs["_FillValue"]
ds.to_zarr(m, append_dim="x")

does the trick, but doesn't look right. @dcherian, @Boorhin, should we make a new (CF-related) issue out of this, and try to keep focusing on the append and region use-cases here, which seemed to be the initial problem in this thread (probably by going further through your example, @Boorhin)? |
@d70-t we can try to branch it off into the CF-related issue, yes. |
👍 to creating a new issue with your minimal example (I think we're just missing a check whether the Dataset and on-disk fill values are equal). It did seem like there were two issues mixed up here. Thanks for confirming that. |
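Such a check might look roughly like this (a hypothetical helper, not existing xarray code):

import numpy as np
import zarr

def fill_values_match(store, name, ds):
    """Compare the fill value already stored in the zarr array with the one
    the in-memory dataset carries (attrs or encoding). Hypothetical helper."""
    on_disk = zarr.open(store)[name].fill_value
    in_mem = ds[name].attrs.get("_FillValue", ds[name].encoding.get("_FillValue"))
    if on_disk is None or in_mem is None:
        return on_disk is in_mem
    if isinstance(on_disk, float) and isinstance(in_mem, float):
        if np.isnan(on_disk) and np.isnan(in_mem):   # treat NaN == NaN as a match
            return True
    return on_disk == in_mem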
I'll set up a new issue. @Boorhin, I couldn't confirm the weirdness with the small example, but will put in a note to your comment. If you can reproduce the weirdness on the minimal example, would you make a comment to the new issue? |
(edited) Am going through old zarr issues. I think the salient part of this issue is, quoting from above:
Would it make sense to provide an option for these? There's a proposal at #6260 for pursuing (3) for coords. I would be +1 on allowing that as an option (though not doing it without an option). |
#8428 is also related to this insofar as it touches how coordinates are handled in append operations. |
What happened:
I am trying to write a dataset to a zarr store one timestep at a time. For that I prepared the dataset, filled the variable values with 0s, and wrote the store with mode='a'. I then perform the operations I need on each variable and try to write that time step of the dataset:

ds.isel(time=t).to_zarr(outfile, region={"time": t})

However, I received this error message:

ValueError: all keys in ``region`` are not in Dataset dimensions, got ['time'] and ['cell', 'face', 'layer', 'max_cell_node', 'max_face_nodes', 'node', 'siglay']

But time is a dimension of my dataset.
Checking in the API, it comes from _validate_region in xarray/backends/api.py.
What you expected to happen:
Incrementally append data to the zarr store
Minimal Complete Verifiable Example:
Anything else we need to know?:
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.8.10 (default, Sep 28 2021, 16:10:42)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-91-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3
xarray: 0.16.2
pandas: 1.2.2
numpy: 1.17.4
scipy: 1.6.2
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.2
cftime: 1.1.0
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: 3.1.2
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: None
setuptools: 45.2.0
pip: 20.0.2
conda: None
pytest: 6.2.1
IPython: 7.13.0
sphinx: None