Xarray open_mfdataset with engine Zarr #4187
Conversation
…hich matches almost exactly that of ``xr.open_mfdataset``, but without ``engine``
…ctory paths/strings
These all sound good to me! I agree that we shouldn't change the default behavior for
xarray/backends/zarr.py
Outdated
if "auto_chunk" in kwargs: | ||
auto_chunk = kwargs.pop("auto_chunk") | ||
if auto_chunk: | ||
chunks = "auto" # maintain backwards compatibility | ||
else: | ||
chunks = None | ||
|
||
warnings.warn( | ||
"auto_chunk is deprecated. Use chunks='auto' instead.", | ||
FutureWarning, | ||
stacklevel=2, | ||
) |
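For context, a minimal sketch of how this shim behaves from the caller's side, assuming the snippet above lives in `open_zarr` (the store path is a placeholder):

```python
import warnings

import xarray as xr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # old-style keyword still works, but now warns and maps to chunks="auto"
    ds = xr.open_zarr("example.zarr", auto_chunk=True)  # placeholder path

assert any(issubclass(w.category, FutureWarning) for w in caught)
```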
For now I would suggest keeping this compatibility code and not (yet) marking `open_zarr` as deprecated.
OK, I can revert the `open_zarr` deprecation warning. Edit: done at 40c4d46.
Should I also keep the "auto_chunk" compatibility here?
```
@@ -487,10 +489,40 @@ def maybe_decode_store(store, lock=False):
        )
        name_prefix = "open_dataset-%s" % token
        ds2 = ds.chunk(chunks, name_prefix=name_prefix, token=token)
        ds2._file_obj = ds._file_obj

    elif engine == "zarr":
```
My main concern with this code is that introducing an entirely separate code path inside `open_dataset()` for chunking zarr in particular feels strange and a little unexpected. Any time we use totally separate code branches for some logic, the odds of introducing inconsistencies/bugs increase greatly.

I wonder if we could consolidate this logic somehow in order to avoid adding a separate branch for the code here? For example, we could put a `get_chunk` method on all xarray backend classes, even if it currently only returns a filler value and/or raises an error for `chunks='auto'`? Chunking is not unique to zarr, e.g., netCDF4 files also have chunks, although the default "auto" chunking logic should probably be different.

I would be OK holding this off for a later clean-up, but this really would be worth doing eventually. CC @alexamici RE: the backends refactor.
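A rough sketch of what that consolidation could look like; the class and method names below are hypothetical illustrations, not xarray's actual backend API:

```python
# Hypothetical sketch: every backend store exposes the same hook, so
# open_dataset() can ask for on-disk chunking without a zarr-only branch.

class AbstractDataStore:
    def get_chunk(self, var):
        """Return the on-disk chunk sizes for a variable, or None."""
        return None  # default: backend has no native chunk information


class HypotheticalZarrStore(AbstractDataStore):
    def get_chunk(self, var):
        # zarr variables record their chunk shape in the encoding
        chunks = var.encoding.get("chunks")
        return dict(zip(var.dims, chunks)) if chunks else None
```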
> My main concern with this code is that introducing an entirely separate code path inside `open_dataset()` for chunking zarr in particular feels strange and a little unexpected. Any time we use totally separate code branches for some logic, the odds of introducing inconsistencies/bugs increase greatly.
>
> I wonder if we could consolidate this logic somehow in order to avoid adding a separate branch for the code here? For example, we could put a `get_chunk` method on all xarray backend classes, even if it currently only returns a filler value and/or raises an error for `chunks='auto'`? Chunking is not unique to zarr, e.g., netCDF4 files also have chunks, although the default "auto" chunking logic should probably be different.
Thanks for pointing this out! I agree completely that the `open_dataset()` function is overdue for a refactor; it was a nightmare to go through all the if-then branches, but the comprehensive test suite helped to catch most of the bugs, and I've tested it on my own real-world dataset, so the logic should be OK for now 🤞.
> I would be OK holding this off for a later clean-up, but this really would be worth doing eventually. CC @alexamici RE: the backends refactor.
Personally I would prefer to hold this off, since this `open_mfdataset` PR (and the previous one at #4003) has been sitting around for months, and I've had to resolve quite a few merge conflicts to keep up. There's no point in contaminating this complex PR by refactoring the netCDF logic either.
xarray/backends/api.py
Outdated
```python
# auto chunking needs to be here and not in ZarrStore because
# the variable chunks does not survive decode_cf
# return trivial case
if not chunks:  # e.g. chunks is 0, None or {}
```
Can we make this `if chunks is None` instead?

I know this is a discrepancy from how `open_zarr()` works today, but currently `open_dataset(..., chunks={})` is a way to open a dataset with dask chunks equal to the full size of any arrays. I doubt the (different) behavior of `open_zarr` in this case was intentional...
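To make the distinction concrete, a hedged sketch of the three call patterns under discussion (`example.zarr` is a placeholder store):

```python
import xarray as xr

# chunks=None: no dask at all; variables are lazily loaded arrays
ds_lazy = xr.open_dataset("example.zarr", engine="zarr", chunks=None)

# chunks={}: dask-backed, with one chunk spanning each full array
ds_whole = xr.open_dataset("example.zarr", engine="zarr", chunks={})

# chunks="auto": dask chunks derived from the on-disk zarr chunking
ds_auto = xr.open_dataset("example.zarr", engine="zarr", chunks="auto")
```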
I'll try it locally and see if the tests break; I felt like there was a reason it had to be `not chunks`, but I can't remember the context now.
Thanks! Not a big deal if we have to push off this clean-up.
Yep, 2 tests would break with a `ZeroDivisionError` if we switch to `if chunks is None`. Specifically:

- TestZarrDictStore.test_manual_chunk
- TestZarrDirectoryStore.test_manual_chunk

Related to my comment hidden in the mess above at #4187 (comment) 😄
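For clarity, a small illustration of the difference between the two checks (plain Python, not code taken from the PR):

```python
for chunks in (None, 0, {}):
    # current check: all three values are falsy, so all skip dask chunking
    print(chunks, "-> skip" if not chunks else "-> chunk")

for chunks in (None, 0, {}):
    # proposed check: only None skips; 0 and {} fall through to the
    # dask-chunking path (where chunks=0 led to the reported ZeroDivisionError)
    print(chunks, "-> skip" if chunks is None else "-> chunk")
```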
Well, I would expect `test_manual_chunk` to fail here: it is explicitly verifying that `chunks=0` and `chunks={}` result in in-memory numpy arrays. Does it work if you remove those cases, e.g., by setting `NO_CHUNKS = (None,)`?
Yes, setting `NO_CHUNKS = (None,)` works with `if chunks is None`. Shall I make the change?
If it helps, by the way: `test_manual_chunk` with `NO_CHUNKS = (None, 0, {})` was added in #2530.
Yes, I think we can go ahead and change that. It doesn't look like that was carefully evaluated in #2530.
ca4e526 to 2c73e0b
xarray/backends/api.py
Outdated
```diff
-    chunk for all arrays.
+    chunk for all arrays. When using ``engine="zarr"``, if ``chunks='auto'``,
+    dask chunks are created based on the variable's zarr chunks, and if
+    ``chunks=None``, zarr array data will lazily convert to numpy arrays upon
```
This behavior for `chunks=None` is the same for all backends. The only special behavior for zarr is `chunks='auto'`.
Ok, will update the docs.
xarray/tests/test_backends.py
Outdated
```python
NO_CHUNKS = (None,)
for no_chunk in NO_CHUNKS:
```
Could you remove the loop here now?
Sure, I did think of that actually!
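A minimal sketch of that simplification (the surrounding test body and the `open_kwargs` name are assumptions, not code copied from the PR):

```python
# before: a loop over a single-element tuple
# NO_CHUNKS = (None,)
# for no_chunk in NO_CHUNKS:
#     open_kwargs = {"chunks": no_chunk}
#     ...

# after: inline the lone remaining value
open_kwargs = {"chunks": None}  # hypothetical name for the test's kwargs dict
```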
Thanks @weiji14 and @Mikejmnez for your contribution!

Note that `zarr.open*` now works with fsspec URLs (in master).

Thanks @weiji14 and @Mikejmnez. This is a great contribution.
…pagate-attrs

* 'propagate-attrs' of github.com:dcherian/xarray: (22 commits)
  - silence sphinx warnings about broken rst (pydata#4448)
  - Xarray open_mfdataset with engine Zarr (pydata#4187)
  - Fix release notes formatting (pydata#4443)
  - fix typo in io.rst (pydata#4250)
  - Fix typo (pydata#4181)
  - Fix release notes typo
  - New whatsnew section
  - Add notes re doctests (pydata#4440)
  - Fixed dask.optimize on datasets (pydata#4438)
  - Release notes for 0.16.1 (pydata#4435)
  - Small updates to How-to-release + lint (pydata#4436)
  - Fix doctests (pydata#4439)
  - add a ci for doctests (pydata#4437)
  - preserve original dimension, coordinate and variable order in ``concat`` (pydata#4419)
  - Fix for h5py deepcopy issues (pydata#4426)
  - Keep the original ordering of the coordinates (pydata#4409)
  - Clearer Vectorized Indexing example (pydata#4433)
  - Revert "Fix optimize for chunked DataArray (pydata#4432)" (pydata#4434)
  - Fix optimize for chunked DataArray (pydata#4432)
  - fix doc dataarray to netcdf (pydata#4424)
  - ...
Work on enabling `xr.open_dataset(..., engine="zarr")`, to ~~replace~~ complement `xr.open_zarr`. This also allows `xr.open_mfdataset(..., engine="zarr")` to be used.

Note: Credit should be given to @Mikejmnez, I'm just continuing this on from #4003.
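A hedged usage sketch of what this PR enables (store paths are placeholders):

```python
import xarray as xr

# open multiple zarr stores as one dataset; engine="zarr" routes each
# path through the zarr backend instead of the netCDF default
ds = xr.open_mfdataset(
    ["air_2000.zarr", "air_2001.zarr"],  # placeholder paths
    engine="zarr",
    combine="by_coords",
)
```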
Passes `isort -rc . && black . && mypy . && flake8`
`whats-new.rst`