Xarray open_mfdataset with engine Zarr #4187
Conversation
…hich matches almost exactly that of ``xr.open_mfdataset``, but without ``engine``
…ctory paths/strings
These all sound good to me! I agree that we shouldn't change the default behavior for
xarray/backends/zarr.py
Outdated
if "auto_chunk" in kwargs: | ||
auto_chunk = kwargs.pop("auto_chunk") | ||
if auto_chunk: | ||
chunks = "auto" # maintain backwards compatibility | ||
else: | ||
chunks = None | ||
|
||
warnings.warn( | ||
"auto_chunk is deprecated. Use chunks='auto' instead.", | ||
FutureWarning, | ||
stacklevel=2, | ||
) |
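For context, a minimal sketch of how this shim behaves from the caller's side, assuming the snippet above lives in `open_zarr` (the store path is a placeholder):

```python
import warnings

import xarray as xr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # old-style keyword still works, but now warns and maps to chunks="auto"
    ds = xr.open_zarr("example.zarr", auto_chunk=True)  # placeholder path

assert any(issubclass(w.category, FutureWarning) for w in caught)
```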
For now I would suggest keeping this compatibility code and not (yet) marking `open_zarr` as deprecated.
OK, I can revert the `open_zarr` deprecation warning. Edit: done at 40c4d46.
Should I also keep the "auto_chunk" compatibility here?
```
@@ -487,10 +489,40 @@ def maybe_decode_store(store, lock=False):
        )
        name_prefix = "open_dataset-%s" % token
        ds2 = ds.chunk(chunks, name_prefix=name_prefix, token=token)
        ds2._file_obj = ds._file_obj

    elif engine == "zarr":
```
My main concern with this code is that introducing an entirely separate code path inside `open_dataset()` for chunking zarr in particular feels strange and a little unexpected. Any time we use totally separate code branches for some logic, the odds of introducing inconsistencies/bugs increase greatly.

I wonder if we could consolidate this logic somehow in order to avoid adding a separate branch for the code here? For example, we could put a `get_chunk` method on all xarray backend classes, even if it currently only returns a filler value and/or raises an error for `chunks='auto'`? Chunking is not unique to zarr, e.g., netCDF4 files also have chunks, although the default "auto" chunking logic should probably be different.

I would be OK holding this off for a later clean-up, but this really would be worth doing eventually. CC @alexamici RE: the backends refactor.
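A rough sketch of what that consolidation could look like; the class and method names below are hypothetical illustrations, not xarray's actual backend API:

```python
# Hypothetical sketch: every backend store exposes the same hook, so
# open_dataset() can ask for on-disk chunking without a zarr-only branch.

class AbstractDataStore:
    def get_chunk(self, var):
        """Return the on-disk chunk sizes for a variable, or None."""
        return None  # default: backend has no native chunk information


class HypotheticalZarrStore(AbstractDataStore):
    def get_chunk(self, var):
        # zarr variables record their chunk shape in the encoding
        chunks = var.encoding.get("chunks")
        return dict(zip(var.dims, chunks)) if chunks else None
```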
> My main concern with this code is that introducing an entirely separate code path inside `open_dataset()` for chunking zarr in particular feels strange and a little unexpected. Any time we use totally separate code branches for some logic, the odds of introducing inconsistencies/bugs increase greatly.
>
> I wonder if we could consolidate this logic somehow in order to avoid adding a separate branch for the code here? For example, we could put a `get_chunk` method on all xarray backend classes, even if it currently only returns a filler value and/or raises an error for `chunks='auto'`? Chunking is not unique to zarr, e.g., netCDF4 files also have chunks, although the default "auto" chunking logic should probably be different.
Thanks for pointing this out! I agree completely that the `open_dataset()` function is overdue for a refactor; it was a nightmare to go through all the if-then branches, but the comprehensive test suite helped to catch most of the bugs, and I've tested it on my own real-world dataset, so the logic should be OK for now 🤞.
> I would be OK holding this off for a later clean-up, but this really would be worth doing eventually. CC @alexamici RE: the backends refactor.
Personally I would prefer to hold this off, since this `open_mfdataset` PR (and the previous one at #4003) has been sitting around for months, and I've had to resolve quite a few merge conflicts to keep up. There's no point in contaminating this complex PR by refactoring the netCDF logic either.
xarray/backends/api.py
Outdated
```python
# auto chunking needs to be here and not in ZarrStore because
# the variable chunks does not survive decode_cf
# return trivial case
if not chunks:  # e.g. chunks is 0, None or {}
```
Can we make this `if chunks is None` instead?

I know this is a discrepancy from how `open_zarr()` works today, but currently `open_dataset(..., chunks={})` is a way to open a dataset with dask chunks equal to the full size of any arrays. I doubt the (different) behavior of `open_zarr` in this case was intentional...
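To make the distinction concrete, a hedged sketch of the three call patterns under discussion (`example.zarr` is a placeholder store):

```python
import xarray as xr

# chunks=None: no dask at all; variables are lazily loaded arrays
ds_lazy = xr.open_dataset("example.zarr", engine="zarr", chunks=None)

# chunks={}: dask-backed, with one chunk spanning each full array
ds_whole = xr.open_dataset("example.zarr", engine="zarr", chunks={})

# chunks="auto": dask chunks derived from the on-disk zarr chunking
ds_auto = xr.open_dataset("example.zarr", engine="zarr", chunks="auto")
```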
I'll try it locally and see if the tests break; I felt like there was a reason it had to be `not chunks`, but I can't remember the context now.
Thanks! Not a big deal if we have to push off this clean-up.
Yep, 2 tests would break with a `ZeroDivisionError` if we switch to `if chunks is None`. Specifically:

- TestZarrDictStore.test_manual_chunk
- TestZarrDirectoryStore.test_manual_chunk

Related to my comment hidden in the mess above at #4187 (comment) 😄
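For clarity, a small illustration of the difference between the two checks (plain Python, not code taken from the PR):

```python
for chunks in (None, 0, {}):
    # current check: all three values are falsy, so all skip dask chunking
    print(chunks, "-> skip" if not chunks else "-> chunk")

for chunks in (None, 0, {}):
    # proposed check: only None skips; 0 and {} fall through to the
    # dask-chunking path (where chunks=0 led to the reported ZeroDivisionError)
    print(chunks, "-> skip" if chunks is None else "-> chunk")
```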
Well, I would expect `test_manual_chunk` to fail here: it is explicitly verifying that `chunks=0` and `chunks={}` result in in-memory numpy arrays. Does it work if you remove those cases, e.g., by setting `NO_CHUNKS = (None,)`?
Yes, setting `NO_CHUNKS = (None,)` works with `if chunks is None`. Shall I make the change?
If it helps, by the way: `test_manual_chunk` with `NO_CHUNKS = (None, 0, {})` was added in #2530.
Yes, I think we can go ahead and change that. It doesn't look like that was carefully evaluated in #2530.
ca4e526 to 2c73e0b
xarray/backends/api.py
Outdated
```diff
-    chunk for all arrays.
+    chunk for all arrays. When using ``engine="zarr"``, if ``chunks='auto'``,
+    dask chunks are created based on the variable's zarr chunks, and if
+    ``chunks=None``, zarr array data will lazily convert to numpy arrays upon
```
This behavior for `chunks=None` is the same for all backends. The only special behavior for zarr is `chunks='auto'`.
Ok, will update the docs.
xarray/tests/test_backends.py
Outdated
```python
NO_CHUNKS = (None,)
for no_chunk in NO_CHUNKS:
```
Could you remove the loop here now?
Sure, I did think of that actually!
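A minimal sketch of that simplification (the surrounding test body and the `open_kwargs` name are assumptions, not code copied from the PR):

```python
# before: a loop over a single-element tuple
# NO_CHUNKS = (None,)
# for no_chunk in NO_CHUNKS:
#     open_kwargs = {"chunks": no_chunk}
#     ...

# after: inline the lone remaining value
open_kwargs = {"chunks": None}  # hypothetical name for the test's kwargs dict
```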
Thanks @weiji14 and @Mikejmnez for your contribution!

Note that `zarr.open*` now works with fsspec URLs (in master).

Thanks @weiji14 and @Mikejmnez. This is a great contribution.
…pagate-attrs

* 'propagate-attrs' of github.com:dcherian/xarray: (22 commits)
  - silence sphinx warnings about broken rst (pydata#4448)
  - Xarray open_mfdataset with engine Zarr (pydata#4187)
  - Fix release notes formatting (pydata#4443)
  - fix typo in io.rst (pydata#4250)
  - Fix typo (pydata#4181)
  - Fix release notes typo
  - New whatsnew section
  - Add notes re doctests (pydata#4440)
  - Fixed dask.optimize on datasets (pydata#4438)
  - Release notes for 0.16.1 (pydata#4435)
  - Small updates to How-to-release + lint (pydata#4436)
  - Fix doctests (pydata#4439)
  - add a ci for doctests (pydata#4437)
  - preserve original dimension, coordinate and variable order in ``concat`` (pydata#4419)
  - Fix for h5py deepcopy issues (pydata#4426)
  - Keep the original ordering of the coordinates (pydata#4409)
  - Clearer Vectorized Indexing example (pydata#4433)
  - Revert "Fix optimize for chunked DataArray (pydata#4432)" (pydata#4434)
  - Fix optimize for chunked DataArray (pydata#4432)
  - fix doc dataarray to netcdf (pydata#4424)
  - ...
Work on enabling `xr.open_dataset(..., engine="zarr")`, to ~~replace~~ complement `xr.open_zarr`. This also allows `xr.open_mfdataset(..., engine="zarr")` to be used.

Note: Credit should be given to @Mikejmnez, I'm just continuing this on from #4003.
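A hedged usage sketch of what this PR enables (store paths are placeholders):

```python
import xarray as xr

# open multiple zarr stores as one dataset; engine="zarr" routes each
# path through the zarr backend instead of the netCDF default
ds = xr.open_mfdataset(
    ["air_2000.zarr", "air_2001.zarr"],  # placeholder paths
    engine="zarr",
    combine="by_coords",
)
```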
Passes `isort -rc . && black . && mypy . && flake8`
`whats-new.rst`