Xarray open_mfdataset with engine Zarr #4187
My main concern with this code is that introducing an entirely separate code path inside `open_dataset()` for chunking zarr in particular feels strange and a little unexpected. Any time we use totally separate code branches for some logic, the odds of introducing inconsistencies/bugs increase greatly.

I wonder if we could consolidate this logic somehow in order to avoid adding a separate branch for the code here? For example, we could put a `get_chunk` method on all xarray backend classes, even if it currently only returns a filler value and/or raises an error for `chunks='auto'`. Chunking is not unique to zarr; netCDF4 files, for example, also have chunks, although the default "auto" chunking logic should probably be different.

I would be OK holding this off for a later clean-up, but this really would be worth doing eventually. CC @alexamici RE: the backends refactor.
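
A minimal sketch of what that shared hook could look like (all class and method names here are hypothetical, not xarray's actual backend API):

```python
# Hypothetical sketch of a per-backend get_chunk hook; the names are
# illustrative only, not xarray's real backend classes.
class BackendStore:
    """Base behaviour: backends without native chunk information."""

    def get_chunk(self, var, chunks):
        if chunks == "auto":
            raise NotImplementedError(
                "'auto' chunking is not implemented for this backend"
            )
        return {}  # filler value: no on-disk chunk layout to report


class ZarrStore(BackendStore):
    """Zarr stores record their chunk layout alongside the data."""

    def get_chunk(self, var, chunks):
        if chunks == "auto":
            # Use the on-disk zarr chunks as the dask chunk layout.
            return dict(zip(var.dims, var.encoding.get("chunks", ())))
        return super().get_chunk(var, chunks)
```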
Thanks for pointing this out! I agree completely that the `open_dataset()` function is overdue for a refactor; it was a nightmare to go through all the if-then branches, but the comprehensive test suite helped to catch most of the bugs, and I've tested it on my own real-world dataset, so the logic should be OK for now 🤞.

Personally I would prefer to hold this off, since this `open_mfdataset` PR (and the previous one at #4003) has been sitting around for months, and I've had to resolve quite a few merge conflicts to keep up. No point in contaminating this complex PR by refactoring the NetCDF logic either.
Can we make this `if chunks is None` instead?

I know this is a discrepancy from how `open_zarr()` works today, but currently `open_dataset(..., chunks={})` is a way to open a dataset with dask chunks equal to the full size of any arrays. I doubt the (different) behavior of `open_zarr` in this case was intentional...
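
For context, the two guards disagree exactly on the falsy-but-meaningful values discussed here:

```python
# `not chunks` treats 0 and {} the same as None; `chunks is None` does not.
for chunks in (None, 0, {}, "auto"):
    print(f"{chunks!r}: not chunks={not chunks}, chunks is None={chunks is None}")
```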
I'll try it locally and see if the tests break; it felt like there was a reason it had to be `not chunks`, but I can't remember the context now.
Thanks! Not a big deal if we have to push off this clean-up.
Yep, 2 tests would break with a `ZeroDivisionError` if we switch to `if chunks is None`. Specifically:

Related to my comment hidden in the mess above at #4187 (comment) 😄
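
(A plausible source of that error, though the failing line isn't quoted in this thread: chunk-alignment arithmetic of the form `dim_size % chunk_size` blows up once `chunks=0` slips past the guard.)

```python
# Illustrative only; not the actual failing line in the backend.
dim_size, chunk_size = 10, 0
dim_size % chunk_size  # ZeroDivisionError: integer modulo by zero
```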
Well, I would expect `test_manual_chunk` to fail here: it is explicitly verifying that `chunks=0` and `chunks={}` result in in-memory numpy arrays. Does it work if you remove those cases, e.g., by setting `NO_CHUNKS = (None,)`?
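
For reference, a simplified sketch of that parametrization (not the exact test from xarray's suite):

```python
import numpy as np
import pytest
import xarray as xr

# Simplified sketch of the test under discussion; the real test_manual_chunk
# in xarray's zarr tests is more involved.
NO_CHUNKS = (None, 0, {})  # the proposal above is to shrink this to (None,)

@pytest.mark.parametrize("chunks", NO_CHUNKS)
def test_manual_chunk(tmp_path, chunks):
    original = xr.Dataset({"x": ("t", np.arange(10))})
    original.to_zarr(tmp_path / "store.zarr")
    actual = xr.open_dataset(tmp_path / "store.zarr", engine="zarr", chunks=chunks)
    # Every value in NO_CHUNKS is expected to yield in-memory numpy arrays.
    assert isinstance(actual["x"].data, np.ndarray)
```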
Yes, setting `NO_CHUNKS = (None,)` works with `if chunks is None`. I'll make the change?
If it helps, by the way: `test_manual_chunk` with `NO_CHUNKS = (None, 0, {})` was added in #2530.
Yes, I think we can go ahead and change that. It doesn't look like that was carefully evaluated in #2530.
Note that this code was more or less copied from `open_zarr`. The biggest thing to note is that users switching from `open_zarr` to `open_dataset(..., engine="zarr")` will encounter a different default `chunks` setting:

- `open_zarr` defaults to `chunks="auto"`
- `open_dataset` defaults to `chunks=None`

What this means in practice is that, with the default `chunks=None` setting, data will be returned as numpy arrays instead of dask arrays (for users who didn't explicitly set a `chunks` argument). If I'm not mistaken, this is because the `try: import dask.array` block isn't executed, and so dask isn't used by default.
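
Concretely (a sketch; `store.zarr` and `var` are hypothetical placeholders for an existing store and one of its variables):

```python
import xarray as xr

# Hypothetical store path and variable name, for illustration.
ds_old = xr.open_zarr("store.zarr")                    # chunks="auto" by default
ds_new = xr.open_dataset("store.zarr", engine="zarr")  # chunks=None by default

type(ds_old["var"].data)   # dask array (lazy, chunked)
type(ds_new["var"].data)   # numpy.ndarray (loaded into memory on access)
```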
At one of our past meetings, the consensus was to keep `open_zarr` around with `chunks="auto"` as the default for backward compatibility. `open_dataset(engine="zarr")` should use the current default for the `chunks` kwarg.

With that in mind, can we avoid adding this stuff here?
Happy with this 👌

Do you mean removing the `if chunks == "auto": ...` code block? I think it should be kept here in `open_dataset` (instead of putting it in the `open_zarr` function), because users might want to do `open_mfdataset(engine="zarr", chunks="auto")`, and this would require the `chunks="auto"` logic to be in `open_dataset`.
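
That use case, for reference (store paths are hypothetical):

```python
import xarray as xr

# Opening several zarr stores at once; the chunks="auto" kwarg is handled
# inside open_dataset, which open_mfdataset calls once per store.
ds = xr.open_mfdataset(
    ["part1.zarr", "part2.zarr"],  # hypothetical store paths
    engine="zarr",
    chunks="auto",
    combine="by_coords",
)
```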