xarray.open_mzar: open multiple zarr files (in parallel) #4003
Conversation
…hich matches almost exactly that of ``xr.open_mfdatasets``, but withou ``engine``
…ctory paths/strings
Hello @Mikejmnez! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-05-22 16:45:34 UTC
I think the better way to do this would be to add a kwarg to … This would then delegate to … cc @pydata/xarray for thoughts.
+1 for having …
I like this approach (adding the capability for open_mfdataset to open multiple zarr files), as it is the easiest and cleanest. I considered it, and I am glad this is coming up because I wanted to know different opinions. Two things influenced my decision to have …

For netcdf files, `url_path = fsspec.open_local(paths, *kwargs)`, which can interpret a glob path. Then for zarr files, `url_path = fsspec.mapper(paths, *kwargs)`.

I am extremely interested in what people think about …
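As a rough sketch of the zarr-style access pattern discussed above, using fsspec's in-memory filesystem so it runs without any real files (the store name `memory://demo-store` and the key written are made up for illustration; note that the current fsspec spelling of the mapper helper is `fsspec.get_mapper`):

```python
import fsspec

# A zarr store is accessed as a key/value MutableMapping over a
# location; fsspec.get_mapper builds that mapping for any protocol.
mapper = fsspec.get_mapper("memory://demo-store")
mapper[".zgroup"] = b'{"zarr_format": 2}'  # write one key into the store
print(sorted(mapper))  # the mapping exposes the store's keys
```

Because the result is an ordinary `collections.abc.MutableMapping`, anything that accepts a mapping (such as an `open_dataset` code path keyed on `MutableMapping`) can consume it.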
moved definitions to zarr backends
Hi, first of all, thanks for this! I've been using an earlier version of this branch for a couple of weeks now (at 4ce3007) which works fantastically. Just noticed that the newer version at 62893ab was broken though. This is my 'bad' attempt at helping move things forward a bit. Fingers crossed that this gets into the next xarray release 😄
```python
if isinstance(chunks, int):
    chunks = dict.fromkeys(ds.dims, chunks)

variables = {k: backends.ZarrStore.open_group.maybe_chunk(k, v, chunks) for k, v in ds.variables.items()}
```
Suggested change:

```diff
-variables = {k: backends.ZarrStore.open_group.maybe_chunk(k, v, chunks) for k, v in ds.variables.items()}
+variables = {k: store.maybe_chunk(k, v, chunks, overwrite_encoded_chunks) for k, v in ds.variables.items()}
```
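The integer-`chunks` normalization in the fragment above can be illustrated standalone (the dimension names here are hypothetical, standing in for `ds.dims`):

```python
# Expand a single integer chunk size into a per-dimension mapping,
# as the snippet above does with ds.dims.
dims = ("time", "lat", "lon")  # hypothetical dimension names
chunks = 10
if isinstance(chunks, int):
    chunks = dict.fromkeys(dims, chunks)
print(chunks)  # -> {'time': 10, 'lat': 10, 'lon': 10}
```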
```diff
@@ -356,6 +358,48 @@ def encode_variable(self, variable):
     def encode_attribute(self, a):
         return encode_zarr_attr_value(a)

     def get_chunk(name, var, chunks):
```
Suggested change:

```diff
-def get_chunk(name, var, chunks):
+def get_chunk(self, name, var, chunks):
```
```python
        chunk_spec[dim] = chunks[dim]
    return chunk_spec

def maybe_chunk(name, var, chunks):
```
Suggested change:

```diff
-def maybe_chunk(name, var, chunks):
+def maybe_chunk(self, name, var, chunks, overwrite_encoded_chunks):
```
```python
if (var.ndim > 0) and (chunk_spec is not None):
    # does this cause any data to be read?
    token2 = tokenize(name, var._data)
```
Suggested change:

```diff
 if (var.ndim > 0) and (chunk_spec is not None):
+    from dask.base import tokenize
     # does this cause any data to be read?
     token2 = tokenize(name, var._data)
```
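For context, `dask.base.tokenize` deterministically hashes its arguments, so the same variable name and data always yield the same dask graph key. A stdlib stand-in for the idea (this is not dask's implementation, just a sketch of the contract):

```python
import hashlib

def tokenize(*args):
    # Deterministic token from the arguments' repr, mimicking the
    # role dask.base.tokenize plays in the snippet above.
    return hashlib.sha256(repr(args).encode()).hexdigest()

t1 = tokenize("temperature", (100, 200))
t2 = tokenize("temperature", (100, 200))
print(t1 == t2)  # -> True: identical inputs produce identical tokens
```

Deterministic tokens matter here because re-opening the same zarr variable should rebuild the same dask graph rather than duplicate it.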
```python
if isinstance(filename_or_obj, MutableMapping):
    if engine == 'zarr':
```
These conditions should be combined into a single if -- otherwise somebody could provide filename_or_obj as a mutable mapping with a different engine, which could result in undefined behavior.
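A minimal sketch of the combined check being suggested (the function name is hypothetical, for illustration only):

```python
from collections.abc import MutableMapping

def is_zarr_store_mapping(filename_or_obj, engine):
    # Treat the input as a zarr store mapping only when BOTH
    # conditions hold, so a mapping passed with a non-zarr engine
    # can be rejected explicitly instead of falling through to
    # undefined behavior.
    return isinstance(filename_or_obj, MutableMapping) and engine == "zarr"

print(is_zarr_store_mapping({}, "zarr"))     # -> True
print(is_zarr_store_mapping({}, "netcdf4"))  # -> False
```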
```python
if engine == 'zarr':
    # on ZarrStore, mode='r', synchronizer=None, group=None,
    # consolidated=False.
    overwrite_encoded_chunks = backend_kwargs.pop("overwrite_encoded_chunks", None)
```
`pop` modifies a dict in place. It's best to make a copy of `backend_kwargs` first, to ensure that objects users supply as arguments to xarray functions aren't unexpectedly modified.
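The fix being asked for can be sketched as follows (the helper name is hypothetical; the point is the shallow copy before the `pop`):

```python
def extract_zarr_kwargs(backend_kwargs=None):
    # Shallow-copy first so the caller's dict is never mutated by pop.
    kwargs = {} if backend_kwargs is None else dict(backend_kwargs)
    overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
    return overwrite_encoded_chunks, kwargs

user_kwargs = {"overwrite_encoded_chunks": True, "consolidated": False}
flag, remaining = extract_zarr_kwargs(user_kwargs)
print(flag)                                        # -> True
print("overwrite_encoded_chunks" in user_kwargs)   # -> True: caller's dict untouched
```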
```diff
@@ -383,6 +384,10 @@ def open_dataset(
         represented using ``np.datetime64[ns]`` objects. If False, always
         decode times to ``np.datetime64[ns]`` objects; if this is not possible
         raise an error.
+    overwrite_encoded_chunks: bool, optional
+        Whether to drop the zarr chunks encoded for each variable when a
+        dataset is loaded with specified chunk sizes (default: False)
```
Could we add `overwrite_encoded_chunks` as a new argument in `ZarrStore.open_group`? That seems preferable to adding another highly backend-specific method onto `open_dataset`.

(To be honest, I'm not entirely sure why we need `overwrite_encoded_chunks`... it's not obvious from the description why it's useful)
@Mikejmnez, do you mind if I pick up working on this branch? I'd be really keen to see it get into xarray 0.16, and then it will be possible to resolve the intake-xarray issue at intake/intake-xarray#70.
Don't pop the backend_kwargs dict as per pydata#4003 (comment), make a shallow copy of the backend_kwargs dictionary first. Also removed `overwrite_encoded_chunks` as a top level kwarg of `open_dataset`. Instead, pass it to `backend_kwargs` when using engine="zarr".
@weiji14 could you kindly reopen your new pull request against the main xarray repository? Your pull request is currently in Mikejmnez/xarray
Sure, I can move it, but I just wanted to make sure @Mikejmnez gets the credit for this PR. Edit: moved to #4187.
Yes, absolutely! As long as you preserve his original commits and add yours on top of them, both of you will be credited in the Git history. If you're writing a release note in …
@weiji14 @shoyer Thank you guys! Sorry it has taken me so long to come back to this PR - I really meant to come back to it, but I got stuck with another bigger PR that is actually part of my main research project. Anyways, much appreciated for the help, cheers!!

don't worry about that: we can close this PR when we merge #4187
* create def for multiple zarr files and added commentary/definition, which matches almost exactly that of ``xr.open_mfdatasets``, but withou ``engine``
* just as with ``xr.open_mfdatasets``, identify the paths as local directory paths/strings
* added error if no path
* finished copying similar code from `xr.open_mfdatasets`
* remove blank lines
* fixed typo
* added ``xr.open_mzarr()`` to the list of available modules to call
* imported missing function
* imported missing glob
* imported function from backend.api
* imported function to facilitate mzarr
* correctly imported functions from core to mzarr
* imported to use on open_mzarr
* removed lock and autoclose since not taken by ``open_zarr``
* fixed typo
* class is not needed since zarr stores don`t remain open
* removed old behavior
* set default
* listed open_mzarr
* removed unused imported function
* imported Path - hadn`t before
* remove unncessesary comments
* modified comments
* isorted zarr
* isorted
* erased open_mzarr. Added capability to open_dataset to open zarr files
* removed imported but unused
* comment to `zarr` engine
* added chunking code from `open_zarr`
* remove import `open_mzarr``
* removed `open_mzarr`` from top-level-function
* missing return in nested function
* moved outside of nested function, had touble with reading before assignement
* added missing argument associated with zarr stores, onto the definition of open_dataset
* isort zarr.py
* removed blank lines, fixed typo on `chunks`
* removed imported but unused
* restored conditional for `auto`
* removed imported but unused `dask.array`
* added capabilities for file_or_obj to be a mutablemapper such as `fsspec.get_mapper`, and thus compatible with `intake-xarray`
* moved to a different conditional since file_or_obj is a mutablemapping, not a str, path or AbstractDataStore
* isort api.py
* restored the option for when file_or_obk is a str, such as an url.
* fixed relabel
* update open_dataset for zarr files
* remove open_zarr from tests, now open_dataset(engine=`zarr`)
* remove extra file, and raise deprecating warning on open_zarr
* added internal call to open_dataset from depricated open_zarr
* defined engine=`zarr`
* correct argument for open_dataset
* pass arguments as backend_kwargs
* pass backend_kwargs as argument
* typo
* set `overwrite_enconded_chunks as backend_kwargs
* do not pass as backend, use for chunking
* removed commented code
* moved definitions to zarr backends
* Ensure class functions have necessary variables. Was missing some 'self' and other kwarg variables. Also linted using black.
* Combine MutableMapping and Zarr engine condition. As per #4003 (comment).
* Pop out overwrite_encoded_chunks after shallow copy backend_kwargs dict. Don't pop the backend_kwargs dict as per #4003 (comment), make a shallow copy of the backend_kwargs dictionary first. Also removed `overwrite_encoded_chunks` as a top level kwarg of `open_dataset`. Instead, pass it to `backend_kwargs` when using engine="zarr".
* Fix some errors noticed by PEP8
* Reorganize code in backends api.py and actually test using engine zarr. Merge at 1977ba1 wasn't done very well. Reorganized the logic of the code to reduce the diff with xarray master, and ensure that the zarr backend tests actually have engine="zarr" in them.
* Add back missing decode_timedelta kwarg
* Add back a missing engine="zarr" to test_distributed.py
* Ensure conditional statements make sense
* Fix UnboundLocalError on 'chunks' referenced before assignment. Need to pass in chunks to maybe_decode_store, to resolve UnboundLocalError: local variable 'chunks' referenced before assignment.
* Run isort to fix import order
* Fix tests where kwargs needs to be inside of backend_kwargs dict now. Also temporarily silence deprecate_auto_chunk tests using pytest.raises(TypeError). May remove those fully later.
* Change open_zarr to open_dataset with engine="zarr" in io.rst
* Fix test_distributed by wrapping consolidated in backend_kwargs dict. Patches cb6d066.
* Ensure read-only mode when using open_dataset with engine="zarr"
* Turn chunks from "auto" to None if dask is not available
* Add back a missing else statement in maybe_chunk
* Allow xfail test_vectorized_indexing when has_dask. Instead of when not has_dask.
* Typo on chunks arg in open_dataset
* Fix ZeroDivisionError by adding back check that chunks is not False. Yet another if-statement that wasn't properly transferred from zarr.py to api.py.
* Fix a typo that was causing TypeError: 'method' object is not iterable
* Move the `if not chunks` block to after auto detect. Patches logic of 6fbeadf to fix errors when Dask is not installed.
* Revert "Allow xfail test_vectorized_indexing when has_dask". This reverts commit aca2012.
* Temporarily xfail test_vectorized_indexing with or without dask
* Put zarr in open_mfdataset engine list
* Test open_mfdataset_manyfiles with engine zarr. Zarr objects are folders which seem to cause issues with closing, so added a try-except to api.py to catch failures in f.close(). Some tests failing when chunks=None because a numpy array is returned instead of a dask array.
* Remember to set a ._file_obj when using Zarr. Yet another logic error fixed, resolves the try-except hack in b9a239e.
* Expect np.ndarray when using open_mfdataset on Zarr with chunks None
* Add an entry to what's new for open_mfdataset with Zarr engine. Plus a small formatting fix in open_mfdataset docstring
* Make zarr engine's custom chunk mechanism more in line with ds.chunk. Slightly edited the token name string to start with 'xarray' and include chunks in tokenize. Also replace the deprecated `_replace_vars_and_dims` method with just `_replace`.
* Workaround problem where dask arrays aren't returned when chunks is None. Revert 827e546 and workaround to get dask arrays by fixing some if-then logic in the code when `engine="zarr"` is involved. Things work fine when using chunks="auto", perhaps because the try `import dask.array` is needed to trigger loading into dask arrays? Also removed using chunks="auto" in some Zarr tests to simplify.
* Default to chunks="auto" for Zarr tests to fix test_vectorized_indexing. Revert hack in 6b99225 as test_vectorized_indexing now works on dask, specifically the negative slices test. It will still fail without dask, as was the behaviour before. Solution was to set `chunks="auto"` as the default when testing using `open_dataset` with `engine="zarr"`, similar to the default for `open_zarr`. Reverted some aspects of dce4e7c to ensure this `chunks="auto"` setting is visible throughout the Zarr test suite.
* Fix test by passing in chunk_store to backend_kwargs
* Revert "Change open_zarr to open_dataset with engine="zarr" in io.rst". This reverts commit cd0b9ef.
* Remove open_zarr DeprecationWarning. Partially reverts b488363.
* Update open_dataset docstring to specify chunk options for zarr engine
* Let only chunks = None return non-chunked arrays
* Remove for-loop in test_manual_chunk since testing only one no_chunk
* Update open_dataset docstring to remove mention of chunks=None with Zarr

Co-authored-by: Miguel Jimenez-Urias <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
This is associated with #3995 and somewhat mentioned in #3668. Emulating `xarray.open_mfdataset`, it allows xarray to read multiple zarr files from a glob or a list of paths pointing to a zarr store. It follows the code for `xarray.open_mfdataset`, and the description of the function is the same, as they do the same thing but for their respective file types.