xarray.open_mzarr: open multiple zarr files (in parallel) #4003

Closed
wants to merge 58 commits
Changes from 39 commits
58 commits
f55ed1c
create def for multiple zarr files and added commentary/definition, w…
Apr 23, 2020
49f6512
just as with ``xr.open_mfdatasets``, identify the paths as local dire…
Apr 23, 2020
f35a3e5
added error if no path
Apr 23, 2020
9f728aa
finished copying similar code from `xr.open_mfdatasets`
Apr 23, 2020
8d0a844
remove blank lines
Apr 23, 2020
b3b0f1d
fixed typo
Apr 23, 2020
2221943
added ``xr.open_mzarr()`` to the list of available modules to call
Apr 23, 2020
ac35e7c
imported missing function
Apr 23, 2020
64654f3
imported missing glob
Apr 23, 2020
d5a5cef
imported function from backend.api
Apr 23, 2020
4c0ef19
imported function to facilitate mzarr
Apr 23, 2020
d158c21
correctly imported functions from core to mzarr
Apr 23, 2020
5171420
imported to use on open_mzarr
Apr 23, 2020
e1e51bb
removed lock and autoclose since not taken by ``open_zarr``
Apr 23, 2020
b6bf2cf
fixed typo
Apr 23, 2020
3bc4be8
class is not needed since zarr stores don't remain open
Apr 23, 2020
a79b125
removed old behavior
Apr 23, 2020
2d3bbb5
set default
Apr 23, 2020
f7cf580
listed open_mzarr
Apr 25, 2020
53c8623
removed unused imported function
Apr 25, 2020
34d755e
imported Path - hadn't before
Apr 25, 2020
b39b37e
remove unnecessary comments
Apr 25, 2020
276006a
modified comments
Apr 25, 2020
6f04be6
isorted zarr
Apr 25, 2020
aa97e1a
isorted
Apr 25, 2020
06de16a
erased open_mzarr. Added capability to open_dataset to open zarr files
Apr 28, 2020
f94fc9f
removed imported but unused
Apr 28, 2020
16e08e3
comment to `zarr` engine
Apr 28, 2020
22828fc
added chunking code from `open_zarr`
Apr 28, 2020
021f2cc
remove import ``open_mzarr``
Apr 28, 2020
985f28c
removed ``open_mzarr`` from top-level function
Apr 28, 2020
e8ed887
missing return in nested function
Apr 29, 2020
d693514
moved outside of nested function, had trouble with reading before assi…
Apr 29, 2020
df34f18
added missing argument associated with zarr stores, onto the definiti…
Apr 29, 2020
98351c7
isort zarr.py
Apr 29, 2020
160bd67
removed blank lines, fixed typo on `chunks`
Apr 29, 2020
7e57e9b
removed imported but unused
Apr 29, 2020
ac0f093
restored conditional for `auto`
Apr 29, 2020
6a1516c
removed imported but unused `dask.array`
Apr 29, 2020
8999faf
added capabilities for file_or_obj to be a mutablemapping such as `fss…
Apr 29, 2020
5df0985
moved to a different conditional since file_or_obj is a mutablemappin…
Apr 29, 2020
2d94ea2
isort api.py
Apr 29, 2020
377ef53
restored the option for when file_or_obj is a str, such as a URL.
Apr 29, 2020
f48c84b
fixed relabel
Apr 29, 2020
8376cca
update open_dataset for zarr files
Apr 29, 2020
aed1cc5
remove open_zarr from tests, now open_dataset(engine="zarr")
Apr 29, 2020
b488363
remove extra file, and raise deprecation warning on open_zarr
Apr 29, 2020
bae7f10
added internal call to open_dataset from deprecated open_zarr
Apr 29, 2020
37ff214
defined engine="zarr"
Apr 29, 2020
b8b98f5
correct argument for open_dataset
Apr 29, 2020
5c37329
pass arguments as backend_kwargs
Apr 29, 2020
831f15b
pass backend_kwargs as argument
Apr 29, 2020
80dd7da
typo
Apr 29, 2020
4ebf380
set ``overwrite_encoded_chunks`` as backend_kwargs
Apr 29, 2020
4ce3007
do not pass as backend, use for chunking
Apr 29, 2020
89a780b
removed commented code
May 22, 2020
6f6eb23
moved definitions to zarr backends
May 22, 2020
62893ab
Merge pull request #1 from Mikejmnez/new_branch
May 22, 2020
Binary file added default.profraw
143 changes: 110 additions & 33 deletions xarray/backends/api.py
@@ -303,6 +303,7 @@ def open_dataset(
drop_variables=None,
backend_kwargs=None,
use_cftime=None,
overwrite_encoded_chunks=False,
):
"""Open and decode a dataset from a file or file-like object.

@@ -345,7 +346,7 @@ def open_dataset(
If True, decode the 'coordinates' attribute to identify coordinates in
the resulting dataset.
engine : {'netcdf4', 'scipy', 'pydap', 'h5netcdf', 'pynio', 'cfgrib', \
'pseudonetcdf'}, optional
'pseudonetcdf', 'zarr'}, optional
Engine to use when reading files. If not provided, the default engine
is chosen based on available dependencies, with a preference for
'netcdf4'.
@@ -383,6 +384,10 @@ def open_dataset(
represented using ``np.datetime64[ns]`` objects. If False, always
decode times to ``np.datetime64[ns]`` objects; if this is not possible
raise an error.
overwrite_encoded_chunks : bool, optional
Whether to drop the zarr chunks encoded for each variable when a
dataset is loaded with specified chunk sizes (default: False)
Member: Since this only applies to the zarr backend, would it make sense to pass this through backend_kwargs?

Contributor Author: You are right, it does make sense not to make it an argument for open_dataset, and instead pass it to open_dataset through backend_kwargs from the deprecated open_zarr (i.e. zarr.py). However, it is not an argument for ZarrStore.open_group; it is used within maybe_chunk. I had to pop it and use it before passing backend_kwargs to open_group in zarr.py.

Member: Could we add overwrite_encoded_chunks as a new argument in ZarrStore.open_group? That seems preferable to adding another highly backend-specific argument to open_dataset.

(To be honest, I'm not entirely sure why we need overwrite_encoded_chunks... it's not obvious from the description why it's useful)

Contributor: Quoting from #2530 (comment), where overwrite_encoded_chunks was initially added.

It seems reasonable that anyone manually specifying chunks may want to rewrite the dataset in those chunks, and the error that arises when the encoded Zarr chunks mismatch the variable Dask chunks may quickly get annoying. overwrite_encoded_chunks=True sets the encoded chunks to None so there is no clash.

As mentioned by @Mikejmnez, overwrite_encoded_chunks is not actually used by ZarrStore.open_group; it is used at the open_dataset level. So it's not strictly a 'backend_kwarg', but we can still put it in there, I suppose? Just looking through the code now to see how best to do it.
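A minimal sketch of the call pattern this thread converges on, assuming ``overwrite_encoded_chunks`` is routed through ``backend_kwargs`` and popped before the remaining kwargs reach ``ZarrStore.open_group`` (store path and chunk sizes are hypothetical):

    import xarray as xr

    # The zarr-only option rides along in backend_kwargs rather than
    # being a top-level open_dataset argument.
    ds = xr.open_dataset(
        "store.zarr",
        engine="zarr",
        chunks={"time": 10},
        backend_kwargs={"overwrite_encoded_chunks": True},
    )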



Returns
-------
@@ -409,7 +414,9 @@ def open_dataset(
"pynio",
"cfgrib",
"pseudonetcdf",
"zarr",
]

if engine not in engines:
raise ValueError(
"unrecognized engine for open_dataset: {}\n"
@@ -455,36 +462,7 @@ def maybe_decode_store(store, lock=False):

_protect_dataset_variables_inplace(ds, cache)

if chunks is not None:
from dask.base import tokenize

# if passed an actual file path, augment the token with
# the file modification time
if isinstance(filename_or_obj, str) and not is_remote_uri(filename_or_obj):
mtime = os.path.getmtime(filename_or_obj)
else:
mtime = None
token = tokenize(
filename_or_obj,
mtime,
group,
decode_cf,
mask_and_scale,
decode_times,
concat_characters,
decode_coords,
engine,
chunks,
drop_variables,
use_cftime,
)
name_prefix = "open_dataset-%s" % token
ds2 = ds.chunk(chunks, name_prefix=name_prefix, token=token)
ds2._file_obj = ds._file_obj
else:
ds2 = ds

return ds2
return ds

if isinstance(filename_or_obj, Path):
filename_or_obj = str(filename_or_obj)
@@ -519,7 +497,14 @@ def maybe_decode_store(store, lock=False):
store = backends.CfGribDataStore(
filename_or_obj, lock=lock, **backend_kwargs
)

elif engine == "zarr":
# on ZarrStore, mode='r', synchronizer=None, group=None,
# consolidated=False.
store = backends.ZarrStore.open_group(
filename_or_obj,
group=group,
**backend_kwargs
)
else:
if engine not in [None, "scipy", "h5netcdf"]:
raise ValueError(
@@ -542,7 +527,99 @@ def maybe_decode_store(store, lock=False):
if isinstance(filename_or_obj, str):
ds.encoding["source"] = filename_or_obj

return ds
if chunks is not None:
from dask.base import tokenize
if engine != 'zarr':

# if passed an actual file path, augment the token with
# the file modification time
if isinstance(filename_or_obj, str) and not is_remote_uri(filename_or_obj):
mtime = os.path.getmtime(filename_or_obj)
else:
mtime = None
token = tokenize(
filename_or_obj,
mtime,
group,
decode_cf,
mask_and_scale,
decode_times,
concat_characters,
decode_coords,
engine,
chunks,
drop_variables,
use_cftime,
)
name_prefix = "open_dataset-%s" % token
ds2 = ds.chunk(chunks, name_prefix=name_prefix, token=token)
ds2._file_obj = ds._file_obj

else: # file is zarr!

# adapted from Dataset.Chunk() and taken from open_zarr
if not isinstance(chunks, (int, dict)):
if chunks != "auto":
raise ValueError(
"chunks must be an int, dict, 'auto', or None. "
"Instead found %s. " % chunks
)
if isinstance(chunks, int):
chunks = dict.fromkeys(ds.dims, chunks)

if isinstance(chunks, tuple) and len(chunks) == len(ds.dims):
chunks = dict(zip(ds.dims, chunks))
Member: I don't think this line is reachable. How could chunks ever be a tuple when line 561 checks against (int, dict)?

Contributor Author: Yeah, it doesn't look like it's reachable. I'll remove it.
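A standalone sketch of the reviewer's point, using the guard from the diff above (the tuple value is illustrative): a tuple fails the ``isinstance`` check and is not ``"auto"``, so it raises before the later ``isinstance(chunks, tuple)`` branch could ever run.

    chunks = (5, 10)  # hypothetical tuple input
    if not isinstance(chunks, (int, dict)):
        if chunks != "auto":
            # A tuple always lands here, so the tuple-handling branch
            # further down is dead code.
            raise ValueError(
                "chunks must be an int, dict, 'auto', or None. "
                "Instead found %s. " % (chunks,)
            )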


def get_chunk(name, var, chunks):
chunk_spec = dict(zip(var.dims, var.encoding.get("chunks")))

# Coordinate labels aren't chunked
if var.ndim == 1 and var.dims[0] == name:
return chunk_spec

if chunks == "auto":
return chunk_spec

for dim in var.dims:
if dim in chunks:
spec = chunks[dim]
if isinstance(spec, int):
spec = (spec,)
if isinstance(spec, (tuple, list)) and chunk_spec[dim]:
if any(s % chunk_spec[dim] for s in spec):
warnings.warn(
"Specified Dask chunks %r would "
"separate Zarr chunk shape %r for "
"dimension %r. This significantly "
"degrades performance. Consider "
"rechunking after loading instead."
% (chunks[dim], chunk_spec[dim], dim),
stacklevel=2,
)
chunk_spec[dim] = chunks[dim]
return chunk_spec

def maybe_chunk(name, var, chunks):
Member: I think I have a slight preference to relocate maybe_chunk and get_chunk to the zarr.py module. Just skimming here, but it doesn't seem like they are using local function state.

Contributor Author: It does make sense, and it would make the code cleaner and easier to read and/or debug. Do you think it should be implemented within ZarrStore.open_group?

chunk_spec = get_chunk(name, var, chunks)

if (var.ndim > 0) and (chunk_spec is not None):
# does this cause any data to be read?
token2 = tokenize(name, var._data)
name2 = "zarr-%s" % token2
var = var.chunk(chunk_spec, name=name2, lock=None)
if overwrite_encoded_chunks and var.chunks is not None:
var.encoding["chunks"] = tuple(x[0] for x in var.chunks)
return var
else:
return var

variables = {k: maybe_chunk(k, v, chunks) for k, v in ds.variables.items()}
ds2 = ds._replace_vars_and_dims(variables)
return ds2
else:
ds2 = ds

return ds2
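As an aside on the alignment warning in ``get_chunk`` above, a hedged example (store path and chunk sizes hypothetical): if a variable is encoded with zarr chunks of 10 along ``x``, requesting dask chunks of 7 triggers the warning, since ``7 % 10 != 0`` and the requested chunks would split the stored ones.

    import xarray as xr

    # Encoded zarr chunks along "x" are 10 in this hypothetical store;
    # 7 is not a multiple, so opening warns that the specified dask
    # chunks would separate zarr chunks and degrade performance.
    ds = xr.open_dataset("store.zarr", engine="zarr", chunks={"x": 7})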
Member: Since we return in the block above, I think it would be okay to also return in line 620.

Contributor Author: This part I am not so sure I completely understand. ds2 gets created if chunks is not None. If chunks is None then ds is passed, so we need ds2 = ds.

This brings up a question: can we just have

ds = ds.chunk(chunks, ...)

instead of

ds2 = ds.chunk(chunks, ...)
ds2._file_obj = ds._file_obj

That way I could just return ds on line 620.
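A minimal sketch of the rebinding the author is proposing, assuming the file handle still has to be carried across ``Dataset.chunk`` (names taken from the diff above):

    # Chunk and rebind to `ds`, preserving the underlying file handle,
    # so the function can end with a single `return ds`.
    file_obj = ds._file_obj
    ds = ds.chunk(chunks, name_prefix=name_prefix, token=token)
    ds._file_obj = file_obj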



def open_dataarray(