Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement Zarr v3 spec support #6475

Merged
merged 26 commits into from
Nov 27, 2022
Merged

implement Zarr v3 spec support #6475

merged 26 commits into from
Nov 27, 2022

Conversation

grlee77
Copy link
Contributor

@grlee77 grlee77 commented Apr 11, 2022

This is a WIP PR that is intended for use only with a development branch of Zarr (specifically zarr-developers/zarr-python#1006). I am using it to test the Zarr v3 spec support that is currently being added to zarr-python.

The primary changes needed were:

  • The v3 spec requires a path be specified when calling open_group or open_consolidated. This PR currently just sets a default group name of 'xarray' if one is not specified via the group kwarg to ZarrStore.open_group. I think that is convenient, but one could instead be stricter and raise an error in this case.
  • If a string corresponding to a filesystem path or URL is used for store, then it is not possible to infer which version of the zarr spec is desired. In this case, the user must specify zarr_version to choose the zarr protocol version. The default of zarr_version=None will infer the version from a zarr BaseStore subclass when possible, otherwise defaulting to zarr_version=2 for backwards compatibility.

The good news is that these changes are quite small overall. Most changed lines in the tests involve optionally passing zarr_version around so that we could test v3 support both with an explicit DirectoryStoreV3 store as well as with string-based paths.

Other points that need consideration in regards to the spec

  • a number of the tested data types including unicode strings, byte strings, complex floats, datetime arrays and structured arrays which are not part of the core v3 spec. We currently do implement these for the v3 spec in zarr-python in the same way they worked for v2, but the implementation is subject to change based on decisions around v3 protocol extensions related to these dtypes. A very rough initial draft of such extensions is at Add minimal drafts of a few more dtype extensions zarr-developers/zarr-specs#135.
  • dtype=str is used in some tests. Currently zarr-python uses a numcodecs filter VLenUTF8 in this case. The core zarr v3 spec no longer has a 'filter' entry as part of the metadata. A zarr v3 protocol extension needs to be defined to specify how this should be implemented. We do support this filter even for zarr v3 arrays currently, but it is done in a hacky way that needs to be standardized. This is the cause of the TODO comment around the call to attributes.pop('filters', None).

cc @joshmoore, @rabernat, @MSanKeys963

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@shoyer
Copy link
Member

shoyer commented Apr 13, 2022

  • The v3 spec requires a path be specified when calling open_group or open_consolidated. This PR currently just sets a default group name of 'xarray' if one is not specified via the group kwarg to ZarrStore.open_group. I think that is convenient, but one could instead be stricter and raise an error in this case.

Does Zarr v3 have a notion of a "root" group? That feels like a more sensible default to me, both for Xarray and Zarr-Python

  • If a string corresponding to a filesystem path or URL is used for store, then it is not possible to infer which version of the zarr spec is desired. In this case, the user must specify zarr_version to choose the zarr protocol version. The default of zarr_version=None will infer the version from a zarr BaseStore subclass when possible, otherwise defaulting to zarr_version=2 for backwards compatibility.

This sounds fine for now, but I am concerned that it will slow the adoption of Zarr v3. Eventually, we would presumably want to change the default to version 3, but this is difficult to do if it entirely breaks backwards compatibility.

My preference would be for the default behavior to try opening Zarr v2, and fall back to opening in v3 mode, even if this requires attempting to open a file from the store. This is similar to how Xarray handles other Zarr versioning issues (e.g., for consolidated metadata). Perhaps Zarr-Python could raise an informative error that we could catch if the Zarr version is incorrect, or even handle this behavior itself?

@grlee77
Copy link
Contributor Author

grlee77 commented Apr 14, 2022

Does Zarr v3 have a notion of a "root" group? That feels like a more sensible default to me, both for Xarray and Zarr-Python

I think we likely need to introduce a separate Hierarchy class as in the early zarrita python prototype and the xtensor-zarr C++ implementation to be able to access the root via public API. The concept of "hierarchy" as a collection of Nodes which are either Arrays or Groups is mentioned in the spec, but there is no corresponding class for this in zarr-python currently.

One issue with relying only on Array and Group as currently implemented in Zarr-Python is that we can create array nodes outside of any group subfolder. e.g. one can currently create an Array directly at path 'array1' and this would put the chunks under 'data/root/array1/', and metadata at 'meta/root/array1.array.json'. However, the root itself is not a Group. A group is basically a subfolder under root (e.g.' open_group with path = group1 creates '/meta/root/group1/' folder and 'meta/root/group1.group.json' metadata). There is no mechanism in the spec to open root directly as a Group!

This sounds fine for now, but I am concerned that it will slow the adoption of Zarr v3. Eventually, we would presumably want to change the default to version 3, but this is difficult to do if it entirely breaks backwards compatibility.

We did define DEFAULT_ZARR_VERSION=2 (privately). If we update that variable to 3 in a future release, the default when zarr_version is not specified will change.

My preference would be for the default behavior to try opening Zarr v2, and fall back to opening in v3 mode, even if this requires attempting to open a file from the store. This is similar to how Xarray handles other Zarr versioning issues (e.g., for consolidated metadata). Perhaps Zarr-Python could raise an informative error that we could catch if the Zarr version is incorrect, or even handle this behavior itself?

Yeah, something like this seems feasible on the Zarr side for convenience routines like open_group. The v3 spec requires zarr.json metadata that specifies the protocol version. If we traverse up the directory tree and do not find this file, then it is not a valid v3 or later spec and we can try opening it as v2.

@shoyer
Copy link
Member

shoyer commented Apr 14, 2022

One issue with relying only on Array and Group as currently implemented in Zarr-Python is that we can create array nodes outside of any group subfolder. e.g. one can currently create an Array directly at path 'array1' and this would put the chunks under 'data/root/array1/', and metadata at 'meta/root/array1.array.json'. However, the root itself is not a Group. A group is basically a subfolder under root (e.g.' open_group with path = group1 creates '/meta/root/group1/' folder and 'meta/root/group1.group.json' metadata). There is no mechanism in the spec to open root directly as a Group!

is there an issue on the Zarr side where this is currently being discussed?

@shoyer
Copy link
Member

shoyer commented May 25, 2022

One issue with relying only on Array and Group as currently implemented in Zarr-Python is that we can create array nodes outside of any group subfolder. e.g. one can currently create an Array directly at path 'array1' and this would put the chunks under 'data/root/array1/', and metadata at 'meta/root/array1.array.json'. However, the root itself is not a Group. A group is basically a subfolder under root (e.g.' open_group with path = group1 creates '/meta/root/group1/' folder and 'meta/root/group1.group.json' metadata). There is no mechanism in the spec to open root directly as a Group!

is there an issue on the Zarr side where this is currently being discussed?

I opened up zarr-developers/zarr-python#1039

In this case where create_zarr_target returns a string, we must
specify zarr_version=3 when opening/writing a store to make sure
a version 3 store will be created rather than the default of a
version 2 store.
remove path='xarray' default for zarr v3

path=None should work as of Zarr v2.13
@grlee77
Copy link
Contributor Author

grlee77 commented Sep 21, 2022

sorry about the long delay here. This has been updated for the V3 store paths used in Zarr >v2.12 and to remove the need for specifying path with v3 stores.

To do:

A separate issue is that consolidated metadata isn't in the core Zarr v3 spec, so we will need to have a Zarr Enhancement Proposal to formally define how the metadata should be stored. In the experimental API, it behaves as for v2 and is stored at /meta/root/consolidated by default.

@joshmoore
Copy link

wait for zarr v2.13 release,

Done. And should be out on conda-forge later today.

@jhamman
Copy link
Member

jhamman commented Oct 18, 2022

A separate issue is that consolidated metadata isn't in the core Zarr v3 spec, so we will need to have a Zarr Enhancement Proposal to formally define how the metadata should be stored. In the experimental API, it behaves as for v2 and is stored at /meta/root/consolidated by default.

I think it would be fine to disallow consolidated metadata for v3 until there is a spec in place. This is going to be experimental for some time so I don't see the harm in raising an error when consolidated=True and version=3. I think this is better than guessing what the v3 extension will specify.

@jhamman
Copy link
Member

jhamman commented Oct 27, 2022

@grlee77 - I'm curious if you are planning to return to this PR or if it would be helpful if someone brought it to completion?

@grlee77
Copy link
Contributor Author

grlee77 commented Oct 28, 2022

I am happy for someone to take over if possible. Thank you.

@jhamman jhamman marked this pull request as ready for review October 29, 2022 02:42
@github-actions github-actions bot added the Automation Github bots, testing workflows, release automation label Oct 29, 2022
@jhamman
Copy link
Member

jhamman commented Nov 5, 2022

@grlee77, @rabernat, @joshmoore, and others - I think this is ready to review and/or merge. The Zarr-V3 tests are active in the CI Upstream / upstream-dev GitHub Action. The test failure on readthedocs is unrelated to this PR.

Copy link

@joshmoore joshmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, @jhamman. 👍

doc/whats-new.rst Outdated Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
@jhamman jhamman mentioned this pull request Nov 18, 2022
1 task
@dcherian
Copy link
Contributor

dcherian commented Nov 18, 2022

RTD failure is real.

/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/6475/doc/whats-new.rst:28: WARNING: Duplicate explicit target name: "whats-new.2022.11.1".

Otherwise, is this ready to merge?

@jhamman
Copy link
Member

jhamman commented Nov 19, 2022

This is ready to merge once #7300 is in.

@headtr1ck headtr1ck added the plan to merge Final call for comments label Nov 19, 2022
@dcherian
Copy link
Contributor

Thanks @grlee77 and @jhamman !

@dcherian dcherian merged commit 9973b6e into pydata:main Nov 27, 2022
dcherian added a commit to dcherian/xarray that referenced this pull request Dec 2, 2022
* upstream/main: (39 commits)
  Support the new compression argument in netCDF4 > 1.6.0 (pydata#6981)
  Remove setuptools-scm-git-archive, require setuptools-scm>=7 (pydata#7253)
  Fix mypy failures (pydata#7343)
  Docs: add example of writing and reading groups to netcdf (pydata#7338)
  Reset file pointer to 0 when reading file stream (pydata#7304)
  Enable mypy warn unused ignores (pydata#7335)
  Optimize some copying (pydata#7209)
  Add parse_dims func (pydata#7051)
  Fix coordinate attr handling in `xr.where(..., keep_attrs=True)` (pydata#7229)
  Remove code used to support h5py<2.10.0 (pydata#7334)
  [pre-commit.ci] pre-commit autoupdate (pydata#7330)
  Fix PR number in what’s new (pydata#7331)
  Enable `origin` and `offset` arguments in `resample` (pydata#7284)
  fix doctests: supress urllib3 warning (pydata#7326)
  fix flake8 config (pydata#7321)
  implement Zarr v3 spec support (pydata#6475)
  Fix polyval overloads (pydata#7315)
  deprecate pynio backend (pydata#7301)
  mypy - Remove some ignored packages and modules (pydata#7319)
  Switch to T_DataArray in .coords (pydata#7285)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Automation Github bots, testing workflows, release automation io plan to merge Final call for comments topic-backends topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants