Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

open_groups for zarr backends #9469

Merged
merged 8 commits into from
Sep 12, 2024
Merged

Conversation

eni-awowale
Copy link
Collaborator

@eni-awowale eni-awowale commented Sep 9, 2024

Part of #8572

xarray/backends/zarr.py Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/tests/test_backends_datatree.py Show resolved Hide resolved
@eni-awowale eni-awowale marked this pull request as ready for review September 9, 2024 23:13
@eni-awowale
Copy link
Collaborator Author

@TomNicholas TomNicholas added topic-backends topic-zarr Related to zarr storage library topic-DataTree Related to the implementation of a DataTree class labels Sep 10, 2024
roundtrip_dict = open_groups(filepath, engine="zarr")
roundtrip_dt = DataTree.from_dict(roundtrip_dict)

assert open_datatree(filepath, engine="zarr").identical(roundtrip_dt)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.identical is going to be different to check if the coordinates are defined on disk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically see #9473.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so from looking at this and testing this locally I don't think this will affect these tests.

xarray/backends/zarr.py Show resolved Hide resolved
group_name = str(NodePath(path_group))
groups_dict[group_name] = group_ds

return groups_dict


def _iter_zarr_groups(root, parent="/"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to add type hints (though I don't know what the type of root is)

Suggested change
def _iter_zarr_groups(root, parent="/"):
def _iter_zarr_groups(root: ZarrGroup?, parent: str = "/") -> Iterable[str]:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what the appropriate TypeHint would be here that wouldn't break mypy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a play with this. I traced root back and found it is indeed a zarr.Group, so I think the type hint above works (along with adding from zarr import Group as ZarrGroup in the if TYPE_CHECKING block at the top of the file).

The bit that was causing an issue the assignment of a NodePath object to parent, which is a str from the function signature. (Apologies if I'm just playing catch up here from what you guys have been saying all along)

I could make mypy happy if I did the following in the function:

def _iter_zarr_groups(root: ZarrGroup, parent: str = "/") -> Iterable[str]:
    from xarray.core.treenode import NodePath

    parent_nodepath = NodePath(parent)
    yield str(parent_nodepath)
    for path, group in root.groups():
        gpath = parent_nodepath / path
        yield str(gpath)
        yield from _iter_zarr_groups(group, parent=gpath)

Maybe that's an option? (Unless I'm missing a trick with this breaking the recursion?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

(No need for from xarray.core.treenode import NodePath to live inside the function though)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Owen! That did the trick!

Comment on lines 160 to 161
with pytest.raises(ValueError):
open_datatree(filepath)
open_datatree(unaligned_datatree_nc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely use

pytest.raises(ValueError, match=...):

to match a much more specific error. We are expecting this test to fail because of an alignment error, not any other type of ValueError.

(Maybe we should create a new exception AlignmentError(ValueError), given that we do already have MergeError(ValueError)? @shoyer)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this but in the treenode.py module?

class AlignmentError(ValueError):
    """Raised when a child node has coordinates that are not aligned with its parents."""

If we do this I think we might have to change datatree.py to raise AlignmentError in
`_check_alignment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alignment is a core xarray concept separate of datatree. My suggestion was to change the ValueErrors raised in alignment.py to become AlignmentErrors, then datatree.py would automatically throw AlignmentError when _check_alignment is called (because it calls xr.align internally).

This suggestion is outside the scope of this PR though - you should go ahead with just using match to match a specific ValueError and then we can maybe do AlignmentError in a separate PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in the last commit 🚀

@TomNicholas TomNicholas added the plan to merge Final call for comments label Sep 11, 2024
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eni-awowale !

Copy link
Contributor

@owenlittlejohns owenlittlejohns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Nice work!

I looked at the type hint stuff, and might have an answer 🤞.

group_name = str(NodePath(path_group))
groups_dict[group_name] = group_ds

return groups_dict


def _iter_zarr_groups(root, parent="/"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a play with this. I traced root back and found it is indeed a zarr.Group, so I think the type hint above works (along with adding from zarr import Group as ZarrGroup in the if TYPE_CHECKING block at the top of the file).

The bit that was causing an issue the assignment of a NodePath object to parent, which is a str from the function signature. (Apologies if I'm just playing catch up here from what you guys have been saying all along)

I could make mypy happy if I did the following in the function:

def _iter_zarr_groups(root: ZarrGroup, parent: str = "/") -> Iterable[str]:
    from xarray.core.treenode import NodePath

    parent_nodepath = NodePath(parent)
    yield str(parent_nodepath)
    for path, group in root.groups():
        gpath = parent_nodepath / path
        yield str(gpath)
        yield from _iter_zarr_groups(group, parent=gpath)

Maybe that's an option? (Unless I'm missing a trick with this breaking the recursion?)

subgroup1_var[:] = np.array([[0.1, 0.2]])
with pytest.raises(
ValueError,
match=(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! (I know there's already been a conversation on using the match kwarg, but wanted to 👍. This is definitely a good way to go to make sure the right ValueError is being raised.)

xarray/tests/test_backends_datatree.py Show resolved Hide resolved
+ ".*"
),
):
open_datatree(unaligned_datatree_nc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the order of merging to expect, but if @flamingbear merges #9033 first, then we can clean up the imports and use xr.open_datatree. If it's the other way around, then it'll probably be best for one of us to take a quick skim over the tests and tidy them up in a separate PR.

My guess is that this PR will merge before #9033, so we'll likely end up with a separate tiny PR for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tidying up these imports isn't a big deal, and can be left for another PR

@TomNicholas TomNicholas merged commit aeaa082 into pydata:main Sep 12, 2024
28 checks passed
dcherian added a commit to dcherian/xarray that referenced this pull request Sep 17, 2024
* main: (26 commits)
  Forbid modifying names of DataTree objects with parents (pydata#9494)
  DAS-2155 - Merge datatree documentation into main docs. (pydata#9033)
  Make illegal path-like variable names when constructing a DataTree from a Dataset (pydata#9378)
  Ensure TreeNode doesn't copy in-place (pydata#9482)
  `open_groups` for zarr backends (pydata#9469)
  Update pyproject.toml (pydata#9484)
  New whatsnew section (pydata#9483)
  Release notes for v2024.09.0 (pydata#9480)
  Fix `DataTree.coords.__setitem__` by adding `DataTreeCoordinates` class (pydata#9451)
  Rename DataTree's "ds" and "data" to "dataset" (pydata#9476)
  Update DataTree repr to indicate inheritance (pydata#9470)
  Bump pypa/gh-action-pypi-publish in the actions group (pydata#9460)
  Repo checker (pydata#9450)
  Add days_in_year and decimal_year to dt accessor (pydata#9105)
  remove parent argument from DataTree.__init__ (pydata#9465)
  Fix inheritance in DataTree.copy() (pydata#9457)
  Implement `DataTree.__delitem__` (pydata#9453)
  Add ASV for datatree.from_dict (pydata#9459)
  Make the first argument in DataTree.from_dict positional only (pydata#9446)
  Fix typos across the code, doc and comments (pydata#9443)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Sep 17, 2024
* main:
  Opt out of floor division for float dtype time encoding (pydata#9497)
  fixed formatting for whats-new (pydata#9493)
  Forbid modifying names of DataTree objects with parents (pydata#9494)
  DAS-2155 - Merge datatree documentation into main docs. (pydata#9033)
  Make illegal path-like variable names when constructing a DataTree from a Dataset (pydata#9378)
  Ensure TreeNode doesn't copy in-place (pydata#9482)
  `open_groups` for zarr backends (pydata#9469)
  Update pyproject.toml (pydata#9484)
  New whatsnew section (pydata#9483)
Comment on lines +165 to +167
re.escape(
"group '/Group1/subgroup1' is not aligned with its parents:\nGroup:\n"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this could be just an r string? Something like

Suggested change
re.escape(
"group '/Group1/subgroup1' is not aligned with its parents:\nGroup:\n"
)
r"group '/Group1/subgroup1' is not aligned with its parents:[\n]Group:[\n].*"

hollymandel pushed a commit to hollymandel/xarray that referenced this pull request Sep 23, 2024
* open groups zarr initial commit

* added tests

* Added requested changes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* TypeHint for zarr groups

* update for parent

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments topic-backends topic-DataTree Related to the implementation of a DataTree class topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

open_groups function to open any zarr file containing nested groups
4 participants