Add zip_subtrees for paired iteration over DataTrees #9623

shoyer · 2024-10-15T06:49:36Z

This should be used for implementing DataTree arithmetic inside map_over_datasets, so the result does not depend on the order in which child nodes are defined.

I have also added a minimal implementation of breadth-first search with an explicit queue, replacing the current recursion-based solution in xarray.core.iterators (which has been removed). The new implementation is also slightly faster in my microbenchmark:

In [1]: import xarray as xr

In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})

In [3]: %timeit _ = list(tree.subtree)
# on main
87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# with this branch
55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Tests added


In contrast to `equals`, `identical` now also checks that any inherited variables are inherited on both objects. However, they do not need to be inherited from the same source. This aligns the behavior of `identical` with the DataTree `__repr__`. I've also removed the `from_root` argument from `equals` and `identical`. If a user wants to compare trees from their roots, a better (simpler) interface is to call these methods on the `.root` properties. I would also like to remove the `strict_names` argument, but that will require switching to use the new `zip_subtrees` (pydata#9623) first.

shoyer · 2024-10-15T16:04:35Z

I made a pass at re-implementing map_over_datasets using zip_subtrees in shoyer#2.

headtr1ck

Only some minor typing remarks, the rest looks good!

xarray/tests/test_treenode.py

headtr1ck · 2024-10-15T16:14:41Z

xarray/tests/test_treenode.py

        assert result == expected

+    def test_different_order(self):
+        first: NamedNode = NamedNode(


Why are these additional type hints required?
Can mypy not resolve it?

I think this should be fixed by changing the class definition to NamedNode(TreeNode[Tree])

mypy does seem to struggle here -- it raises an error about missing type annotations.

I have not been able to precisely reproduce the mypy setup from CI on my local machine, so I'm going to save this existing issue for someone else to look into.

I can only see the issue of not passing the generic type to the parent class, as I said before. But of course, we can keep this open for now.
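For readers following the typing thread, here is a minimal sketch of the class definition being suggested, where the subclass passes the type variable through to its generic parent. The classes below are simplified stand-ins, not xarray's actual TreeNode and NamedNode, and whether this change removes the need for the extra annotations under mypy is not verified here.

```python
from __future__ import annotations

from typing import Generic, TypeVar

# Stand-in type variable, bound to the base node class as a forward reference.
Tree = TypeVar("Tree", bound="TreeNode")


class TreeNode(Generic[Tree]):
    """Simplified stand-in for a generic tree node (assumption, not xarray code)."""

    def __init__(self, children: dict[str, Tree] | None = None) -> None:
        self.children: dict[str, Tree] = dict(children or {})


class NamedNode(TreeNode[Tree]):
    """Subclass that forwards the type variable to the generic parent."""

    def __init__(
        self, name: str | None = None, children: dict[str, Tree] | None = None
    ) -> None:
        super().__init__(children)
        self.name = name


first: NamedNode = NamedNode("a", children={"b": NamedNode("b")})
child_names = [child.name for child in first.children.values()]
print(child_names)
```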

xarray/core/treenode.py

headtr1ck · 2024-10-15T16:37:41Z

xarray/core/treenode.py

+        # iteration early
+        yield active_nodes
+
+        first_node = active_nodes[0]


Note that it is theoretically possible to pass no arguments to this function. In that case trees, and therefore active_nodes here, is an empty tuple.

Maybe add something like this at the start:

if len(trees) < 2:
    yield trees
    return

Good point, I added an error to catch this case.
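A rough sketch of what such a guard could look like follows. The function name, exception type, and message are assumptions for illustration; the actual error added in this PR is not shown in the thread. One detail worth keeping in mind is that a check inside a generator function only runs once iteration starts, not when the function is called.

```python
from collections.abc import Iterator


def zip_nodes(*trees: object) -> Iterator[tuple[object, ...]]:
    """Hypothetical stand-in that only demonstrates the empty-input guard."""
    # Assumed exception type and message, chosen for illustration only.
    if not trees:
        raise TypeError("must pass at least one tree object")
    # With at least one tree, the first tuple of active nodes is the roots.
    yield trees


# Calling zip_nodes() alone does not raise; the guard fires on first iteration.
lazy = zip_nodes()
try:
    next(lazy)
except TypeError as err:
    message = str(err)
print(message)
```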

TomNicholas

> so the result does not depend on the order in which child nodes are defined.

I'm not sure I understand this. Surely zip_subtrees is zipping nodes according to the order they appear in .children, which is the order they are defined?

TomNicholas · 2024-10-15T20:50:12Z

xarray/core/treenode.py

+        # https://en.wikipedia.org/wiki/Breadth-first_search#Pseudocode
+        queue = collections.deque([self])
+        while queue:
+            node = queue.popleft()
+            yield node
+            queue.extend(node.children.values())


Replacing the entire iterators.py file with 6 lines is so clever it's almost rude 🤣
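For anyone who wants to try the queue-based traversal outside of xarray, here is a self-contained sketch of the same loop. The Node class is a hypothetical stand-in with a dict of children, not xarray's TreeNode.

```python
from __future__ import annotations

import collections
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tree node; not xarray's TreeNode."""

    name: str
    children: dict[str, Node] = field(default_factory=dict)

    def subtree(self) -> Iterator[Node]:
        # Breadth-first search with an explicit FIFO queue.
        queue = collections.deque([self])
        while queue:
            node = queue.popleft()
            yield node
            # Children are appended in dict (insertion) order.
            queue.extend(node.children.values())


root = Node("root", {"a": Node("a", {"c": Node("c")}), "b": Node("b")})
bfs_names = [node.name for node in root.subtree()]
print(bfs_names)  # ['root', 'a', 'b', 'c']
```

Both children of the root are visited before the grandchild, which is what distinguishes this from a depth-first walk.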

shoyer · 2024-10-15T21:54:42Z

> > so the result does not depend on the order in which child nodes are defined.

> I'm not sure I understand this. Surely zip_subtrees is zipping nodes according to the order they appear in .children, which is the order they are defined?

Let me restate this: zip_subtrees allows for zipping together multiple DataTree objects even if child nodes on different trees are defined in different orders, as long as the sets of each node's children match.

TomNicholas · 2024-10-15T22:20:59Z

And when you say "match" you mean the set of names of the children on tree A match the set of names of the children on tree B?

shoyer · 2024-10-15T23:04:11Z

> And when you say "match" you mean the set of names of the children on tree A match the set of names of the children on tree B?

Exactly, matching is based on relative path from each root.
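To make the matching rule concrete, here is a minimal sketch of name-based zipping on a toy tree. The Node class and zip_by_name function are hypothetical and are not the implementation in this PR; they only illustrate pairing children by name so that definition order does not matter.

```python
from __future__ import annotations

import collections
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tree node; not xarray's DataTree."""

    name: str
    children: dict[str, Node] = field(default_factory=dict)


def zip_by_name(first: Node, *others: Node) -> Iterator[tuple[Node, ...]]:
    """Yield tuples of corresponding nodes, breadth-first, matched by child name."""
    queue = collections.deque([(first, *others)])
    while queue:
        active = queue.popleft()
        yield active
        lead = active[0]
        names = set(lead.children)
        # Every tree must have the same set of child names at this node.
        if any(set(node.children) != names for node in active[1:]):
            raise ValueError(f"children of {lead.name!r} do not match")
        # Iterate in the first tree's order, but look up the others by name.
        for name in lead.children:
            queue.append(tuple(node.children[name] for node in active))


left = Node("left", {"a": Node("a"), "b": Node("b")})
right = Node("right", {"b": Node("b"), "a": Node("a")})  # same names, other order
pairs = [(x.name, y.name) for x, y in zip_by_name(left, right)]
print(pairs)  # [('left', 'right'), ('a', 'a'), ('b', 'b')]
```

In this sketch the iteration order follows the first tree, and a mismatch in child names raises as soon as the offending node is reached.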

TomNicholas · 2024-10-16T01:25:57Z

> matching is based on relative path from each root.

So I think this will basically create a breaking change relative to how map_over_subtree (and hence all arithmetic) used to work in xarray-contrib/datatree. In the old model the definition of isomorphic was such that you could actually multiply two trees with the same structure but differently-named nodes, e.g.

dt1 = DataTree.from_dict({'a': ..., 'a/b': ..., 'a/c': ...})
dt2 = DataTree.from_dict({'e': ..., 'e/f': ..., 'e/g': ...})

dt1 * dt2  # would return a tree with names following dt1

I was never really sure if that generality was actually necessary though.
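As an illustration of the older, position-based notion of isomorphism described above, here is a small sketch that pairs nodes purely by child position and ignores names. It uses nested dicts as toy trees and is not the xarray-contrib/datatree code; the function name and path format are assumptions.

```python
from collections.abc import Iterator


def zip_by_position(
    left: dict, right: dict, left_path: str = "/", right_path: str = "/"
) -> Iterator[tuple[str, str]]:
    """Yield pairs of paths for nodes that sit at the same position in both trees."""
    yield left_path, right_path
    if len(left) != len(right):
        raise ValueError(f"{left_path} and {right_path} have different numbers of children")
    # Children are paired by order of definition; their names play no role.
    for (lname, lchild), (rname, rchild) in zip(left.items(), right.items()):
        yield from zip_by_position(
            lchild,
            rchild,
            left_path.rstrip("/") + "/" + lname,
            right_path.rstrip("/") + "/" + rname,
        )


# Toy trees: each key is a child name, each value is that child's own children.
dt1 = {"a": {"b": {}, "c": {}}}
dt2 = {"e": {"f": {}, "g": {}}}
matched = list(zip_by_position(dt1, dt2))
print(matched)  # [('/', '/'), ('/a', '/e'), ('/a/b', '/e/f'), ('/a/c', '/e/g')]
```

Under this rule, swapping the order of two children changes which nodes are paired, which is the order dependence that name-based matching avoids.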

Your new definition of isomorphic is sort of like always having strict_names=True, except that as long as the sets of names match, the names are used to determine the corresponding nodes in the other tree, regardless of order.

Therefore if we're relaxing this then I think there is no longer any need to think of the data model of datatree as being an ordered set of children, as no mapping behaviour will depend on that order any longer.

I think it's okay to change this behaviour, including getting rid of the strict_names kwarg. After all, I have never once needed the generality above! I just wanted to really spell out the implications to make sure that I and anyone reading are following.

The last sentence of this section of the docs will need changing, and the change would merit an entry in the migration guide.

TomNicholas

Now that I think I understand the behaviour change here, the implementation is great! I really like the idea of zip_subtrees as a primitive.

shoyer · 2024-10-16T15:55:31Z

> The last sentence of this section of the docs will need changing, and the change would merit an entry in the migration guide.

I'll do this in the next PR when I migrate over the map_over_datasets stuff.



shoyer requested a review from TomNicholas October 15, 2024 06:49

shoyer added 2 commits October 15, 2024 15:57

fix pytype error

23da8ca

Merge branch 'main' into zip_subtree

4480e11

shoyer mentioned this pull request Oct 15, 2024

Updates to DataTree.equals and DataTree.identical #9627

Merged

1 task

headtr1ck approved these changes Oct 15, 2024


headtr1ck added the topic-DataTree Related to the implementation of a DataTree class label Oct 15, 2024

shoyer added 2 commits October 15, 2024 13:45

Merge branch 'main' into zip_subtree

c22ff76

Tweaks per review

373886a

TomNicholas reviewed Oct 15, 2024


TomNicholas approved these changes Oct 16, 2024


shoyer merged commit 0c1d02e into pydata:main Oct 16, 2024
29 checks passed

TomNicholas mentioned this pull request Oct 18, 2024

Why do arithmetic operations between two datatrees depend on the order of subtrees? #9643

Closed
