DataTree should support Hashable names. #8836

flamingbear · 2024-03-14T22:37:09Z

What is your issue?

While porting xarray-contrib/datatree into pydata/xarray, we discovered some type mismatches.

The general feeling was that we should support Hashable in order to improve DataTree interactions with Dataset and DataArrays.

The quick solution of changing the name type to Hashable in NamedNode fails quickly because of its pathlib.PurePath inheritance.

This issue just tracks that we want to come back to this.


shoyer · 2024-09-10T05:18:56Z

The other option worth considering is to deprecate non-string names on Dataset and DataArray. I'm sure this has come up for discussion before...

TomNicholas · 2024-09-10T16:22:54Z

👍 to removing non-string names - I think it's been more trouble than it's worth...

max-sixty · 2024-09-10T18:18:41Z

What's the case for removing non-string names? My memory was we had had issues defining what exactly could be a key, but that these were mostly fixed, would be a lot of work to undo, and that many of the issues were around consistency rather than per-se non-string names...

TomNicholas · 2024-09-10T18:43:34Z

What's the case for removing non-string names?

Speaking strictly for DataTree, the basic model of DataTree can be thought of as Mapping[str, Dataset], where str is path-like. DataTree.from_dict is a pretty fundamental constructor that uses this pattern, and many methods are implemented by going back and forth between a Mapping[str, Dataset] representation and a representation of linked DataTree objects.
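
A minimal sketch of that back-and-forth, assuming plain nested dicts as a stand-in for linked tree nodes and plain strings as a stand-in for Dataset objects. The helpers `flatten` and `unflatten` are hypothetical names for illustration and are not part of xarray.

```python
from pathlib import PurePosixPath


def flatten(tree: dict, prefix: str = "/") -> dict[str, object]:
    """Turn a nested dict into a flat mapping of path-like strings to leaves."""
    flat: dict[str, object] = {}
    for name, value in tree.items():
        path = str(PurePosixPath(prefix) / name)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat: dict[str, object]) -> dict:
    """Rebuild the nested dict by splitting each path into its segments."""
    tree: dict = {}
    for path, value in flat.items():
        *parents, leaf = PurePosixPath(path).parts[1:]  # drop the root "/"
        node = tree
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return tree


nested = {"a": {"b": "dataset-1", "c": {"d": "dataset-2"}}}
flat = flatten(nested)
print(flat)
print(unflatten(flat) == nested)
```

The round trip only works because every name is a string that can be joined with and split on the separator.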

This correspondence assumes that the names of children (groups) can be concatenated into path-like strings, using / as a separator. The NodePath object is very useful here, but it inherits from pathlib.PurePosixPath, which assumes path segments are strings. If the names of children can instead be Hashable, concatenating these non-string segments produces node paths which aren't strings, and which aren't so easily deconstructed as /<str>/<str>/<str>/....
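
To illustrate the constraint with the standard library alone (this does not use NodePath itself), here is a small sketch. The `Color` enum and the `join_segments` helper are made up for the example.

```python
from enum import Enum
from pathlib import PurePosixPath


class Color(Enum):
    RED = "red"


def join_segments(*segments):
    """Join segments under the root, as a PurePosixPath-based class would."""
    return PurePosixPath("/", *segments)


# String segments concatenate as expected.
string_path = join_segments("path", "to", "var")
print(string_path)

# A non-string hashable segment is rejected by pathlib.
try:
    join_segments("path", "to", Color.RED)
    rejected = False
except TypeError as err:
    rejected = True
    print(f"TypeError: {err}")
```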

Okay, so why don't we allow names of variables to be Hashable but names of children to be str? Well, DataTree stores both variables and children on the same object, so then instead of just

class DataTree:
    def __getitem__(self, key: str) -> DataArray | DataTree:
        ...

we would have

class DataTree:
    @overload
    def __getitem__(self, key: Hashable) -> DataArray:
        ...

    @overload
    def __getitem__(self, key: str) -> DataTree:
        ...

    def __getitem__(self, key: str | Hashable) -> DataArray | DataTree:
        ...
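
One reason this split is awkward is that str is itself Hashable, so a string key matches both signatures. A rough runtime sketch of that overlap follows; the `possible_results` helper is hypothetical and only returns type names as strings.

```python
from collections.abc import Hashable


def possible_results(key: Hashable) -> set[str]:
    """List what a lookup could return for a key, under the split above."""
    if not isinstance(key, Hashable):
        raise TypeError(f"unhashable key: {key!r}")
    if isinstance(key, str):
        # A str satisfies both signatures, so either kind of object may come back.
        return {"DataArray", "DataTree"}
    # Any other hashable could only name a variable.
    return {"DataArray"}


print(isinstance("temperature", Hashable))
print(possible_results("temperature"))
print(possible_results(("lat", "lon")))
```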

Finally, these non-str names can rarely be serialized.

I think the "paths as concatenated names" is the actual problem, the rest are just things to work around.

max-sixty · 2024-09-10T19:10:54Z

I was predominantly asking about the case for forcing Datasets to have string keys — I would be quite hesitant about changing Dataset's types to align with DataTree's at this stage...

shoyer · 2024-09-10T20:30:50Z

Thinking about this a bit more, in principle I don't see any reason why we can't switch from str -> Hashable in DataTree. It just means that the internal DataTree APIs relied on for operations will need to switch from using a string path to a tuple of path segments.
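
As a rough sketch of what "a tuple of path segments" could look like (not the actual DataTree internals; `SegmentPath`, `child`, `parent`, `is_descendant`, and `Color` are names invented here):

```python
from collections.abc import Hashable
from enum import Enum

# A path is a tuple of hashable segments; the empty tuple is the root.
SegmentPath = tuple[Hashable, ...]


class Color(Enum):
    RED = "red"


def child(path: SegmentPath, name: Hashable) -> SegmentPath:
    """Path of the child called `name` under `path`."""
    return path + (name,)


def parent(path: SegmentPath) -> SegmentPath:
    """Path of the parent node (the root is its own parent)."""
    return path[:-1]


def is_descendant(path: SegmentPath, ancestor: SegmentPath) -> bool:
    """True if `path` lies strictly below `ancestor`."""
    return len(path) > len(ancestor) and path[: len(ancestor)] == ancestor


group = child(child((), "path"), Color.RED)
variable = child(group, 42)

# Tuples of hashables are hashable, so they can key a flat mapping directly.
nodes = {(): "root", group: "group", variable: "variable"}
print(nodes[("path", Color.RED, 42)])
```

No string concatenation is involved, so nothing needs to be parsed back out of a joined path.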

TomNicholas · 2024-09-10T20:33:23Z

(I wrote this out before @shoyer's comment so I'm going to paste it anyway)

I was predominantly asking about the case for forcing Datasets to have string keys

I mean to me all of these issues seem like a lot of extra complexity in our code for like 1% of users...

I also still don't really understand what analyses you can do with names of variables / dims as Enum/tuple[str, ...] (apparently the most common non-str case) that you can't do with just str. But I would be happy to learn! (cc @headtr1ck )

Also if these types can't be serialized to netCDF / Zarr then that's an argument against allowing it to exist in-memory IMO.

I had forgotten about this proposal to use Generic-typed keys (#8199, #8276), but I'm unclear if that would solve the DataTree problem.

shoyer · 2024-09-10T20:35:15Z

I mean to me all of these issues seem like a lot of extra complexity in our code for like 1% of users...

I think this is the main argument. Making Hashable work properly adds a lot of complexity for very niche use-cases.

TomNicholas · 2024-09-10T20:39:48Z

Having said all that, going back to what to do in DataTree:

in principle I don't see any reason why we can't switch from str -> Hashable in DataTree. It just means that the internal DataTree APIs relied on for operations will need to switch from using a string path to a tuple of path segments.

Maybe? Internally we could rewrite NodePath to no longer inherit from pathlib.PurePosixPath (crying because I was so proud of that), but I'm unclear how this syntax:

dt['/path/to/<str-variable-or-child-name>']

can be done without forcing the user to pass it all as a tuple:

dt[('/', 'path', 'to', <weird-non-str-variable-or-child-name>)]

There is a very interesting suggestion buried in a comment from 2018 #2292 (comment):

Some options that come to mind:

Allow any object with a __str__ method to be supplied as a variable/dimension label, but then delegate all internal sorting/printing/etc. logic to str(label).

This covers both Enum and tuple[str, ...], and I think this could also work for DataTree, i.e.:

dt["/path/to/Enum('Red')"]

or whatever.

Possibly with some added restriction around not including the / character (which could just be a runtime check specific to DataTree, see #9378). Or maybe it's any type that can be coerced to pathlib.PurePosixPath?
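
A minimal sketch of the str(label) idea with the / restriction as a runtime check. The `segment` and `to_path` helpers are hypothetical, and the exact text produced depends on each label's own __str__.

```python
from enum import Enum


class Color(Enum):
    RED = "red"


def segment(label: object) -> str:
    """Coerce a label to a path segment via str(), rejecting unusable text."""
    text = str(label)
    if not text or "/" in text:
        raise ValueError(f"label {label!r} cannot be used as a path segment")
    return text


def to_path(*labels: object) -> str:
    """Join coerced labels into an absolute path-like string."""
    return "/" + "/".join(segment(label) for label in labels)


print(to_path("path", "to", Color.RED))
print(to_path("path", ("lat", "lon")))
```

One caveat of delegating to str(): distinct labels can collapse to the same segment, for example the int 1 and the string "1".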

TomNicholas · 2024-09-10T20:46:58Z

Historically, it doesn't seem like the discussion in #2292 was ever properly resolved. Adding in Hashable just went ahead without anyone involved answering the concerns raised there, or there being an explicit agreement on a decision. 🫤 There are multiple comments in there which anticipated problems we did run into with Hashable.

max-sixty · 2024-09-10T21:11:53Z

I (very respectfully :)) think there's a significant risk that you guys are annoyed by the finickiness of typing, and assigning all that blame to Hashable...

Taking each of these in turn:

Align typing of dimension inputs #7094: mostly because of ..., which we'd still have an issue with. Also Iterable vs Sequence, again no change. Not sure whether the sub-issue about None is because None is Hashable; possibly that's one which would be fixed with the change.
Inconsistent Type Hinting for dims Parameter in xarray Methods #8210: I just added a comment that it seems to be a case of mypy not solving all problems, and that we can safely close it.
dimensions: type as str | Iterable[Hashable]? #6142: I think this is unsolved by str | Iterable[str], given that those are also not disjoint; from my comment on the issue: "So str | Iterable[Hashable] works when one is checked before the other, but not when they're required to be disjoint, like an overload."
Add public API for Dataset._copy_listed #3894 (comment): I don't immediately see how this is solved by changing to str; it seems like another where it's just "typing is painful", but on this one I'm likely wrong?
Support non-string dimension/variable names #2292: I think this is solved, just added a comment to it.
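
To make the "not disjoint" point concrete: a str is also an iterable of (hashable) characters, so the order of checks matters. A small sketch, where `normalize_dims` is a hypothetical helper and not an xarray function:

```python
from __future__ import annotations

from collections.abc import Hashable, Iterable


def normalize_dims(dims: str | Iterable[Hashable]) -> tuple[Hashable, ...]:
    """Return dims as a tuple of names, treating a bare str as one name."""
    # The str check must come first, because a str also matches Iterable.
    if isinstance(dims, str):
        return (dims,)
    return tuple(dims)


print(isinstance("time", Iterable))
print(normalize_dims("time"))
print(normalize_dims(["time", ("lat", "lon")]))
print(tuple("time"))  # what happens if the str check is skipped
```

The same reasoning applies to str | Iterable[str], since a str is an iterable of str as well.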

I would strongly think we shouldn't change Dataset's typing because of DataTree at this stage. Maybe we find the DataTree typing is intractable with Hashable and the compatibility is more important given the increasing adoption of DataTree, but that will reveal itself with time.

Is that reasonable? Trying not to be defensive etc, tell me if I'm not sounding well-balanced here.

TomNicholas · 2024-09-10T22:19:10Z

Thanks for the gentle pushback @max-sixty ! (Do you want to make a reappearance in tomorrow's meeting? Would be great to see you there :) )

Taking each of these in turn:

Your responses are very reasonable, but I think there are still valid concerns in #2292 (comment) that haven't been addressed.

I would strongly think we shouldn't change Dataset's typing because of DataTree at this stage.

Even if it was to change Hashable to e.g. str | Enum or some other more restricted non-str type? Because apparently that's what most users of this feature are actually using. That leaves only like a fraction of 1% of users...

Generally I'm only about 10% anti-Hashable in Dataset and about 90% pro any solution that allows us to consolidate code between Dataset and DataTree, without losing the neat path-like access to variables in DataTree that I mentioned above. I also obviously have a very strong bias here 😅

If we can make Hashable work in DataTree then let's do that! I would like to talk more about possible solutions to #8836 (comment).

shoyer · 2024-09-10T22:47:58Z

I (very respectfully :)) think there's a significant risk that you guys are annoyed by the finickiness of typing, and assigning all that blame to Hashable...

Lol, quite possibly true!

Maybe? Internally we could rewrite NodePath to no longer inherit from pathlib.PurePosixPath (crying because I was so proud of that), but I'm unclear how this syntax:
dt['/path/to/<str-variable-or-child-name>']

I think it would be fine not to support this syntax for non-string names. Syntax like dt.children[x].children[y].dataset[z] or dt[x][y][z] for hierarchical access should work for arbitrary hashable objects.

We might need a few more convenience APIs for the internal DataTree implementation (because dt.root[dt.path] is dt would not always be valid) but I think we could make it work. We could still display paths and allow them in __getitem__/__setitem__ for strings, but arbitrary hashables would be valid for accessing immediate children.
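
A toy sketch of that behaviour, assuming a made-up `Node` class rather than the real DataTree: string keys containing / are treated as paths, and any other hashable key looks up an immediate child only.

```python
from __future__ import annotations

from collections.abc import Hashable


class Node:
    """Toy tree node whose children may be named by any hashable."""

    def __init__(self, children: dict | None = None):
        self.children = dict(children or {})

    def __getitem__(self, key: Hashable):
        if isinstance(key, str) and "/" in key:
            # Path-like access: only available when every name is a str.
            node = self
            for part in key.strip("/").split("/"):
                node = node.children[part]
            return node
        # Arbitrary hashables address immediate children only.
        return self.children[key]


tree = Node({"path": Node({"to": Node({"var": 1, 42: 2})})})

print(tree["/path/to/var"])    # string path
print(tree["path"]["to"][42])  # chained access with a non-str name
```

Here the entry named 42 is reachable by chaining but has no string path, which is the case where dt.root[dt.path] is dt would stop holding.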

flamingbear added the needs triage Issue that has not been reviewed by xarray team member label Mar 14, 2024

TomNicholas added topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Mar 15, 2024

TomNicholas mentioned this issue Mar 26, 2024

Migrate datatree.py module into xarray.core. #8789

Merged

4 tasks

github-project-automation bot added this to DataTree integration Aug 27, 2024

github-project-automation bot moved this to To do in DataTree integration Aug 27, 2024

TomNicholas added the topic-typing label Sep 10, 2024

TomNicholas mentioned this issue Sep 10, 2024

Fix DataTree.coords.__setitem__ by adding DataTreeCoordinates class #9451

Merged

4 tasks

max-sixty mentioned this issue Sep 10, 2024

Support non-string dimension/variable names #2292

Closed

TomNicholas mentioned this issue Oct 19, 2024

Re-implement map_over_datasets using group_subtrees #9636

Merged

2 tasks
