Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataTree should support Hashable names. #8836

Open
flamingbear opened this issue Mar 14, 2024 · 13 comments
Open

DataTree should support Hashable names. #8836

flamingbear opened this issue Mar 14, 2024 · 13 comments
Labels
topic-DataTree Related to the implementation of a DataTree class topic-typing

Comments

@flamingbear
Copy link
Member

What is your issue?

In porting xarray-contrib/datatree into pydata/xarray. We discovered some type mismatches.

The general feeling was that we should support Hashable in order to improve DataTree interactions with Dataset and DataArrays.

The quick solution of changing the name type to Hashable in NamedNode fails quickly because of it's PathPurePath inheritance.

This issue just tracks that we want to come back to this.

@flamingbear flamingbear added the needs triage Issue that has not been reviewed by xarray team member label Mar 14, 2024
@TomNicholas TomNicholas added topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Mar 15, 2024
@shoyer
Copy link
Member

shoyer commented Sep 10, 2024

The other option worth considering is to deprecate non-string names on Dataset and DataArray. I'm sure this has come up for discussion before...

@TomNicholas
Copy link
Member

👍 to removing non-string names - I think its been more trouble than its' worth...

@max-sixty
Copy link
Collaborator

What's the case for removing non-string names? My memory was we had had issues defining what exactly could be a key, but that these were mostly fixed, would be a lot of work to undo, and that many of the issues were around consistency rather than per-se non-string names...

@TomNicholas
Copy link
Member

What's the case for removing non-string names?

Speaking strictly for DataTree, the basic model of DataTree can be thought of as Mapping[str, Dataset], where str is path-like. DataTree.from_dict is a pretty fundamental constructor that uses this pattern, and many methods are implemented by going back and forth between a Mapping[str, Dataset] representation and a linked DataTree objects representation.

This correspondence assumes that the names of children (groups) must be something that can be concatenated into path-like strings, using / as a separator. The NodePath object is very useful here, but inherits from pathlib.PurePosixPath, which assumes path segments are strings. If alternatively the names of children can be Hashable, concatenating these non-string segments produces node paths which aren't strings, and aren't so easily deconstructed as /<str>/<str>/<str/....

Okay so why don't we allow names of variables to be Hashable but names of children to be str? Well DataTree stores both variables and children on the same object so then instead of just

class DataTree:
    def __getitem__(self, key: str) -> DataArray | DataTree:
        ...

we would have

class DataTree:
    @overload
    def __getitem__(self, key: Hashable) -> DataArray:
        ...

    @overload
    def __getitem__(self, key: str) -> DataTree:
        ...

    def __getitem__(self, key: str | Hashable) -> DataArray | DataTree:
        ...

Finally these non-str names can rarely be serialized.

I think the "paths as concatenated names" is the actual problem, the rest are just things to work around.

@max-sixty
Copy link
Collaborator

I was predominantly asking about the case for forcing Datasets to have string keys — I would be quite hesitant about changing Dataset's types to align with DataTree's at this stage...

@shoyer
Copy link
Member

shoyer commented Sep 10, 2024

Thinking about this a bit more, in principle I don't see any reason why we can't switch from str -> Hashable in DataTree. It just means that the internal DataTree APIs relied on for operations will need to switch from using a string path to a tuple of path segments.

@TomNicholas
Copy link
Member

TomNicholas commented Sep 10, 2024

(I wrote this out before @shoyer's comment so I'm going to paste it anyway)

I was predominantly asking about the case for forcing Datasets to have string keys

I mean to me all of these issues seem like a lot of extra complexity in our code for like 1% of users...

#7094

#8210

#6142

#3894 (comment)

#2292

I also still don't really understand what analyses you can do with names of variables / dims as Enum/tuple[str, ...] (apparently the most common non-str case) that you can't do with just str. But I would be happy to learn! (cc @headtr1ck )

Also if these types can't be serialized to netCDF / Zarr then that's an argument against allowing it to exist in-memory IMO.

I had forgotten about this proposal to use Generic-typed keys (#8199, #8276), but I'm unclear if that would solve the DataTree problem.

@shoyer
Copy link
Member

shoyer commented Sep 10, 2024

I mean to me all of these issues seem like a lot of extra complexity in our code for like 1% of users...

I think this is the main argument. Making Hashable work properly adds a lot of complexity for very niche use-cases.

@TomNicholas
Copy link
Member

Having said all that, going back to what to do in DataTree:

in principle I don't see any reason why we can't switch from str -> Hashable in DataTree. It just means that the internal DataTree APIs relied on for operations will need to switch from using a string path to a tuple of path segments.

Maybe? Internally we could rewrite NodePath to no longer inherit from pathlib.PurePosixPath (crying because I was so proud of that), but I'm unclear how this syntax:

dt['/path/to/<str-variable-or-child-name>']

can be done without forcing the user to pass it all as a tuple:

dt[('/', 'path', 'to', <weird-non-str-variable-or-child-name>)]

There is a very interesting suggestion buried in a comment from 2018 #2292 (comment):

Some options that come to mind:

Allow any object with a __str__ method to be supplied as a variable/dimension label, but then delegate all internal sorting/printing/etc. logic to str(label).

This covers both of Enum/ tuple[str, ...], and I think this could also work for DataTree. i.e:

dt['/path/to/Enum('Red')']

or whatever.

Possibly with some added restriction around not including the / character (which could just be a runtime check specific to DataTree, see #9378). Or maybe its any type that can be coerced to pathlib.PurePosixPath?

@TomNicholas
Copy link
Member

Historically, it doesn't seem like the discussion in #2292 was ever properly resolved. Adding in Hashable just went ahead without anyone involved answering the concerns raised there, or there being an explicit agreement on a decision. 🫤 There are multiple comments in there which anticipated problems we did run into with Hashable.

@max-sixty
Copy link
Collaborator

max-sixty commented Sep 10, 2024

I (very respectfully :)) think there's a significant risk that you guys are annoyed by the finickiness of typing, and assigning all that blame to Hashable...

Taking each of these in turn:


I would strongly think we shouldn't change Dataset's typing because of DataTree at this stage. Maybe we find the DataTree typing is intractable with Hashable and the compatibility is more important given the increasing adoption of DataTree, but that will reveal itself with time.


Is that reasonable? Trying not to be defensive etc, tell me if I'm not sounding well-balanced here.

@TomNicholas
Copy link
Member

Thanks for the gentle pushback @max-sixty ! (Do you want to make a reappearance in tomorrow's meeting? Would be great to see you there :) )

Taking each of these in turn:

Your responses are very reasonable, but I think there are still valid concerns in #2292 (comment) that haven't been addressed.

I would strongly think we shouldn't change Dataset's typing because of DataTree at this stage.

Even if it was to change Hashable to e.g. str | Enum or some other more restricted non-str type? Because apparently that's what most users of this feature are actually using. That leaves only like a fraction of 1% of users...


Generally I'm only about 10% anti-Hashable in Dataset and about 90% pro any solution that allows us to consolidate code between Dataset and DataTree, without losing the neat path-like access to variables in DataTree that I mentioned above. I also obviously have a very strong bias here 😅

If we can make Hashable work in DataTree then let's do that! I would like to talk more about possible solutions to #8836 (comment).

@shoyer
Copy link
Member

shoyer commented Sep 10, 2024

I (very respectfully :)) think there's a significant risk that you guys are annoyed by the finickiness of typing, and assigning all that blame to Hashable...

Lol, quite possibly true!

Maybe? Internally we could rewrite NodePath to no longer inherit from pathlib.PurePosixPath (crying because I was so proud of that), but I'm unclear how this syntax:

dt['/path/to/<str-variable-or-child-name>']

I think it would be fine not to support this syntax for non-string names. Syntax like dt.children[x].children[y].dataset[z] or dt[x][y][z] for hierarchical access should work for arbitrary hashable objects.

We might need a few more convenience APIs for the internal DataTree implementation (because dt.root[dt.path] is dt would not always be valid) but I think we could make it work. We could still display paths and allow them in __getitem__/__setitem__ for strings, but arbitrary hashables would be valid for accessing immediate children.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class topic-typing
Projects
Development

No branches or pull requests

4 participants