DataTree access to variables in parent groups #9056
Labels: API design, design question, enhancement, needs discussion, topic-DataTree
Motivation
Accessing variables from parent groups in a tree would be useful. This has come up before in #1982 and xarray-contrib/datatree#297. Here I'm going to summarize some discussion from recent datatree meetings.
A use case is to have common coordinate variables between multiple sub-groups, for example a multi-resolution datatree with a `time` coordinate that conceptually is common to two groups.
It would be useful to be able to access the `time` coordinate variable from either child group, i.e. `dt['/high'].time`.
Indeed, the CF conventions explicitly describe this type of behaviour, in terms of searching for variables outside of the current group.
Problem
We could imagine changing the interface of `DataTree` to allow users to access any compatible variables on parent groups, where compatible means alignable. There are three issues with this:
[…] `.mean()`) over multiple nodes becomes really confusing, because copies of the same variable would effectively be present in multiple nodes.
Proposal
Let me make a concrete feature proposal for discussion, which has some specific features:
- Keep `.ds`, `.__getitem__` etc. on `DataTree` as-is. This means no breaking of backwards compatibility. This also means that we don't have to wait to implement all the details of this before releasing datatree in xarray `main`.
- A clear definition of "compatible variables" for inheritance. These are alignable variables that exist on a parent (or grandparent etc.). Q: Should these be just coordinate variables? Or all variables?
- Add additional API which allows access to inherited variables, via a new `.inherit` accessor on `DataTree` objects. (The name is not great, please feel free to suggest alternatives.)
  - `dt[...]` will never give access to inherited vars, `dt.inherit[...]` would allow `__getitem__` access to inherited vars
  - `dt.inherit.ds` would return a `DatasetView` of that node with extra inherited variables in it
  - `dt.inherit.to_dataset()` -> `xr.Dataset` containing inherited vars
  - `dt.inherit()`? -> `DataTree`
- Don't change `map_over_subtree` (again for backwards compatibility)
  - `map_over_inherited_subtree` isolates the concept of mapping over a tree with inherited variables
- This will be a new feature, to be done in a separate release (i.e. no blocker right now)
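To make the intended split between `dt[...]` and `dt.inherit[...]` concrete, here is a minimal sketch using a toy `Node` class. Everything in it (`Node`, `InheritedView`, `to_dict`) is a hypothetical stand-in written for illustration, not xarray's `DataTree` API, and it skips the "compatible variables" check entirely.

```python
from __future__ import annotations


class Node:
    """Toy stand-in for a DataTree node (hypothetical, not xarray's class)."""

    def __init__(self, variables: dict, parent: Node | None = None):
        self.variables = dict(variables)
        self.parent = parent

    def __getitem__(self, name):
        # Unchanged behaviour: only variables stored on this node are visible.
        return self.variables[name]

    @property
    def inherit(self) -> InheritedView:
        return InheritedView(self)


class InheritedView:
    """Read-only view that also exposes variables from ancestor nodes."""

    def __init__(self, node: Node):
        self._node = node

    def to_dict(self) -> dict:
        # Stand-in for `dt.inherit.to_dataset()`. Walking from the node
        # towards the root means the nearest definition of a name wins.
        merged = {}
        node = self._node
        while node is not None:
            for name, value in node.variables.items():
                merged.setdefault(name, value)
            node = node.parent
        return merged

    def __getitem__(self, name):
        return self.to_dict()[name]


root = Node({"time": [0, 1, 2]})
high = Node({"temperature": [280.0, 281.5, 279.9]}, parent=root)

print("time" in high.variables)  # plain access only sees local variables
print(high.inherit["time"])      # the accessor also sees the parent's `time`
```

Because the plain `__getitem__` is untouched, existing code keeps its current behaviour and inheritance is strictly opt-in.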
Implementation
`dt.inherit` returns an `InheritedNode`, which at construction time creates and caches a mapping of all inherited variables (`._inherited_variables`). This then acts like a normal `DataTree` node except that it consults the inherited variables instead of the normal list of variables.
Creating the list of inherited variables is done by walking up the tree from the current node, examining new variables as they are encountered.
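The walk up the tree could look roughly like the sketch below. It is a toy model under stated assumptions: variables are `(dim, values)` pairs with a single dimension, and "alignable" is reduced to "same length along a shared dimension name" (real alignment would compare index labels). `Node`, `dim_sizes` and `is_alignable` are hypothetical helpers, not part of xarray.

```python
from __future__ import annotations


class Node:
    """Toy tree node; each variable is a (dim_name, values) pair."""

    def __init__(self, variables: dict, parent: Node | None = None):
        self.variables = dict(variables)
        self.parent = parent

    @property
    def inherit(self) -> InheritedNode:
        return InheritedNode(self)


def dim_sizes(variables: dict) -> dict:
    # Map each dimension name to its length on this node.
    return {dim: len(values) for dim, values in variables.values()}


def is_alignable(variable, sizes: dict) -> bool:
    # Simplification: compatible if the dimension is new to the node
    # or has the same length as the node's own dimension.
    dim, values = variable
    return sizes.get(dim, len(values)) == len(values)


class InheritedNode:
    def __init__(self, node: Node):
        found = dict(node.variables)  # local variables take priority
        sizes = dim_sizes(found)
        ancestor = node.parent
        while ancestor is not None:
            for name, variable in ancestor.variables.items():
                # Only names not seen yet are examined, so the nearest
                # definition shadows anything further up the tree.
                if name not in found and is_alignable(variable, sizes):
                    found[name] = variable
            ancestor = ancestor.parent
        self._inherited_variables = found  # built once, at construction

    def __getitem__(self, name):
        return self._inherited_variables[name]


root = Node({"time": ("time", [0, 1, 2]), "station": ("station", ["a", "b"])})
high = Node({"temperature": ("time", [280.0, 281.5, 279.9])}, parent=root)
deep = Node({"pressure": ("time", [1000.0, 990.0, 985.0])}, parent=high)
coarse = Node({"temperature": ("time", [280.5, 280.0])}, parent=root)

print(deep.inherit["time"])                          # found on the grandparent
print(sorted(coarse.inherit._inherited_variables))   # root's `time` is skipped
```

In this sketch `coarse` has a `time` dimension of length 2, so the root's length-3 `time` is treated as incompatible and left out, while `station` is still inherited.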
Q: Does this design handle coordinate names?
EDIT: Actually there's an even simpler idea: `dt.inherit` -> `DataTree` which has a shallow copy of all compatible variables inherited onto that node. Then `.ds`, `.__getitem__` etc. will automatically behave as expected, as you will just have a new `DataTree` object with more valid keys.
Describe alternatives you've considered
That's what we currently have, and with this proposal we could eventually remove it if it turned out no-one liked it.
[…] `dt.__getitem__` to access inherited variables)
It's not possible to do this without breaking changes. It's also not clear that there is a general one-size-fits-all answer to when variables should or shouldn't be inherited. This proposal provides both behaviours.
Some kind of switch (on the specific object instances, globally, or with a context manager) could be used to switch between the two behaviours. But this seems extremely error-prone, and means that user code becomes ambiguous without knowing the state of the switch.
cc @shoyer @keewis @flamingbear @owenlittlejohns @eni-awowale
also @alexamici @benbovy I would love to hear your thoughts too.