Coordinate inheritance for xarray.DataTree #9077
Comments
This is an incredible write-up of the question @shoyer ! This will be very useful for getting people's feedback on. Nit: The only part I think could be clearer is that IIUC the reason that the second image pyramid is invalid in the inherited coordinates model is because the |
Hello, First, thanks for the wonderful diagrams, they convey the message so well and help to reason about the topic! I have been working with the Zarr format and satellite imagery recently, as well as with the experimental datatree python package and I can provide you some feedback. I will start with an example as maybe it will explain my point of view better. I work with a data structure looking like this:
Each About Consistency vs Flexibility:Until now I favored Flexibility
Datasets already allow coordinates to be shared between variables. Until now, I have not had a need for coordinate sharing at the tree level. I used DataTree as a way to organize collections of individually valid Datasets hierarchically, and for its nice interface to work with Zarr stores. While data duplication seems bad at first glance, it also has benefits: self-sufficient Zarr leaf groups that can be loaded as Datasets without the need for DataTree. I try (when possible) to use the simplest data structure available to represent my data. That means favoring Datasets over DataTrees and favoring DataArrays over Datasets. So I have two options: open the full Zarr store as a DataTree, or only open the leaf group that I am interested in as a Dataset.
Feedback
One of the selling points of xarray is that it is a universal data reader. It can open many datasets without too many preconditions on those datasets. However, some limitations have existed, like the inability of xarray to write or read multiple NetCDF groups at once ( #1092 ). I understand that datatree aims to solve this family of problems. So I think it would be nice if xarray tried to do its best to open as many datasets as possible. Could we have both Consistency and Flexibility, with a scope mechanism? For instance, the right case of the third schema is invalid because no override is possible: the names In a flat Dataset, there is a single shared namespace for dimension names, thus a dimension name can only exist once (this is a limitation for some applications when you have matrices of the same dimension, you cannot have for instance
I don't think my Zarr stores would be impacted negatively, because all my coordinates are on leaves. I can safely reuse my Example:
About Zarr groups
Accessing a Zarr group inside of a Zarr directory without knowing about the other groups is a use case where a Zarr store is used more as a database from which we only extract the necessary information, while being agnostic of the full structure. As with a database query, we don't necessarily need to know about the complete schema if we only need to use a few tables. Also, if you load a Zarr group, it seemed to me that it could not go and fetch variables higher up in the hierarchy. If you load To reuse the schemas of the original post, maybe a user needs to access a Zarr database but only needs the
About the memory footprint of coordinate duplication
In my own experience, coordinates are usually lighter than data. For instance, for a 2-dimensional square raster image of side |
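To put rough numbers on the footprint argument above, here is a small arithmetic sketch. It assumes a square raster of a given side length with 1-D x and y coordinates stored in the same dtype as the data (8 bytes per value); the function name `coord_to_data_ratio` and the side lengths are illustrative and not taken from the thread.

```python
def coord_to_data_ratio(side: int, itemsize: int = 8) -> float:
    """Bytes used by the 1-D x and y coordinates, relative to one side-by-side band."""
    coord_bytes = 2 * side * itemsize    # one x vector plus one y vector
    data_bytes = side * side * itemsize  # one 2-D band of values
    return coord_bytes / data_bytes      # simplifies to 2 / side


for side in (100, 1_000, 10_000):
    ratio = coord_to_data_ratio(side)
    print(f"side={side:>6}: coordinates are {ratio:.4%} of a single band")
```

Under these assumptions the relative cost of duplicating coordinates shrinks as 2 / side, and shrinks further when a group holds several bands or a time dimension.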
I tried to create an actual DataTree out of the example shown in the first schema, in case it helps:
DataTree creation code
import pandas as pd
import numpy as np
import xarray as xr

# DataTree was not yet part of the released public xarray API when this was written
from xarray.core import datatree as dt

xdt = dt.DataTree.from_dict(
    name="(root)",
    d={
        # Root node: only holds the shared "time" coordinate
        "/": xr.Dataset(
            coords={
                "time": xr.DataArray(
                    data=pd.date_range(start="2020-12-01", end="2020-12-02", freq="D")[
                        :2
                    ],
                    dims="time",
                    attrs={
                        "units": "date",
                        "long_name": "Time of acquisition",
                    },
                )
            },
            attrs={
                "description": "Root Hypothetical DataTree with heterogeneous data: weather and satellite"
            },
        ),
        # Weather node: defines "station" and uses "time" from the root
        "/weather_data": xr.Dataset(
            coords={
                "station": xr.DataArray(
                    data=list("abcdef"),
                    dims="station",
                    attrs={
                        "units": "dl",
                        "long_name": "Station of acquisition",
                    },
                )
            },
            data_vars={
                "wind_speed": xr.DataArray(
                    np.ones((2, 6)) * 2,
                    dims=("time", "station"),
                    attrs={
                        "units": "meter/sec",
                        "long_name": "Wind speed",
                    },
                ),
                "pressure": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "hectopascals",
                        "long_name": "Pressure",
                    },
                ),
            },
            attrs={"description": "Weather data node, inheriting the 'time' dimension"},
        ),
        # Temperature node: defines no coordinates of its own
        "/weather_data/temperature": xr.Dataset(
            data_vars={
                "air_temperature": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Air temperature",
                    },
                ),
                "dewpoint_temp": xr.DataArray(
                    np.ones((2, 6)) * 4,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Dew point temperature",
                    },
                ),
            },
            attrs={
                "description": (
                    "Temperature, subnode of the weather data node, "
                    "inheriting the 'time' dimension from root and 'station' "
                    "dimension from the weather_data group."
                )
            },
        ),
        # Satellite node: defines "x" and "y" and uses "time" from the root
        "/satellite_image": xr.Dataset(
            coords={"x": [10, 20, 30], "y": [90, 80, 70]},
            data_vars={
                "infrared": xr.DataArray(
                    np.ones((2, 3, 3)) * 5, dims=("time", "y", "x")
                ),
                "true_color": xr.DataArray(
                    np.ones((2, 3, 3)) * 6, dims=("time", "y", "x")
                ),
            },
        ),
    },
)
display(xdt)  # rich repr, available in IPython/Jupyter only
print(xdt)
DataTree('(root)', parent=None)
│ Dimensions: (time: 2)
│ Coordinates:
│ * time (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│ Data variables:
│ *empty*
│ Attributes:
│ description: Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│ │ Dimensions: (station: 6, time: 2)
│ │ Coordinates:
│ │ * station (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ wind_speed (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│ │ pressure (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│ │ Attributes:
│ │ description: Weather data node, inheriting the 'time' dimension
│ └── DataTree('temperature')
│ Dimensions: (time: 2, station: 6)
│ Dimensions without coordinates: time, station
│ Data variables:
│ air_temperature (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│ dewpoint_temp (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│ Attributes:
│ description: Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
Dimensions: (x: 3, y: 3, time: 2)
Coordinates:
* x (x) int64 24B 10 20 30
* y (y) int64 24B 90 80 70
Dimensions without coordinates: time
Data variables:
infrared (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
true_color (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0 |
Good suggestion, done!
This is definitely an alternative worth considering! In my opinion, inheritance of coordinates with optional overrides would definitely be preferable to a lack of any inheritance at all, because it reduces the need to have redundant coordinates. We can probably figure out a reasonable way to adjust the DataTree repr to make it (somewhat) clear when a coordinate is overridden. That said, my preference would be overrides with strictly enforced alignment. The problem is that with optional overrides, you have no way of guaranteeing alignment between hierarchy levels when building a dataset. You could set |
So, if I understand correctly, the first time a dimension appears in the hierarchy, it imposes its size on all nodes lower in the hierarchy. Overriding would only be possible if the downstream coordinate variable still has the same size as its parent. This means we can still benefit from coordinate inheritance while also benefiting from namespacing for groups, on the condition that sibling coordinate variables are declared at the same level in the tree hierarchy (3rd schema on the left: any potential subgroups of the "parallel" zoom groups would inherit the x and y coordinates of their zoom group). Then, what would be a concrete use case for overriding a coordinate variable with the same size but different coordinates? I realize that I don't really see any need for coordinate overriding in my own use case. The only issue would be that xarray cannot open all valid Zarr files (but still far more than it currently can with Datasets alone!), which can be problematic for people who read Zarr stores that are potentially not xarray-compatible; but an application producing Zarr stores using xarray will always produce valid xarray-readable stores (that's really what matters most to me). And it will not be a problem, as xarray-compatible Zarr stores will just be a subset of all valid Zarr stores. What I mean is that someone using xarray as a tool to produce data will probably think primarily in terms of the in-memory data structures of xarray, rather than the actual file storage formats and their shenanigans, especially if several of these formats must be produced and xarray is the lowest common denominator of them all. |
In principle, we could also allow overriding dimension sizes lower in a hierarchy. I agree that I do not see a concrete use case for optional overrides, beyond allowing reading arbitrary hierarchical netCDF/Zarr files in Xarray. However, this would sacrifice a great deal of guaranteed consistency, and in my opinion would not be a worthwhile tradeoff. |
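The size rule discussed in the last few comments can be sketched in plain Python. This is a toy model, not xarray's implementation: a tree is a hypothetical mapping from group path to dimension sizes, `ancestors` and `find_conflict` are made-up helper names, and the sizes are invented. It only compares dimension sizes, whereas strict alignment would also compare coordinate values.

```python
from typing import Dict, Iterator, Optional

Sizes = Dict[str, int]


def ancestors(path: str) -> Iterator[str]:
    """Yield the ancestor paths of `path`, nearest first, ending at the root "/"."""
    parts = [p for p in path.split("/") if p]
    for i in range(len(parts) - 1, -1, -1):
        yield "/" + "/".join(parts[:i])


def find_conflict(tree: Dict[str, Sizes]) -> Optional[str]:
    """Return a message for the first size conflict with an ancestor, or None."""
    for path in sorted(tree):
        for anc in ancestors(path):
            for dim, size in tree.get(anc, {}).items():
                child_size = tree[path].get(dim)
                if child_size is not None and child_size != size:
                    return (
                        f"{path}: {dim}={child_size} conflicts with "
                        f"{dim}={size} inherited from {anc}"
                    )
    return None


# Image pyramid with the base image at the root: the child disagrees with its parent.
base_at_root = {"/": {"x": 512, "y": 512}, "/zoom_2x": {"x": 256, "y": 256}}

# Same data with every zoom level in a sibling group: no node has a conflicting ancestor.
sibling_groups = {
    "/": {},
    "/zoom_1x": {"x": 512, "y": 512},
    "/zoom_2x": {"x": 256, "y": 256},
}

print(find_conflict(base_at_root))
print(find_conflict(sibling_groups))
```

Because siblings are never compared with each other, reusing the names x and y with different sizes is fine as long as no common ancestor defines them.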
@etienneschalk it's a good question, but I think that calling Some other good points came up in today's meeting from @eni-awowale :
|
@TomNicholas do you mean |
I've been very excited about datatree's "suggested feature" to break away from enforcing consistent coordinates between data arrays. Inherited coordinates would allow us to have multiple different "x" and "y" coordinates that can come together in a single structure. Without the "inherited coordinate" flexibility, we would likely have to start looking elsewhere as our application complexity grows. Generally speaking, we have:
For now, our "solution" is to "prefix" all our "new data" with something like the flexibility of having different trees is the main advantage I see in creating a hierarchy. |
This is such an interesting and important question. Thanks @shoyer for opening the question up for discussion in such a clear manner! Not sure I have too much to add, but in my opinion, the inheritance of coordinate variables is a very intuitive concept, and one that is assumed by a lot of data producers. Lots of existing data sets (at least for NASA Earth Science data, my focus) are structured assuming inherited coordinate variables. Nevertheless, I think having the ability to override this to allow for more flexibility if needed may be a good way to approach it.
I second this point! |
@flamingbear I don't know what to call it, but the most literal name would be |
This has been a great conversation and it's been interesting hearing everyone's insights. I think this inheritance model truly captures the spirit of CF conventions. I do worry that this model might be too restrictive for some ultra Quirky Data, unless we incorporate some additional functionality like Here is an example dataset that I think would be invalid against this model. The dimensions in the root group are
I like that I also want to add that I don't think we should necessarily make our decision based on one quirky collection at our DAAC. Though, I think this kind of quirkiness isn't unique to just GES DISC and NASA. Here is the data download link for anyone who is interested in looking at this granule. Edit: |
@TomNicholas Gotcha. I'm back on board. And I think this is an important feature. I've found one dataset that already would fail to validate, but I think the ability to open and see all the groups will be a useful feature, as will the ability to tweak the coords to be compliant before creating the tree. |
Your Question
TL;DR I'm for inheritance with overrides that can include differences in dimension size. Consistency that breaks the ability to read valid netCDF4 files does not strike me as a good idea, and not just (I hope!) because I'm also affiliated with a NASA DAAC. @eni-awowale said it quite clearly: "As a user, I like that I can just use open_datatree() to open any netCDF4 or he5 granule regardless of whether or not the data is CF compliant because often times it's not." The following example does not even violate CF Conventions, which I think is mum on inheritance (although the NUG is not). If I understand the consistency data model correctly, the following would generate a "test.nc" that
For those not used to working with I would want Xarray to read this file into a
My Question
What would
Here the placement of the Thank you for posting the question for discussion! |
Hi all. I work closely with Matt, Owen and Eni, but I thought I would add my perspective here. And apologies if I am not up on all of the nuances here, but I definitely recommend flexibility over enforcement here, or at least if there is the possibility to allow the flexible option vs. enforcement - that should be considered. Unfortunately, as much as we want our data designers to adhere to best practices, we rarely get a chance to influence their decisions. There may be good reasons for their choices, but this kind of data integrity rarely gets considered. I've just responded to another new mission data design this morning (NISAR) which is choosing to put all of their dimension variables in a "sidecar" group off of root - as in their case, there are a number of dimensions being considered and it probably seemed too cluttered at the root level. We call these datasets "quirky data" - and while it is not always the case, there are just too many cases not to try to make things a bit easier for rectifying the data. These issues always take up more time and effort than is imagined or planned for. |
Thanks everyone for your input here! Here's what I'm hearing:
But also:
So tl;dr People would like inherited coordinates, but only if they can still easily read "quirky" datasets into xarray. From our point of view as developers, we strongly favour having a single global behaviour which provides strict contracts that we can rely on and reason about. This allows us to write clearer abstractions, develop quicker, have clearer documentation, and less chance of confusing behaviour. Given this we are strongly leaning towards coordinate inheritance with strict alignment checks at |
I think this all sounds good. A couple of thoughts I had looking at this file... Some 1-D coordinate variables have associated "bounds" variables per the CF-Conventions. This led me to two questions:
(Note - the example file above doesn't have the issues with the questions I'm mentioning, as all the variables are in a single group, and so there are no child nodes to pass things down to. I just wanted to give an example of what bounds might look like) |
Xarray has a policy of generally not special-casing behaviour to follow any specific metadata conventions. In particular:
So with this in mind...
They shouldn't be special-cased. So if they are coordinate variables with alignable dimensions then they will be propagated. I believe that will give sensible behaviour anyway - if the "bounds" variables are defined in the same group as the 1D coordinate variable they refer to then they should be accessible from anywhere that the 1D coordinates are.
Xarray explicitly doesn't do any detailed changes to metadata (see the same FAQ question linked above). IMO changing the metadata like this would fall outside the scope of the data model, as it would require xarray to understand that certain metadata fields are intended to be dynamic and group-dependent. The completely general case of what to do about metadata propagation is tracked in #1614. |
In my current prototype implementation, as long as the bounds variables are stored as "coordinates" they will be automatically propagated to child nodes, unless there is a conflicting coordinate with the same name on the child. This (coordinates that are not indexes) is the only case in which inheritance is not strict. |
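As a rough illustration of the rule described in this comment, the toy model below keeps indexed and non-indexed coordinates in separate dictionaries. Everything here is hypothetical (the `inherit` helper, the node layout, and the values) and is not how the prototype is written: indexed coordinates must match the parent exactly, while a non-indexed coordinate such as a bounds variable is inherited unless the child defines one with the same name.

```python
from collections import ChainMap


def inherit(parent: dict, child: dict) -> dict:
    """Return the coordinates visible on `child`, given its `parent`.

    Each node is a dict with two entries: "indexed" and "non_indexed".
    """
    # Indexed coordinates are strict: a clash with the parent is an error.
    for name, values in child["indexed"].items():
        if name in parent["indexed"] and parent["indexed"][name] != values:
            raise ValueError(f"index coordinate {name!r} is not aligned with the parent")
    # ChainMap looks in the child first, so a same-named child entry shadows the parent's.
    return {
        "indexed": dict(ChainMap(child["indexed"], parent["indexed"])),
        "non_indexed": dict(ChainMap(child["non_indexed"], parent["non_indexed"])),
    }


root = {
    "indexed": {"time": [0, 1]},
    "non_indexed": {"time_bounds": [(0, 1), (1, 2)]},
}
child = {
    "indexed": {},
    "non_indexed": {"time_bounds": [(0.0, 0.5), (1.0, 1.5)]},  # conflicting name
}

visible = inherit(root, child)
print(visible["indexed"])      # "time" comes from the root
print(visible["non_indexed"])  # the child's own "time_bounds" is kept
```

A child that redefined time with different values would raise instead of silently overriding, which is the strict part of the model.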
Just wanted to mention the PR #9244 started at the SciPy sprints adds a test datatree fixture that reuses variable, dimension and coordinate names with different values and shapes, but adheres to the convention that allows inheritance since each group is at the same level. There are no variables/dims/coords in the root so nothing conflicts. This layout is representative of a convention used for inputs to some numerical metocean models. |
Thanks everyone for your comments. We implemented coordinate inheritance in #9063 |
What is your issue?
Should coordinate variables be inherited between different levels of an Xarray DataTree?
The DataTree object is intended to represent hierarchical groups of data in Xarray, similar to the role of sub-directories in a filesystem or HDF5/netCDF4 groups. A key design question is if/how to enforce coordinate consistency between different levels of a DataTree hierarchy.
As a concrete example of how enforcing coordinate consistency could be useful, consider the following hypothetical DataTree, representing a mix of weather data and satellite images:
Here there are four different coordinate variables, which apply to variables in the DataTree in different ways:
- time is a shared coordinate used by both weather and satellite variables
- station is used only for weather variables
- x and y are only used for satellite images
In this data model, coordinate variables are inherited to descendent nodes, which means that variables at different levels of a hierarchical DataTree are always aligned. Placing the time variable at the root node automatically indicates that it applies to all descendent nodes. Similarly, station is in the base weather_data node, because it applies to all weather variables, both directly in weather_data and in the temperature sub-tree. Accessing any of the lower level trees as an xarray.Dataset would automatically include coordinates from higher levels (e.g., time).
In an alternative data model, coordinate variables at every level of a DataTree are independent. This is the model currently implemented in the experimental DataTree project. To represent the same data, coordinate variables would need to be duplicated alongside data variables at every level of the hierarchy:
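The difference between the two models can be sketched with a toy tree in plain Python. The `Node` class and its `inherited_coords` method are hypothetical and exist only for this illustration; they are not part of xarray, and the coordinate values are borrowed loosely from the example in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    coords: Dict[str, list] = field(default_factory=dict)
    parent: Optional["Node"] = None

    def inherited_coords(self) -> Dict[str, list]:
        """Coordinates visible at this node: its own plus those of all ancestors."""
        merged = {} if self.parent is None else self.parent.inherited_coords()
        # Toy behaviour: the node's own entry wins on a name clash.
        # A strict model would instead reject a misaligned clash.
        merged.update(self.coords)
        return merged


root = Node({"time": ["2020-12-01", "2020-12-02"]})
weather_data = Node({"station": list("abcdef")}, parent=root)
temperature = Node(parent=weather_data)
satellite_image = Node({"x": [10, 20, 30], "y": [90, 80, 70]}, parent=root)

# Inherited model: temperature stores nothing itself but still sees both coordinates.
print(sorted(temperature.inherited_coords()))  # ['station', 'time']

# Independent model: only a node's own coords count, so temperature would need
# its own copies of "time" and "station" to describe its variables.
print(sorted(temperature.coords))  # []
```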
Which data model to prefer depends on which of two considerations we value more:
time coordinates on the weather and satellite data to know that they are the same. Alignment, including matching coordinates and dimension sizes, is enforced by the data model.
As a concrete example of what we lose in flexibility, consider the following two representations of a multiscale image pyramid, where each level of zoom has different x and y coordinates:
The version that places the base image at the root of the hierarchy would not be allowed in the inherited coordinates data model, because there would be conflicting x and y coordinates (or dimension sizes) between the root and child nodes. Instead, different levels of zoom would need to be placed under different groups (zoom_1x, zoom_2x, etc).
As we consider making this change to the (as yet unreleased) DataTree object in Xarray, I have two questions for prospective DataTree users:
xref: #9063, #9056
CC @TomNicholas, @keewis, @owenlittlejohns, @flamingbear, @eni-awowale