Dataset groups #1092
Comments
I am reluctant to add the additional complexity of groups directly into the I would rather see this living in another data structure built on top of |
This suggestion has some significant overlap with the data store / data discovery discussion from last weekend: https://aospy.hackpad.com/Data-StorageDiscovery-Design-Document-fM6LgfwrJ2K |
Yep, once again I haven't thought about all the implications this would have! This would indeed add a lot of complexity in the end. I'll try to follow your suggestion of building another data structure, for example - correct me if it's a wrong approach too - a |
One important reason to keep the tree-like structure within a dataset is that it provides some assurance to the recipient of the dataset that all the variables 'belong' in the same coordinate space. Constructing a tree (from a nested dictionary, say) whose leaves are datasets or dataArrays doesn't guarantee that the coordinates/dimensions in all the leaves are compatible, whereas a tree within the dataset does make a guarantee about the leaves. As far as motivation for making trees, I find myself with several dozen variable names such as As far as implementation, the |
@lamorton Thanks for explaining the use case here. This makes more sense to me now. I like your idea of groups as syntactic sugar around flat datasets with named keys. With an appropriate naming convention, we might even be able to put this into
|
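The idea of groups as sugar over a flat dataset with tuple keys can be pictured with the rough sketch below. It uses a plain dict in place of an `xarray.Dataset`, and `select_group` is a hypothetical helper, not part of xarray; the variable names are made up for illustration.

```python
# Hypothetical sketch: a plain dict stands in for a flat Dataset whose
# variable names are tuples. Nothing here is xarray API.
from typing import Any, Dict, Tuple

FlatVars = Dict[Tuple[str, ...], Any]


def select_group(flat: FlatVars, *prefix: str) -> FlatVars:
    """Return the variables under `prefix`, with the prefix stripped from each key."""
    n = len(prefix)
    return {
        key[n:]: value
        for key, value in flat.items()
        if len(key) > n and key[:n] == prefix
    }


flat = {
    ("scalar_vars", "temperature"): [280.0, 281.5],
    ("scalar_vars", "pressure"): [1000.0, 990.0],
    ("flux_vars", "heat_flux"): [1.2, 1.3],
}

# Selecting a group yields another flat mapping, keyed by the remaining parts.
scalars = select_group(flat, "scalar_vars")
print(sorted(scalars))
```

Because the result is again a flat mapping with tuple keys, the same helper works for nested groups of any depth.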
Just want to say that I'm very enthusiastic about this! Like @lamorton, I also find myself having a lot of variables with names containing the name(s) of their "group(s)". My initial idea was also to keep flat datasets and add some logic to get/set groups, but it wasn't very clear and well explained.
Makes perfect sense! I also find the idea of using tuples very clever! @shoyer do you have an idea on how it would work with serialization to netCDF? We would also have to decide how to display groups in the repr of the flat dataset... @lamorton @shoyer unless you want to open a PR, I'd be willing to start working on this. |
Would the domain for this just be to simulate the tree-like structure that NetCDF permits, or could it extend to multiple datasets on disk? One of the ideas that we had during the aospy hackathon involved some sort of idiom based on xarray for packing multiple, similar datasets together. For instance, it's very common in climate science to re-run a model multiple times nearly identically, but changing a parameter or boundary condition. So you end up with large archives of data on disk which are identical in shape and metadata, and you want to be able to quickly analyze across them. As an example, I built a helper tool during my dissertation to automate much of this, allowing you to dump your processed output in some sort of directory structure and consistent naming scheme, and then easily ingest what you need for a given analysis. It's actually working great for a much larger, Monte Carlo set of model simulations right now (3 factor levels with 3-5 values at each level, for a total of 1500 years of simulation). My tool works by concatenating each experimental factor as a new dimension, which lets you use xarray's selection tools to perform analyses across the ensemble. You can pre-process things before concatenating too, if the data ends up being too big to fit in memory (e.g. for every simulation in the experiment, compute time-zonal averages before concatenation). Going back to @shoyer's comment, it still seems as though there is room to build some sort of collection of |
@darothen you might be interested by the discussion we had here, although it doesn't solve anything related to selection across similar Dataset objects. I think that the collection of Both approaches may co-exist, though. I can imagine the case where we have (1) a set of, e.g., grid-search or monte-carlo model runs and (2) for each model run we have diagnostic variables defined in different places on the grid (e.g., nodes, edges...). The tuple-defined groups within a Dataset is useful for 2 and the collection of Dataset objects is useful for 1. As pointed out by @shoyer, such a collection of Dataset objects might be (preferably) implemented outside of xarray. |
Ah, thanks for the heads-up @benbovy! I see the difference now, and I agree
both approaches could co-exist. I may play around with building some of
your proposed `DatasetNode` functionality into my `Experiment` tool.
|
With netCDF4, we could potentially just use groups. Or we could use some sort of naming convention for strings, e.g., joining together the parts of the tuple with One challenge here is that unless we also let dimensions be group specific, not every netCDF4 file with groups corresponds to a valid xarray Dataset: you can have conflicting sizes on dimensions for netCDF4 files in different groups. In principle, it could be OK to use tuples for dimension names, but we already have lots of logic that distinguishes between single and multiple dimensions by looking for non-strings or tuples. So you would probably have to write How to handle dimensions and coordinate names when assigning groups is clearly one of the important design decisions here. It's obvious that data variables should be grouped but less clear how to handle dimensions/coordinates.
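The naming-convention option could look roughly like the sketch below. The separator is an assumption: `/` is used here only because it mirrors netCDF4 group paths. Both helper names are hypothetical and not part of xarray.

```python
# Hypothetical helpers: encode tuple variable names as strings for backends
# without native groups, and decode them on load. SEP is an assumption.
from typing import Tuple

SEP = "/"


def encode_name(name: Tuple[str, ...]) -> str:
    """Join tuple parts into one string, rejecting parts that contain SEP."""
    for part in name:
        if SEP in part:
            raise ValueError(f"name part {part!r} contains separator {SEP!r}")
    return SEP.join(name)


def decode_name(encoded: str) -> Tuple[str, ...]:
    """Split an encoded string back into its tuple parts."""
    return tuple(encoded.split(SEP))


encoded = encode_name(("flux_vars", "heat_flux"))
decoded = decode_name(encoded)
```

Rejecting parts that already contain the separator keeps the round trip unambiguous, which is the main weakness of string-based conventions compared with tuples.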
Some sort of further indentation seems natural, possibly with truncation like
This is another case where an HTML repr could be powerful, allowing for clearer visual links and potentially interactive expanding/contracting of the tree.
From xarray's perspective, there isn't really a distinction between multiple files and groups in one netCDF file -- it's just a matter of creating a Dataset with data organized in a different way. Presumably we could write helper methods for converting a dimension into a group level (and vice-versa). But it's worth noting that there are still limitations to opening large numbers of files in a single dataset, even with groups, because xarray reads all the metadata for every variable into memory at once, and that metadata is copied in every xarray operation. For this reason, you will still probably want a different data structure (convertible into an xarray.Dataset) when navigating very large datasets like CMIP, which consists of many thousands of files. |
@darothen: Hmm, are your coordinate grids identical for each simulation (ie,
It might work for my case to convert my 'tags' to indexes for new dimensions (ie,
There is still a good reason to have a flexible data model for lumping more heterogeneous collections together under some headings, with the potential for recursion. I suppose my question is, what is the most natural data model & corresponding access syntax?
@shoyer: Your approach is quite clever, and 'smells' much better than parsing strings. I do have two quibbles though.
[Edited for formatting] |
Yes, totally agreed, and I've encountered similar cases in my own work. These sorts of "ragged" arrays are a great use case for groups.
Yes, it's a little confusing because it looks similar to
Yes, it would create a new dataset, which could take ~1 ms. That's slow for inner loops (though we could add caching to help), but plenty fast for interactive use. |
@shoyer I see your point about the string manipulation. On the other hand, this is exactly how h5py and netCDF4-python implement the group/subgroup access syntax: just like a filepath. I'm also having thoughts about the attribute access: if For my own understanding, I tried to translate between
From netCDF4-python
It appears that the only things special about a
A big difference between The As an aside, it seems that ragged arrays are now supported in netCDF4-python: VLen. |
Yes, this is correct. But note that
Yes, this is true. We would possibly want to make another Dataset subclass for the sub-datasets to ensure that their variables are linked to the parent, e.g., But I'm also not convinced this is actually worth the trouble given how easy it is to write NumPy has similar issues, e.g., |
I would be +1 for allowing tuples for data variable names but not for dimension/coordinate names. It indeed looks like using tuples for the latter would be a greater source of confusion and would add too much complexity for only little (or no real?) benefit. I'd be fine with raising an error when loading a netCDF4 file that has groups with conflicting dimensions or when assigning an incompatible Dataset as a new group (e.g., For groups that share common dimensions/coordinates with some differences, a data structure built on top of |
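A check of this kind might look like the sketch below. Each group is represented as a plain mapping from dimension name to size rather than an `xarray.Dataset`, and `check_compatible_dims` is a hypothetical name, not an existing xarray function.

```python
# Hypothetical sketch: plain dicts of {dimension name: size} stand in for
# the dimensions of each group.
from typing import Dict, Mapping


def check_compatible_dims(groups: Mapping[str, Mapping[str, int]]) -> Dict[str, int]:
    """Merge per-group dimension sizes, raising ValueError on any conflict."""
    merged: Dict[str, int] = {}
    owner: Dict[str, str] = {}  # first group in which each dimension was seen
    for group, dims in groups.items():
        for dim, size in dims.items():
            if dim in merged and merged[dim] != size:
                raise ValueError(
                    f"dimension {dim!r} has size {size} in group {group!r} "
                    f"but size {merged[dim]} in group {owner[dim]!r}"
                )
            merged.setdefault(dim, size)
            owner.setdefault(dim, group)
    return merged


# Groups that agree on shared dimensions merge cleanly.
dims = check_compatible_dims(
    {"scalar_vars": {"x": 10, "y": 5}, "flux_vars": {"x": 10}}
)
```

Running the same check both when loading a file with groups and when assigning a new group would keep the flat dataset valid in either case.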
I'm late to the discussion and may be repeating some things essentially already said, but I'd still like to add a further voice. @shoyer said on 8 Nov 2016:
If you prepend the paths to all the names (of dimensions, coordinate variables, and variables) and use the resulting strings as names, don't you just get a collection that would fit right in a My use case is data from a single metmast over time. There are various instruments measuring all kinds of variables of which 10-minute statistics are recorded. I use groups to keep an overview. (I use something like @shoyer said on 30 Mar 2017:
I would prefer the former option, as it more clearly shows the hierarchical nature. If also copying the netCDF4-path-separator-convention, then |
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the |
I did a ctrl-f for zarr in this issue, found nothing, so here's my two cents: it should be possible to write a Datagroup with either zarr or netcdf. |
Hey Folks,

However, one point I didn't see in the discussion is the following: Hierarchical structures often force a user to come up with some arbitrary order of hierarchy levels. The classical example is document filing: do you put your health insurance documents under

One solution to that is a tagging of documents instead of putting them into a hierarchy. This would give the full flexibility to retrieve any flat

Back to the above example, one could think of stuff like:

# get a flat view (DataSet-like object) on all arrays of tagged that have the 'count' tag
ds: DataSet(View) = tagged.tag_select("count")
bar1 = ds.mean(dim="foo")
# get a flat view (DataSet-like object) on all arrays of tagged that have the "train" and "controlled" tags
bar2 = tagged.tag_select("train", "controlled").mean(dim="foo") # order of arguments to `tag_select` is irrelevant!

I hope it is clear what I mean. I know that there are, e.g., some awesome file system plugins (he has incredibly nice high level documentation on the topic) that use such a data model. Just wanted to add that aspect to the discussion even if it might collide with the hierarchical approach!

One side note: If every array in the tagged container has exactly one tag, and tags do not repeat, then the whole thing should be semantically identical to a

Regards, Martin |
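To make the tagging model concrete, here is a minimal sketch of a container whose `tag_select` returns a flat mapping of everything carrying all requested tags. `TaggedContainer` and `add` are illustrative names, plain lists stand in for arrays, and none of this is existing xarray functionality.

```python
# Hypothetical sketch of the tagging idea: each item carries a set of tags,
# and selection is set inclusion, so argument order cannot matter.
from typing import Any, Dict, FrozenSet


class TaggedContainer:
    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}
        self._tags: Dict[str, FrozenSet[str]] = {}

    def add(self, name: str, value: Any, *tags: str) -> None:
        """Store `value` under `name` together with its tags."""
        self._items[name] = value
        self._tags[name] = frozenset(tags)

    def tag_select(self, *tags: str) -> Dict[str, Any]:
        """Return a flat mapping of every item carrying all requested tags."""
        wanted = set(tags)
        return {
            name: value
            for name, value in self._items.items()
            if wanted <= self._tags[name]
        }


tagged = TaggedContainer()
tagged.add("a", [1, 2], "train", "controlled", "count")
tagged.add("b", [3], "train", "count")
tagged.add("c", [4], "test", "controlled")

both = tagged.tag_select("train", "controlled")
swapped = tagged.tag_select("controlled", "train")
```

Since the selection is a subset test on tag sets, the hierarchy-ordering problem does not arise: no tag has to be declared the outer level.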
There's a parallel discussion of hierarchical storage going on over in #4118. I'm going to close this issue in favor of the other one just to keep the ongoing discussion in one place. |
EDIT: see #4118 for ongoing discussion
Probably it has been already suggested, but similarly to netCDF4 groups it would be nice if we could access `Dataset` data variables, coordinates and attributes via groups.

Currently xarray allows loading a specific netCDF4 group into a `Dataset`. Different groups can be loaded as separate `Dataset` objects, which may be then combined into a single, flat `Dataset`. Yet, in some cases it makes sense to represent data as a single object while it would be convenient to keep some nested structure. For example, a `Dataset` representing data on a staggered grid might have `scalar_vars` and `flux_vars` groups. Here are some potential uses for groups. When there are a lot of data variables and/or attributes, it would also help to have a more concise repr.

I think about an implementation of `Dataset.groups` that would be specific to xarray, i.e., independent of any backend, and which would easily co-exist with the flat `Dataset`. It shouldn't be required for a backend to support groups (some existing backends simply don't). It is up to each backend to eventually transpose the `Dataset.groups` logic to its own group logic.

`Dataset.groups` might return a `DatasetGroups` object, which quite similarly to `xarray.core.coordinates.DatasetCoordinates` would (1) have a reference to the Dataset object, (2) basically consist of a Mapping of group names to data variable/coordinate/attribute names and (3) dynamically create another `Dataset` object (sub-dataset) on `__getitem__`. Keys of `Dataset.groups` should be accessible as attributes, e.g., `ds.groups['scalar_vars'] == ds.scalar_vars`.

Questions:
inplace=True
?
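Points (1) to (3) of the proposal can be sketched as follows. This is only an illustration under stated assumptions: `FlatDataset` is a toy stand-in for `xarray.Dataset`, and this `DatasetGroups` is not the class that would actually live in xarray.

```python
# Hypothetical sketch: `FlatDataset` is a toy stand-in for xarray.Dataset,
# and `DatasetGroups` only illustrates points (1)-(3) of the proposal.
from collections.abc import Mapping
from typing import Any, Dict, Iterator, List, Optional


class DatasetGroups(Mapping):
    def __init__(self, dataset: "FlatDataset", groups: Dict[str, List[str]]) -> None:
        self._dataset = dataset  # (1) reference to the parent dataset
        self._groups = groups    # (2) group name -> variable names

    def __getitem__(self, key: str) -> "FlatDataset":
        names = self._groups[key]
        # (3) the sub-dataset is created on demand, not stored
        return FlatDataset({n: self._dataset.variables[n] for n in names})

    def __iter__(self) -> Iterator[str]:
        return iter(self._groups)

    def __len__(self) -> int:
        return len(self._groups)


class FlatDataset:
    def __init__(
        self,
        variables: Dict[str, Any],
        groups: Optional[Dict[str, List[str]]] = None,
    ) -> None:
        self.variables = dict(variables)
        self._group_names = dict(groups or {})

    @property
    def groups(self) -> DatasetGroups:
        return DatasetGroups(self, self._group_names)

    def __getattr__(self, name: str) -> "FlatDataset":
        # Only called when normal lookup fails; expose group names as attributes.
        group_names = self.__dict__.get("_group_names", {})
        if name in group_names:
            return self.groups[name]
        raise AttributeError(name)


ds = FlatDataset(
    {"temperature": [280.0], "pressure": [1000.0], "heat_flux": [1.2]},
    groups={"scalar_vars": ["temperature", "pressure"], "flux_vars": ["heat_flux"]},
)
sub = ds.groups["scalar_vars"]
```

Because each access builds a new sub-dataset, assignments to the sub-dataset would not propagate back to the parent in this sketch; a real design would have to decide how to handle that.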