Datatype for a 'shape specification' of a Dataset / DataArray #6680

Open
mjwillson opened this issue Jun 9, 2022 · 2 comments

Comments

@mjwillson

mjwillson commented Jun 9, 2022

Is your feature request related to a problem?

Often with xarray I find myself having to create a template Dataset or DataArray with dummy data in it just to specify the dimensions/sizes/coordinates/variable names that are required in some situation.

Describe the solution you'd like

It would be very useful to have a datatype that represents a shape specification (dimensions, sizes and coordinates) independently of the data so that we can do things like:

  • Implement xarray equivalents of functions like np.ones, np.zeros, np.random.normal(size=...) that are given a shape specification which the return value should conform to. (I have some more sophisticated / less trivial examples of this too, functions which currently need to be given templates for the return value but only depend on the shape of the template)
  • Test if two DataArrays / Datasets have the same shape
  • Memoize or cache things based on shape (this implies the shape spec would need to be hashable)
  • Make it easier to use xarray with libraries like tree / PyTree that can be used to flatten and unflatten a Dataset into its underlying arrays together with some specification of the shape of the data structure that can be used to unflatten it back again. (Right now I have to implement my own shape specification objects to do this)
  • Manipulate shape specifications e.g. by adding or removing dimensions from them without having to manipulate dummy template data in slightly arbitrary ways (e.g. template.isel(dim_to_be_dropped=0, drop=True)) in order to do this.
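To make the request concrete, here is a minimal sketch of what such a datatype might look like. `ShapeSpec` and `num_elements` are hypothetical names used only for illustration and are not part of xarray. The sketch assumes coordinates are stored as tuples so that the spec stays hashable, and it covers only the equality and memoization use cases from the list above.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Hashable, Tuple


@dataclass(frozen=True)
class ShapeSpec:
    """Hypothetical data-free description of a DataArray's shape (not an xarray API)."""

    dims: Tuple[str, ...]
    sizes: Tuple[int, ...]
    # Coordinates are kept as (dim, values) pairs of tuples so the spec is hashable.
    coords: Tuple[Tuple[str, Tuple[Hashable, ...]], ...] = ()

    def __post_init__(self):
        if len(self.dims) != len(self.sizes):
            raise ValueError("dims and sizes must have the same length")
        size_of = dict(zip(self.dims, self.sizes))
        for dim, values in self.coords:
            if dim not in size_of:
                raise ValueError(f"coordinate refers to unknown dimension {dim!r}")
            if len(values) != size_of[dim]:
                raise ValueError(f"coordinate {dim!r} does not match its dimension size")


@lru_cache(maxsize=None)
def num_elements(spec: ShapeSpec) -> int:
    """Example of caching on a spec: the total number of elements it describes."""
    total = 1
    for size in spec.sizes:
        total *= size
    return total


# Two specs built independently compare equal and hash equal, with no dummy data involved.
a = ShapeSpec(dims=("time", "x"), sizes=(3, 2), coords=(("x", (10, 20)),))
b = ShapeSpec(dims=("time", "x"), sizes=(3, 2), coords=(("x", (10, 20)),))
```

Because the dataclass is frozen, Python generates `__eq__` and `__hash__` from the fields, so "same shape" becomes `a == b` and a spec can be used directly as a cache key or dictionary key.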

Describe alternatives you've considered

I realise that using lazy dask arrays largely removes the performance overhead of manipulating fake data, but (A) it still feels kinda ugly and adds boilerplate to construct the fake data, and (B) not everyone wants to depend on dask.

Additional context

No response

@andersy005
Member

@mjwillson, have you looked at https://github.com/carbonplan/xarray-schema? xarray-schema provides some of the functionality you are looking for...

@mjwillson
Author

Thanks, that looks interesting, although it sounds like it's addressing a slightly different problem. I'm not so much interested in validating external inputs; I'm more interested in having some basic datatypes that can be used to specify dims/shape/coords as templates for DataArrays / Datasets internally within my codebase.

Some things I'm looking for but don't appear to be supported:

  • Support for specifying coords
  • __hash__, __eq__ etc for the Schema objects
  • Convenient APIs to alter and combine these Schemas in similar ways to what can be done with DataArrays / Datasets themselves, e.g. adding/removing dimensions, broadcasting against each other, etc. Ideally these would mirror APIs like expand_dims that are available on DataArray / Dataset themselves, to the extent this makes sense.
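
As a rough illustration of the last point, the sketch below shows how such operations might work on a data-free spec. `DimSpec` and its methods are hypothetical and exist in neither xarray nor xarray-schema. It assumes broadcasting is done by dimension name and that shared dimensions must have equal sizes; coordinates and alignment are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DimSpec:
    """Hypothetical ordered (name, size) pairs; hashable because it only holds tuples."""

    sizes: Tuple[Tuple[str, int], ...]

    @property
    def dims(self) -> Tuple[str, ...]:
        return tuple(name for name, _ in self.sizes)

    def expand_dims(self, name: str, size: int = 1) -> "DimSpec":
        if name in self.dims:
            raise ValueError(f"dimension {name!r} already exists")
        # The new dimension goes first, loosely following DataArray.expand_dims.
        return DimSpec(((name, size),) + self.sizes)

    def drop_dims(self, *names: str) -> "DimSpec":
        missing = set(names) - set(self.dims)
        if missing:
            raise ValueError(f"unknown dimensions: {sorted(missing)}")
        return DimSpec(tuple((n, s) for n, s in self.sizes if n not in names))

    def broadcast(self, other: "DimSpec") -> "DimSpec":
        # Keep this spec's dimension order, then append dimensions only `other` has.
        merged = dict(self.sizes)
        for name, size in other.sizes:
            if merged.setdefault(name, size) != size:
                raise ValueError(f"conflicting sizes for dimension {name!r}")
        return DimSpec(tuple(merged.items()))


spec = DimSpec((("time", 3), ("x", 2)))
batched = spec.expand_dims("batch", 4)            # dims: batch, time, x
without_time = spec.drop_dims("time")             # dims: x
combined = spec.broadcast(DimSpec((("x", 2), ("y", 5))))  # dims: time, x, y
```

Each method returns a new spec rather than mutating in place, which keeps the objects safe to use as cache keys and avoids the `template.isel(..., drop=True)` style of workaround on dummy data.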

Feels to me that it would make sense to have these basic datatypes inside xarray, perhaps with something like xarray-schema providing extra validation helpers etc on top of them? But just my 2 cents :)
