
Add a perception of a __xarray__ magic method #8413

Open
swamidass opened this issue Nov 4, 2023 · 4 comments

Comments

@swamidass

Is your feature request related to a problem?

I am often moving data from external objects (of all sorts!) into xarray. This is a common use case.

Much of this code would be greatly simplified if non-xarray classes had a way of declaring to xarray how their objects can be marshaled into xarray.

Describe the solution you'd like

So here is an initial proposal for comment. Much of this could be implemented in a third party library. But doing this in xarray itself would likely be best.

Magic Methods

It would be great to see these magic method signatures become integrated throughout the library:

__xarray__ -> xr.Dataset | xr.DataArray
__xarray_dataarray__ -> xr.DataArray
__xarray_dataset__ -> xr.Dataset
__xarray_datatree__ -> xr.DataTree   # when DataTree is finally integrated into xarray
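
To make the idea concrete, here is a minimal sketch of a class opting in to the proposed protocol. The `Slide` class and its attributes are hypothetical, and xarray does not currently look for these methods; the example only shows what an implementer might write.

```python
import numpy as np
import xarray as xr


class Slide:
    """Hypothetical non-xarray class holding a 2-D image in memory."""

    def __init__(self, pixels: np.ndarray, name: str = "slide"):
        self.pixels = pixels
        self.name = name

    # Proposed protocol method; xarray does not recognize it today.
    def __xarray__(self) -> xr.DataArray:
        return xr.DataArray(self.pixels, dims=("y", "x"), name=self.name)

    # Proposed Dataset-specific variant, built from the DataArray above.
    def __xarray_dataset__(self) -> xr.Dataset:
        return self.__xarray__().to_dataset()


slide = Slide(np.zeros((4, 6)))
da = slide.__xarray__()
ds = slide.__xarray_dataset__()
```

The class stays free of any registration step: all the conversion knowledge lives next to the data it describes.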

Conversion Registry

And these extension functions to register converters:

# cls is the class being registered; func takes an instance of cls
# (plus optional arguments) and returns the xarray object.
def register_xarray_converter(cls: type, name: str, func: Callable[..., xr.Dataset | xr.DataArray] | None):
  ...
def register_dataarray_converter(cls: type, name: str, func: Callable[..., xr.DataArray] | None):
  ...
def register_dataset_converter(cls: type, name: str, func: Callable[..., xr.Dataset] | None):
  ...
def register_datatree_converter(cls: type, name: str, func: Callable[..., xr.DataTree] | None):  # when DataTree is finally integrated into xarray
  ...

Registering a converter could clash if cls already implements a corresponding __xarray*__ method, or if another converter is already registered for cls. Perhaps add an argument that specifies whether the converter should be added when there is a clash. Perhaps these functions could also return the replaced converter so it can be added back in if needed.

Ideally, "deregister" versions of these functions would also be available, so that context managers that change marshaling behavior could easily be constructed.
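
As a rough sketch of how such a registry might behave, the code below keeps a class-to-converter mapping, returns the replaced converter on registration, and builds a context manager on top of register/deregister. Everything here is an assumption for illustration: the `overwrite` flag, the `xarray_converter` context manager, and the module-level dictionary are hypothetical, and the `name` argument is left out because its role is not spelled out above. The converters return plain strings as stand-ins so the sketch needs nothing beyond the standard library.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator, Optional

Converter = Callable[..., Any]

# Hypothetical module-level registry: class -> converter.
_XARRAY_CONVERTERS: dict[type, Converter] = {}


def register_xarray_converter(
    cls: type, func: Converter, overwrite: bool = False
) -> Optional[Converter]:
    """Register func for cls and return the converter it replaced, if any."""
    previous = _XARRAY_CONVERTERS.get(cls)
    if previous is not None and not overwrite:
        raise ValueError(f"a converter is already registered for {cls.__name__}")
    _XARRAY_CONVERTERS[cls] = func
    return previous


def deregister_xarray_converter(cls: type) -> Optional[Converter]:
    """Remove and return the converter for cls (None if there was none)."""
    return _XARRAY_CONVERTERS.pop(cls, None)


@contextmanager
def xarray_converter(cls: type, func: Converter) -> Iterator[None]:
    """Temporarily install func for cls, restoring the old state on exit."""
    previous = register_xarray_converter(cls, func, overwrite=True)
    try:
        yield
    finally:
        if previous is None:
            deregister_xarray_converter(cls)
        else:
            _XARRAY_CONVERTERS[cls] = previous


class Point:
    """Hypothetical external class."""


register_xarray_converter(Point, lambda p: "default")
with xarray_converter(Point, lambda p: "temporary"):
    inside = _XARRAY_CONVERTERS[Point](Point())
after = _XARRAY_CONVERTERS[Point](Point())
```

Returning the replaced converter is what lets the context manager put the original back without the caller tracking it.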

User API

Along with the following new user API functions:

def as_xarray(x, *args, **kwargs) -> xr.Dataset | xr.DataArray:
  ...
def as_dataarray(x, *args, **kwargs) -> xr.DataArray:
  ...
def as_dataset(x, *args, **kwargs) -> xr.Dataset:
  ...
def as_datatree(x, *args, **kwargs) -> xr.DataTree: # when DataTree is finally integrated into xarray
  ...

"as_xarray" returns (in order of precedence):

  • x unaltered if it is an xarray object
  • registered_xarray_converter(x, *args, **kwargs) if it is callable and does not raise an exception
  • registered_dataset_converter(x, *args, **kwargs) if it is callable and does not raise an exception
  • registered_dataarray_converter(x, *args, **kwargs) if it is callable and does not raise an exception
  • x.__xarray__(*args, **kwargs), if it exists, is callable, and does not raise an exception
  • x.__xarray_dataset__(*args, **kwargs), if it exists, is callable, and does not raise an exception
  • x.__xarray_dataarray__(*args, **kwargs), if it exists, is callable, and does not raise an exception
  • well known aliases of __xarray_dataarray__, such as x.to_xarray(*args, **kwargs) (see pandas)
  • [DESIGN DECISION] convert a tuple[dims, data, [attr, encoding]] to a DataArray and return it?
  • [DESIGN DECISION] convert a tuple encoding of a Dataset and return it?
  • [DESIGN DECISION] wrap a duck-typed array in a DataArray and return it?
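
The precedence above might be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: the `_CONVERTERS` dictionary is a hypothetical stand-in for the registry, the open design decisions (tuples, duck arrays) are left out, and, unlike the list above, exceptions raised by a converter propagate instead of falling through to the next option.

```python
import numpy as np
import xarray as xr

# Hypothetical registry mapping a class to its converter.
_CONVERTERS = {}

# Method names tried in order; "to_xarray" is the existing pandas-style alias.
_METHODS = ("__xarray__", "__xarray_dataset__", "__xarray_dataarray__", "to_xarray")


def as_xarray(x, *args, **kwargs):
    """Sketch of the precedence: xarray object, then registry, then methods."""
    if isinstance(x, (xr.Dataset, xr.DataArray)):
        return x
    # Registered converters win over methods defined on the class itself.
    for cls in type(x).__mro__:
        converter = _CONVERTERS.get(cls)
        if converter is not None:
            return converter(x, *args, **kwargs)
    for name in _METHODS:
        method = getattr(x, name, None)
        if callable(method):
            return method(*args, **kwargs)
    raise TypeError(f"cannot convert {type(x).__name__} to an xarray object")


class Grid:
    """Hypothetical external class implementing the proposed method."""

    def __xarray__(self):
        return xr.DataArray(np.arange(6).reshape(2, 3), dims=("y", "x"))


da = xr.DataArray([1, 2, 3], dims="t")
same = as_xarray(da)            # already an xarray object
from_method = as_xarray(Grid())  # falls through to Grid.__xarray__

# Registering a converter changes the result without touching Grid.
_CONVERTERS[Grid] = lambda g: xr.Dataset({"grid": g.__xarray__()})
from_registry = as_xarray(Grid())
```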

The rationale for putting the registered functions first is that this would enable

"as_dataarray" would be similar, but it would only call x.__xarray_dataarray__ and well known aliases.

"as_dataset" would be similar, but it would only call x.__xarray_dataset__ and well known aliases, perhaps falling back to calling x.__xarray_dataarray__ and converting the returned value to a dataset if it has a name attribute.
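
A small sketch of that fallback, assuming the method names from this proposal (none of which xarray recognizes today). The `Named` and `Unnamed` classes are hypothetical; the only real xarray call involved is `DataArray.to_dataset`, which needs a name for the resulting variable.

```python
import xarray as xr


def as_dataset(x, *args, **kwargs) -> xr.Dataset:
    """Sketch: prefer __xarray_dataset__, else promote a named DataArray."""
    if isinstance(x, xr.Dataset):
        return x
    method = getattr(x, "__xarray_dataset__", None)
    if callable(method):
        return method(*args, **kwargs)
    fallback = getattr(x, "__xarray_dataarray__", None)
    if callable(fallback):
        da = fallback(*args, **kwargs)
        # Only a named DataArray maps cleanly onto a Dataset variable.
        if da.name is not None:
            return da.to_dataset()
    raise TypeError(f"cannot convert {type(x).__name__} to a Dataset")


class Named:
    def __xarray_dataarray__(self):
        return xr.DataArray([1.0, 2.0], dims="t", name="signal")


class Unnamed:
    def __xarray_dataarray__(self):
        return xr.DataArray([1.0, 2.0], dims="t")


ds = as_dataset(Named())
try:
    as_dataset(Unnamed())
    unnamed_failed = False
except TypeError:
    unnamed_failed = True
```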

"as_datatree" would be similar, but it would only call x.__xarray_datatree__, perhaps falling back to calling x.__xarray_dataarray__ and wrapping the result in a single-node datatree. (Though of course at this point this method would probably be implemented by the DataTree package, not xarray.)

The design decisions are flexible from my point of view, and might be decided in whatever way makes the code base simplest or most usable. There is also a question of whether or not this method should default to the backup methods. These decisions can also be deferred entirely by delegating to the converter registry.

Across the Xarray Library

Finally, across the xarray library, there may be places where passing input arguments through as_xarray, as_dataarray, or as_dataset would make a lot of sense. This could be the final thing to do, but cannot be handled by a third party library.

Doing this would give third party libraries another pathway to integrate with xarray, one far easier than the converter registry or explicit calls to as_* functions.

Describe alternatives you've considered

This can be done with a private library. But it seems to be a lot of code that would be pretty useful for other use cases.

Most of this (but not all) can be accomplished in a 3rd party library, but it wouldn't allow the seamless sort of integration seen in (for example) xarray's use of _repr_html_ to integrate with pandas.

The existing backend hooks work great when we are marshaling from file-based sources. See, for example, tiffslide-xarray (https://github.com/swamidasslab/tiffslide-xarray). This approach is seamless for reading files, but cannot marshal objects. For example, this is possible:

x = xr.open_dataset("slide.tiff")

But this doesn't work.

t = tiffslide.TiffSlide("slide.tiff")
x = xr.open_dataset(t) # won't work
x = xr.DataArray(t) # won't work either

This is an important use case because there are cases where we want to create an xarray like this from objects that are never stored on the filesystem.

Additional context

No response


welcome bot commented Nov 4, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@max-sixty
Collaborator

Thanks for the issue @swamidass !

There is precedent for this: for example, in Rust there are .from & .into methods that any struct can implement to convert into a struct from another library, and vice versa.

One thing this could interface with is subtyping — if a CustomDataset defined a form of __dataarray_slice__, then xarray could create a data array of that class when slicing. (#3980)

I agree that it could start as another library — or just in your own code initially — I don't think there's actually much need for this to be in xarray at the start. Probably the helpful thing here is to get feedback from others, and then coalesce on a standard over time.

@swamidass
Author

The slice idea is a good one too.

Yup, asking for feedback.

Any idea if and when Datatree is gonna get rolled in?

@TomNicholas
Member

Thanks for the interesting suggestion @swamidass!

I might be missing something, but what's the advantage of doing this over the other class just implementing a .to_xarray() method?

I'm also wondering if there is any precedent for this pattern in Pandas? That might be useful to know as a similar prior example. I guess this suggestion is somewhat similar to pandas using apache arrow...?

> Any idea if and when Datatree is gonna get rolled in?

When I get around to it / when I get some help 😅 If that's something you're interested in then that would be amazing. It honestly shouldn't actually be particularly hard, mostly just copy-pasting and making sure it's all up to xarray code review standards. However, at this point it may make sense to wait for the NamedArray refactor to progress a bit further first; see xarray-contrib/datatree#270.
