Duck array compatibility meeting #5648

Open
TomNicholas opened this issue Jul 29, 2021 · 31 comments
Labels
topic-arrays related to flexible array support upstream issue

Comments

@TomNicholas
Member

TomNicholas commented Jul 29, 2021

Proposal: hold a high-level inter-library meeting to sort out roadblocks in the duck-array wrapping efforts.

Whilst trying to get dask, pint and xarray all working nicely together, I couldn't help but notice there are important issues which conclude with a shared sentiment that "we just need to make a decision as to what wraps what" but since then have had essentially no codified consensus, and hence no progress for the past year. Multiply-nested duck-array wrapping is complicated and involves a lot of separate libraries (as this graph of potential wrappings shows), but could be an amazingly powerful feature!

[image: graph of potential duck-array wrappings]

I suggest that as asynchronous discussion hasn't moved this forward, we should instead hold a (hopefully one-off) meeting to make these high-level design decisions.

I'm happy to arrange the meeting, but for this to work we ideally need attendees who understand the issues from the perspective of each of the main libraries involved - some suggestions:

Possible Agenda (please suggest additions!):

  • Which libraries should wrap which other libraries
  • Repo/NEP/etc. for standardizing wrapping order and other future decisions
  • Outstanding issues to tackle first
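
To make the first agenda item concrete, here is a minimal sketch of how an agreed wrapping order could be written down and checked. The ordering shown is only an illustration based on the xarray / pint / dask nesting discussed in this thread, not a decided hierarchy, and `WRAPPING_ORDER`, `may_wrap` and `validate_nesting` are hypothetical names.

```python
# Hypothetical wrapping order, outermost first. Illustration only:
# the thread does not settle on an actual hierarchy.
WRAPPING_ORDER = ["xarray", "pint", "dask", "numpy"]


def may_wrap(outer, inner):
    """Return True if `outer` is allowed to wrap `inner` under WRAPPING_ORDER."""
    return WRAPPING_ORDER.index(outer) < WRAPPING_ORDER.index(inner)


def validate_nesting(layers):
    """Check that a nesting, listed outermost first, respects the order."""
    return all(may_wrap(a, b) for a, b in zip(layers, layers[1:]))


print(validate_nesting(["xarray", "pint", "dask", "numpy"]))
print(validate_nesting(["dask", "pint"]))
```

A single shared list like this is the simplest form such a resource could take; a real hierarchy would probably need to be a graph, since several base array types sit side by side at the bottom.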

Background reading

Some related issues (there are many more - please add)

@TomNicholas TomNicholas added upstream issue topic-arrays related to flexible array support labels Jul 29, 2021
@keewis
Collaborator

keewis commented Jul 29, 2021

cc @hameerabbasi for sparse, and @amcnicho might also be interested

@mrocklin
Contributor

I would be happy to attend and look forward to what I'm sure will be a vigorous discussion :) Thank you for providing convenient links to reading materials ahead of time.

As a warning, my responsiveness to github comments these days is not what it used to be. If I miss something here then please forgive me.

@hameerabbasi

I'd also be happy to attend. Keep in mind I'm in the CET timezone.

@rgommers

I'm happy to join, seems interesting. And yes, I can say something about PyTorch. There probably isn't much to say though - PyTorch is unlikely to adopt __array_function__ at this point, just like JAX. And it doesn't seem critical for this hierarchy anyway - the fundamental array objects (PyTorch/CuPy/NumPy/Sparse/JAX arrays or tensors) do not have or need a class hierarchy, they are all at the bottom and should not be mixed.

The key thing here seems to be Dask <-> Xarray <-> Pint, unless I'm missing something?

@jacobtomlinson
Contributor

Happy to attend. It might also be useful to have @pentschev involved too.

@TomNicholas
Member Author

Thanks to everyone who has expressed interest already! I'll give some more time for responses before starting to think about potential meeting times.

There probably isn't much to say though - PyTorch is unlikely to adopt __array_function__ at this point, just like JAX.

Interesting - could you say a bit more? Looking at these two issues, it seemed more like the question was simply on hold until someone who wanted it badly enough came along?

jax-ml/jax#1565

pytorch/pytorch#22402

(I'm interested because I'm in the set of people who have an ambition to use these libraries (wrapped in xarray) at some point further down the line, but that might remain just an ambition 😅)

The key thing here seems to be Dask <-> Xarray <-> Pint, unless I'm missing something?

You're right, that was the primary motivation for this issue, but it also seemed like a good opportunity to get as many libraries talking (and their relationships codified) as possible, especially if one of the outcomes ends up being a type casting hierarchy resource/NEP.

@jthielen
Contributor

jthielen commented Jul 30, 2021

Count me in for the meeting!


Here are a few suggestions about possible topics to add to the agenda (based on linked issues/discussions), if we can fit it all in:

  • Canonical/minimal API of a "duck array" and how to detect it (though may be superseded by NEPs 30 and 47 among others)
  • Consistency of type deferral (i.e., between construction, binary ops, __array_ufunc__, __array_function__, and array modules...for example, these are uniform in Pint, but construction and array module functions are deliberately different from the others for Dask arrays)
  • API for inter-type casting and changing what types are used in a nested array (e.g. sparse and other duck array issues #3245 and Add to_numpy() and as_numpy() methods #5568)
  • How to handle unknown duck arrays
  • Nested array reprs (both short and full)
  • Best practices for "carrying through" operations belonging to wrapped types (i.e., doing Dask-related things to a Pint Quantity or xarray DataArray that contains a Dask array), even if multiple layers deep
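
As an illustration of the type-deferral point above, the following is a minimal sketch of how deferral in binary operations works in plain Python: the lower-level type returns `NotImplemented` when it meets a type that is meant to wrap it, so the wrapper's reflected method takes over. `Lazy` and `Quantity` are hypothetical stand-ins, not the Dask or Pint classes, and the real libraries apply the same idea across `__array_ufunc__` and `__array_function__` as well.

```python
class Lazy:
    """Stand-in for a lower-level array type (hypothetical, not dask)."""

    def __init__(self, value):
        self.value = value

    def __mul__(self, other):
        # Defer to the type that is meant to wrap this one.
        if isinstance(other, Quantity):
            return NotImplemented
        other_value = other.value if isinstance(other, Lazy) else other
        return Lazy(self.value * other_value)


class Quantity:
    """Stand-in for a units-carrying wrapper (hypothetical, not pint)."""

    def __init__(self, magnitude, units):
        self.magnitude = magnitude
        self.units = units

    def __mul__(self, other):
        if isinstance(other, Quantity):
            units = f"{self.units}*{other.units}"
            return Quantity(self.magnitude * other.magnitude, units)
        # Anything else is treated as a unitless operand and wrapped.
        return Quantity(self.magnitude * other, self.units)

    # Multiplication is commutative in this sketch, so reuse __mul__.
    __rmul__ = __mul__


# Lazy.__mul__ returns NotImplemented, so Python calls Quantity.__rmul__
# and the result keeps Quantity on the outside.
product = Lazy(2) * Quantity(Lazy(3), "m")
print(type(product).__name__, type(product.magnitude).__name__)
```

The result has the same nesting whichever operand comes first, which is the consistency the agenda item asks for.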

Also, tagging a few other array type libraries and maintainers/contributors who may be interested (please ping the relevant folks if you know them):

(interesting side note is the first three of these are all ndarray subclasses right now...perhaps discussing the interplay between array subclassing and wrapping is in order too?)

@benbovy
Member

benbovy commented Jul 30, 2021

@TomNicholas
Member Author

NumPy masked arrays (??)

Tagging @greglucas because of his work on numpy/numpy#16022 (comment) - at least I think a very common desired use case of multi-nested arrays would be xarray(pint(dask(np.masked))).
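
A nesting like the one above can be pictured with a small sketch that walks the layers from outermost to innermost and builds a short nested repr. The `Layer` class and its `inner` attribute are hypothetical placeholders for illustration; the real libraries each expose their wrapped array under a different attribute.

```python
class Layer:
    """Hypothetical wrapper layer; `inner` is an assumed attribute name."""

    def __init__(self, name, inner=None):
        self.name = name
        self.inner = inner


def nesting(obj):
    """Return layer names from outermost to innermost."""
    names = []
    while obj is not None:
        names.append(obj.name)
        obj = obj.inner
    return names


def short_repr(obj):
    """Build a compact repr such as outer(middle(inner))."""
    names = nesting(obj)
    return "(".join(names) + ")" * (len(names) - 1)


nested = Layer("xarray", Layer("pint", Layer("dask", Layer("np.masked"))))
print(short_repr(nested))
```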

@pentschev
Contributor

I'm also happy to join the meeting. Thanks @TomNicholas for the initiative here and @jacobtomlinson for tagging me.

@greglucas

Happy to hop on a call for this as well, thanks for organizing all of this @TomNicholas!

@jpivarski

I'm interested. Let us know when the time will be or if there's a poll for picking a time. Thanks!

@rgommers

Interesting - could you say a bit more? Looking at these two issues, it seemed more like the question was simply on hold until someone who wanted it badly enough came along?

There is a significant backwards compatibility break when a library adds __array_ufunc__ and __array_function__. JAX maintainers were not comfortable with that. @shoyer wrote https://numpy.org/neps/nep-0037-array-module.html as a follow-up largely because of that. That NEP is effectively superseded by the array API standard (https://data-apis.org/array-api/latest/ and NEP 47). PyTorch has decided to adopt that and implementation is well underway. Experimental support in NumPy will land next week (complete except for linalg). CuPy and MXNet plan to follow that NumPy implementation. JAX and Dask have not yet decided, I believe, but are likely to follow NumPy as well.

Canonical/minimal API of a "duck array" and how to detect it (though may be superseded by NEPs 30 and 47 among others)

This is basically what the array API standard provides (most functions follow NumPy, with deviations mostly where other libraries were already deviating because they could implement something on GPU, with a JIT, or with a non-strided memory model). __array_function__ has worked quite well for CuPy and Dask, because they follow the NumPy API almost to the letter, with only a couple of exceptions (e.g. 0-D array instead of array scalars in CuPy). JAX, PyTorch and MXNet all deviate much more, and since the NumPy API is not very well-defined (there's 1500+ public objects plus more semi-public ones), you'd have no guarantees about what works and what doesn't.

That said, __array_ufunc__ and __array_function__ aren't going anywhere. The RAPIDS ecosystem is invested in it and I believe largely happy with it. So adding Xarray and Pint to the mix sounds potentially interesting.
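
For readers wondering what "detecting a duck array" might look like in practice, here is a minimal sketch that checks for basic array attributes plus at least one of the dispatch protocols named in this thread (`__array_function__`, `__array_ufunc__`, or the array API standard's `__array_namespace__`). The function `is_duck_array` and the `FakeArray` class are illustrative assumptions, not the check used by any particular library.

```python
def is_duck_array(obj):
    """Heuristic: array-like attributes plus at least one dispatch protocol."""
    has_array_attrs = all(hasattr(obj, name) for name in ("ndim", "shape", "dtype"))
    has_protocol = any(
        hasattr(obj, name)
        for name in ("__array_function__", "__array_ufunc__", "__array_namespace__")
    )
    return has_array_attrs and has_protocol


class FakeArray:
    """Hypothetical array type that only exposes the array API entry point."""

    ndim = 1
    shape = (3,)
    dtype = "float64"

    def __array_namespace__(self, api_version=None):
        # A real implementation would return the library's array namespace.
        return None


print(is_duck_array(FakeArray()))
print(is_duck_array([1.0, 2.0, 3.0]))
```

An attribute check like this says nothing about how much of the NumPy API an object actually supports, which is the gap described above for libraries that deviate more from NumPy.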

@SimonHeybrock

* scipp (@SimonHeybrock, xref [NEP 18, physical units, uncertainties, and the scipp library? #3509](https://github.com/pydata/xarray/issues/3509))

Thanks! I am definitely interested.

@khaeru

khaeru commented Aug 11, 2021

👂🏾

@TomNicholas
Member Author

Apologies for the long delay, but I would like to try to organise this meeting for some time in the next few weeks. If you are interested could you please fill out this When2Meet meeting time poll. (If none of these times work for you because of time zone issues or something then please say so!) It would be great to have lots of libraries represented.

Agenda doc here - I will come back to this in more detail soon, but if you have specific things you would like to discuss then please add them.

@TomNicholas
Member Author

TomNicholas commented Sep 21, 2021

TOMORROW: So there is just one slot when essentially everyone said they were free - but it's 11:00am EDT September 22nd, i.e. tomorrow morning.

I appreciate that's late notice - but could we try for that, and if we only get the super keen beans attending this time then we would still be able to have a useful initial discussion about the exact problems that need resolving?

Alternatively if people could react to this comment with thumbs up / down for "that time is still good" / "no I need more notice" then that would be useful.

@mrocklin are you or anyone else who is able to speak for dask free? I noticed that @jacobtomlinson put down that you were free tomorrow also?

EDIT: @keewis I'm pinging you too

@mrocklin
Contributor

mrocklin commented Sep 21, 2021 via email

@jacobtomlinson
Contributor

jacobtomlinson commented Sep 21, 2021

I can be there! @jakirkham and @pentschev are also Dask maintainers so there should be good representation.

@jakirkham

Maybe too soon to ask, but do we have a link to the video call?

@TomNicholas
Member Author

Surprisingly I happen to be free tomorrow at exactly that time.

Great! I sent an invite to the few people whose emails I could easily find - let me know if that didn't arrive @mrocklin .

Maybe too soon to ask, but do we have a link to the video call?

I think we can probably just use the one we use for the xarray bi-weekly dev meetings:

https://us02web.zoom.us/j/88251613296?pwd=azZsSkU1UWJZTVFKNnhIUVdZcENUZz09

@crusaderky
Contributor

I'd like to attend too

@beckernick

I'd also like to attend, primarily just to learn.

@leofang

leofang commented Sep 22, 2021

Will try to join.

@hodgestar

I'd like to attend on behalf of QuTiP. I'll likely mostly listen -- QuTiP is not directly affected in the way that xarray, Dask, cupy, etc are, but we're users of potentially all of these array types (and already explicitly support numpy, CuPy, and our own sparse array format) and we are facing similar issues of our own (i.e. users of QuTiP are asking us to develop a __qutip_qobj__ style protocol similar to numpy's and we would like to learn the lessons of the last decade of numpy rather than repeat the steps ourselves in the coming one).

@TomNicholas
Member Author

TomNicholas commented Sep 22, 2021

Change of plan - use my personal zoom room instead (as I don't think I have permissions to start the xarray one):

https://columbiauniversity.zoom.us/j/97399260034?pwd=WS9GUnVrcG1YSWRPeXhITXRhdEJZUT09

Also unless anyone has a problem I'm going to record the meeting, just for note-taking purposes. (EDIT: I'll just take notes as the meeting goes on instead)

@hameerabbasi

I would very much prefer not to be recorded.

@mrocklin
Contributor

mrocklin commented Sep 22, 2021 via email

@rgommers

rgommers commented Sep 22, 2021

There are also some relevant and very interesting PyTorch development discussions; there are a lot of Tensor subclasses in the making and thought being put into how those interact:

@TomNicholas
Member Author

TomNicholas commented Sep 22, 2021

Thank you very much to everyone who came today (all 18 of us!). The notes from the meeting are here.

I've raised the ToDos we had as an issue in the new duck array discussion repo - anyone interested in this topic might want to watch the activity on that repository.

The plan now is to continue via asynchronous discussion on that new repo, referring back to issues in specific libraries when necessary. If we come to another sticking point that requires in-person discussion then we can organise another face-to-face meeting at that point, which I am happy to do. However we won't organise another meeting until there is a specific blockage to discuss. I will therefore keep this issue open so I can use it to alert anyone who might be interested in a follow-up meeting.

@jakirkham

If you haven't already, would be good if those running into issues here could look over the Array API. This is still something that is being worked on, but the goal is to standardize Array APIs. If there are things missing from that, it would be good to hear about them in a new issue.
