Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for duck Dask Arrays #4208

Closed
jthielen opened this issue Jul 8, 2020 · 18 comments · Fixed by #4221
Closed

Support for duck Dask Arrays #4208

jthielen opened this issue Jul 8, 2020 · 18 comments · Fixed by #4221
Labels
topic-arrays related to flexible array support

Comments

@jthielen
Copy link
Contributor

jthielen commented Jul 8, 2020

#525 (comment) raised the idea of adding "duck Dask Array" support to xarray as a way to handle xarray > Pint Quantity > Dask Array wrapping in a way that still allowed for most of xarray's Dask integration to work properly. With @rpmanser working on implementing the Dask collection interface in Pint (hgrecco/pint#1129), I thought it best to elevate this to its own issue to track progress and discuss implementation on xarray's side (since hopefully @rpmanser or I can get started on it soon).

Two initial (and intertwined) discussion points that I'd like to bring up (xref hgrecco/pint#1129 (comment)):

  • How should xarray check for a duck Dask Array?
  • Is it acceptable for a Pint Quantity to always have the Dask collection interface defined (i.e., be a duck Dask array), even when its magnitude (what it wraps) is not a Dask Array?

cc @keewis, @shoyer, @crusaderky

@shoyer
Copy link
Member

shoyer commented Jul 8, 2020

Maybe something like this would work?

def is_duck_dask_array(x):
  return getattr(x, 'chunks', None) is not None

xarray.DataArray would pass this test (chunks is either None for non-dask arrays or a tuple for dask arrays), so this would be consistent with what we already do.

@jthielen
Copy link
Contributor Author

jthielen commented Jul 8, 2020

Maybe something like this would work?

def is_duck_dask_array(x):
  return getattr(x, 'chunks', None) is not None

xarray.DataArray would pass this test (chunks is either None for non-dask arrays or a tuple for dask arrays), so this would be consistent with what we already do.

That would be a straightforward solution to both problems! A Pint Quantity containing a Dask Array passes along the chunks attribute from the Dask Array, and a Pint Quantity containing something else will raise an AttributeError. Unless there are other objections, I'll see what it will take to swap out the existing Dask checks for this in the xarray internals and hopefully get around to a PR (after I get some MetPy stuff done first).

@shoyer
Copy link
Member

shoyer commented Jul 9, 2020

It might also make sense to check for one or more of the special dask collection attributes (__dask_graph__, __dask_keys__, etc)

@crusaderky
Copy link
Contributor

crusaderky commented Jul 9, 2020

Is it acceptable for a Pint Quantity to always have the Dask collection interface defined (i.e., be a duck Dask array), even when its magnitude (what it wraps) is not a Dask Array?

I think there are already enough headaches with __iter__ being always defined and confusing libraries such as pandas (hgrecco/pint#1128).
I don't see why pint should be explicitly aware of dask (except in unit tests)? It should only deal with generic NEP18-compatible libraries (numpy, dask, sparse, cupy, etc.).

How should xarray check for a duck Dask Array?

We should ask the dask team to formalize what defines a "dask-array-like", like they already did with dask collections, and implement their definition in xarray.
I'd personally make it "whatever defines a numpy-array-like AND has a chunks method AND the chunks method returns a tuple".

@TomNicholas TomNicholas added the topic-arrays related to flexible array support label Jul 9, 2020
@jthielen
Copy link
Contributor Author

jthielen commented Jul 9, 2020

I think there are already enough headaches with __iter__ being always defined and confusing libraries such as pandas (hgrecco/pint#1128).
I don't see why pint should be explicitly aware of dask (except in unit tests)? It should only deal with generic NEP18-compatible libraries (numpy, dask, sparse, cupy, etc.).

Since Pint wraps Dask, in order to leverage Dask Array functionality on Pint Quantities, we need to have the Dask collection interface available. In a sense, Pint needs special handling for Dask like xarray Variables do since they both can be upcast types of Dask Array. Implicitly passing through attributes (how Pint handles special methods/attributes of downcast types in general) from the wrapped Dask Array is not sufficient, however, because the finalizers have to rewrap with Quantity (see https://github.com/hgrecco/pint/pull/1129/files#diff-d9924213798d0fc092b8cff13928d747R1947-R1950), hence the explicit awareness of Dask being needed in Pint.

We should ask the dask team to formalize what defines a "dask-array-like", like they already did with dask collections, and implement their definition in xarray.

Done! See dask/dask#6385.

@jthielen
Copy link
Contributor Author

jthielen commented Jul 9, 2020

Based on @mrocklin's comment in dask/dask#6385, the plan will be to check for duck Dask Arrays with dask.base.is_dask_collection along with xarray's previously used duck array check. This works properly with Pint Quantities as implemented in hgrecco/pint#1129 (returning True if the Pint Quantity contains a Dask Array, and False if it does not).

@rpmanser
Copy link
Contributor

rpmanser commented Jul 9, 2020

I can go ahead with putting together a PR for this. Before I do so, I'd like to clarify what is expected.

  • Implement the is_duck_dask_array() function
  • In that implementation, use dask.base.is_dask_collection() and the existing duck array check(s)
  • Replace isinstance(x, dask.array.Array) checks with the new is_dask_duck_array() function

I searched for existing duck array checks in xarray and nothing immediately obvious to me is showing up. It looks like a check for __array_function__ is inappropriate based on discussion in #3917. Could someone point out the proper duck array check?

@dcherian
Copy link
Contributor

dcherian commented Jul 9, 2020

We have https://github.com/pydata/xarray/blob/master/xarray/core/pycompat.py which defines dask_array_type and sparse_array_type and then use isinstance(da, dask_array_type) in a bunch of places (e.g. duck_array_ops).

re duck array check: @keewis added this recently

xarray/xarray/core/utils.py

Lines 250 to 253 in f3ca63a

def is_array_like(value: Any) -> bool:
return (
hasattr(value, "ndim") and hasattr(value, "shape") and hasattr(value, "dtype")
)

@jthielen
Copy link
Contributor Author

jthielen commented Jul 9, 2020

@rpmanser That sounds like a good plan to me at least, but it would be great if any of the xarray maintainers would be able to chime in. Also, thanks again for being willing to work on this while I try working on dask/dask#4583. The hidden 4th step is of course testing--primarily that this doesn't break existing functionality, but also that it works for duck Dask Arrays other than Dask Arrays themselves (not sure if Pint Quantities in upcoming v0.15 or a mocked class would be better).

Also, thank you @dcherian for pointing out those checks, you found them faster than I did!

@dopplershift
Copy link
Contributor

dopplershift commented Jul 11, 2020

Does/should any of this also consider #4212 (CuPy)?

@jthielen
Copy link
Contributor Author

Does/should any of this also consider #4212 (CuPy)?

Only indirectly, since this deals with duck Dask arrays (things like Pint that go between xarray and Dask) rather than Dask chunks, which CuPy would be. But, once this, #4212, hgrecco/pint#964, and dask/dask#6393 are all in place, then we can test if xarray( pint( dask( cupy ))) works automatically from it all or not.

@dcherian
Copy link
Contributor

A couple of things came up in #4221

  1. how do we ask a duck dask array to rechunk itself? pint seems to forward the .rechunk call but that isn't formalized anywhere AFAICT.
  2. less important: should duck dask arrays cache their token somewhere? dask.array uses .name to do this and xarray uses that to check equality cheaply. We can use tokenize of course. But I'm wondering if it's worth asking duck dask arrays to cache their token as an optimization.

@mrocklin
Copy link
Contributor

In Xarray we implemented the Dask collection spec. https://docs.dask.org/en/latest/custom-collections.html#the-dask-collection-interface

We might want to do that with Pint as well, if they're going to contain Dask things. That way Dask operations like dask.persist, dask.visualize, and dask.compute will work normally.

@mrocklin
Copy link
Contributor

My guess is that we could steal the xarray.DataArray implementations over to Pint without causing harm.

@jthielen
Copy link
Contributor Author

jthielen commented Jul 23, 2020

In Xarray we implemented the Dask collection spec. https://docs.dask.org/en/latest/custom-collections.html#the-dask-collection-interface

We might want to do that with Pint as well, if they're going to contain Dask things. That way Dask operations like dask.persist, dask.visualize, and dask.compute will work normally.

That's exactly what's been done in Pint (see hgrecco/pint#1129)! @dcherian's points go beyond just that and address what Pint hasn't covered yet through the standard collection interface.

@mrocklin
Copy link
Contributor

That's exactly what's been done in Pint (see hgrecco/pint#1129)! @dcherian's points go beyond just that and address what Pint hasn't covered yet through the standard collection interface.

Ah, great. My bad.

how do we ask a duck dask array to rechunk itself? pint seems to forward the .rechunk call but that isn't formalized anywhere AFAICT.

I think that you would want to make a pint array rechunk method that called down to the dask array rechunk method. My guess is that this might come up in other situations as well.

less important: should duck dask arrays cache their token somewhere? dask.array uses .name to do this and xarray uses that to check equality cheaply. We can use tokenize of course. But I'm wondering if it's worth asking duck dask arrays to cache their token as an optimization.

I think that implementing the dask.base.normalize_token method should be fine. This will probably be very fast because you're probably just returning the name of the underlying dask array as well as the unit of the pint array/quatity. I don't think that caching would be necessary here.

It's also possible that we could look at the __dask_layers__ method to get this information. My memory is a bit fuzzy here though.

@dcherian
Copy link
Contributor

Re:rechunk, this should be part of the spec I guess. We need this for DataArray.chunk().

xarray does do some automatic rechunking in variable.py. But this comment:

            # chunked data should come out with the same chunks; this makes
            # it feasible to combine shifted and unshifted data
            # TODO: remove this once dask.array automatically aligns chunks

suggest that we could delete that automatic rechunking today.

This will probably be very fast because you're probably just returning the name of the underlying dask array as well as the unit of the pint array/quatity.

ah yes, we can rely on the underlying array library to optimize this.

@mrocklin
Copy link
Contributor

Dask collections tokenize quickly. We just use the name I think.

dcherian added a commit that referenced this issue Sep 2, 2020
* Change isinstance checks to duck Dask Array checks #4208

* Use is_dask_collection in is_duck_dask_array

* Use is_dask_collection in is_duck_dask_array

* Revert to isinstance checks according to review discussion

* Move is_duck_dask_array to pycompat.py and use tokenize for comparisons

* isort

* Implement `is_duck_array` to replace `is_array_like`

* Rename `is_array_like` to `is_duck_array`

* `is_duck_array` checks for `__array_function__` and `__array_ufunc__`
  in addition to previous checks

* Replace checks for `is_duck_dask_array` and `__array_function__` with
  `is_duck_array`

* Skip numpy duck array tests when NEP18 is not active

* Use utils.is_duck_array in xarray/core/formatting.py

* Replace locally defined `is_duck_array` in _diff_mapping_repr

* Replace `"__array_function__"` and `is_duck_dask_array` check in
  `short_data_repr`

* Revert back to isinstance check for iris cube

* Add is_duck_array_or_ndarray function to utils

* Use is_duck_array_or_ndarray for duck array checks without NEP18

* Remove is_duck_dask_array_or_ndarray, replace checks with is_duck_array

* Add explicit check for NumPy array to is_duck_array

* Replace is_duck_array_or_ndarray checks with is_duck_array

* Remove is_duck_array check for deep copy

Co-authored-by: keewis <[email protected]>

* Use is_duck_array check in load

* Move duck dask array tokenize tests from test_units.py to test_dask.py

* Use _importorskip to require pint >=0.15 instead of pytest.mark.skipif

Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: keewis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-arrays related to flexible array support
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants