Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Dask collection interface #1129

Merged
merged 9 commits into from
Jul 8, 2020
Merged

Implement Dask collection interface #1129

merged 9 commits into from
Jul 8, 2020

Conversation

rpmanser
Copy link
Contributor

@rpmanser rpmanser commented Jul 3, 2020

This pull request implements the Dask collection interface for the Quantity class and adds convenience methods compute(), persist(), and visualize(). As is, only the convenience methods are wrapped to check that the magnitude of the Quantity is a dask array, which should cover current use cases. Ping @jthielen since we've been discussing this.

I have not tested persist() on a HPC cluster, which will have different behavior than when it is called on a single machine (e.g., desktop). Is there a way to integrate tests for distributed cases? If not, I can test this independently if needed.

Add `__dask_*__` methods to the Quantity class. Add convenience
functions `compute()`, `persist()`, and `visualize()`. Wrap these
convenience functions to raise an AttributeError if the wrapped object
is not a Dask Array.
Update downcast tests to include Dask Array. Add testing module for
Dask collection interface and convenience methods.
@keewis
Copy link
Contributor

keewis commented Jul 3, 2020

xarray's distributed tests might be useful as an example? They use the loop fixture and the Cluster and Client context managers to set up a fake distributed cluster.

@rpmanser
Copy link
Contributor Author

rpmanser commented Jul 3, 2020

I'll take a look and see if I can setup something similar. Thanks!

Change these methods to `@property`s and have them redirect to the
appropriate Dask Array methods. Tests are included. Add a
test utility function for incrementing test quantities.
pint/testsuite/test_dask.py Outdated Show resolved Hide resolved
pint/testsuite/test_dask.py Outdated Show resolved Hide resolved
pint/testsuite/test_dask.py Show resolved Hide resolved
@rpmanser
Copy link
Contributor Author

rpmanser commented Jul 6, 2020

I made all of the suggested changes and added tests for visualize and explicitly checking that compute() and persist() return the same results for the single machine case. I will be working on distributed tests next.

@hgrecco
Copy link
Owner

hgrecco commented Jul 7, 2020

This is looking really good, please feel free to ping me when you want me to review it and merge it.

I would like to take the opportunity to ask a question: When are we going to provide an external package (e.g. pint-pandas) and when are we going to include it in pint. I am not suggesting to create pint-dask but rather to start building a rationale about pint extensions.

@dopplershift
Copy link
Contributor

My concern with extension packages is that they won't necessarily receive the same level of maintenance as the core pint library. As a maintainer of a downstream library, I'm always concerned about adding more dependencies.

What do people see as the benefits of having external packages?

@jthielen
Copy link
Contributor

jthielen commented Jul 7, 2020

@rpmanser As far as implementing the collection interface and testing it, this looks great! One small suggestion I would have is referring to dask/dask#4583 in the NumPy doc example, since it is Dask holding us back from the usual "multiply on the right" construction. Also, I think this gets enough of #883 taken care of to consider it closed, but I think a follow-up issue should be raised as a reminder to finish up tests once dask/dask#4583 is resolved (masked arrays and numpy/numpy#15200 are in the same position).

One bigger-picture question/discussion point: in pydata/xarray#525 and #883 (and in other discussions), I was thinking of Pint Quantities "acting like a Dask Array" or "being a duck Dask Array" would occur if and only if the Quantity is wrapping a Dask Array. That is, a library like xarray could recognize a suitable Dask Array-like by a test like is_duck_array(data) and hasattr(data, "__dask_graph__") or similar, and this test could be True if the magnitude of the Quantity was a Dask Array and False if not. However, with the implementation here, a Pint Quantity will always be a duck Dask Array since the Dask interface methods will always be available (though of course not always usable). Also, in seeing the implementation here, I've realized that a hacky implementation of these methods through attachment in __new__ or routing through __getattr__ would be needed to achieve my initial idea (rather than a simple decorator check).

So, with that in mind, is it okay for xarray (and any other libraries depending on the duck Dask Array interfacing of Quantity) to have Quantity always be a duck Dask Array, or do these interface methods need to be "hidden" and only available when the magnitude of a Quantity is a Dask Array? While there ostensibly is another option (a "DaskQuantity" subclass that has these methods available), I think that would make the situation worse, since too much magic/special casing may be required to get a DaskQuantity as the result of arbitrary operations between Dask Arrays and regular Quantities.

@rpmanser
Copy link
Contributor Author

rpmanser commented Jul 8, 2020

This is ready for your review @hgrecco. Looks like the travis CI has been failing due to the -rc flags given to isort, but otherwise black -t py36 . && isort -rc . && flake8 runs without error.

@dopplershift
Copy link
Contributor

@jthielen In the absence of documentation on some kind of protocol, this might be one of those things that has to be worked out in practice. Having said that, personally I don't consider __getattr__ a hack for this use case.

@jthielen
Copy link
Contributor

jthielen commented Jul 8, 2020

@jthielen In the absence of documentation on some kind of protocol, this might be one of those things that has to be worked out in practice. Having said that, personally I don't consider __getattr__ a hack for this use case.

Sounds good. I've also asked over in xarray (pydata/xarray#4208), so we'll see if we get any feedback there.

@hgrecco
Copy link
Owner

hgrecco commented Jul 8, 2020

bors r+

@hgrecco
Copy link
Owner

hgrecco commented Jul 8, 2020

Congratulations @rpmanser for the awesome work!

@bors
Copy link
Contributor

bors bot commented Jul 8, 2020

Build succeeded:

@bors bors bot merged commit f1ddf49 into hgrecco:master Jul 8, 2020
@rpmanser
Copy link
Contributor Author

rpmanser commented Jul 8, 2020

Congratulations @rpmanser for the awesome work!

Thanks, happy I could help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integration with Dask (add tests; implement the Dask collection interface on Quantity)
5 participants