
daskified unyt arrays #185

Merged — 69 commits, merged Sep 10, 2022
Conversation

@chrishavlin (Contributor) commented Apr 14, 2021

This PR introduces the unyt_dask_array class, a subclass of the standard dask array with units attached. Still a work in progress, but it is generally usable now!

Basic usage (also shown here in a notebook) begins by using the unyt_from_dask function to create a new unyt_dask_array instance from a dask array:

from unyt.dask_array import unyt_from_dask
from dask import array as dask_array
x = unyt_from_dask(dask_array.random.random((10000,10000), chunks=(1000,1000)), 'm')
x
Out[2]:  unyt_dask_array<random_sample, shape=(10000, 10000), dtype=float64, chunksize=(1000, 1000), chunktype=numpy.ndarray, units=m>

The array can be manipulated as any other dask array:

result = (x * 2).mean()
result
Out[3]: unyt_dask_array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray, units=m>
result.compute()
Out[4]:  unyt_quantity(1.00009275, 'm')

If the result is an array, we get a unyt_array instead:

(x * 2 + x.to('cm')).mean(1).compute()
Out[8]: unyt_array([1.50646938, 1.48487083, 1.49774744, ..., 1.49939197,
            1.49462512, 1.48263323], 'm')

Unit conversions:

x = unyt_from_dask(dask_array.random.random((10000,10000), chunks=(1000,1000)), 'lb')
x.mean().compute()
Out[9]:
    unyt_quantity(0.50002619, 'lb')
x.in_mks().mean().compute()
Out[10]: unyt_quantity(0.22680806, 'kg')
x.to('mg').mean().compute()
Out[11]: unyt_quantity(226808.06379903, 'mg')
from unyt import g
x.to(g).mean().compute()
Out[12]: unyt_quantity(226.8080638, 'g')

The implementation relies primarily on decorators and a hidden unyt_dask_array._unyt_array attribute to track unit conversions, and requires only minimal modifications to the existing unyt codebase. If a user is running a dask client, all of the above calculations will be executed by that client (see notebook), but the implementation here only needs the dask array subset (i.e., pip install dask[array]).
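The decorator-plus-hidden-tracker approach can be illustrated with a simplified pure-Python mimic. This is a sketch only, not the actual unyt code: the class and helper names (FakeDaskArray, _UnitTracker, _attach_units) are made up for illustration, and the real unyt_dask_array delegates far more to dask and unyt than this does.

```python
import functools


class _UnitTracker:
    """Stand-in for the hidden unyt array that records the current units."""

    def __init__(self, units):
        self.units = units


def _attach_units(method):
    """Wrap a plain-array method so its result carries the tracker's units."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        # Re-attach units to the otherwise unitless result.
        return result, self._tracker.units

    return wrapper


class FakeDaskArray:
    """Minimal stand-in for a dask array, holding a list of numbers."""

    def __init__(self, data):
        self.data = list(data)

    def mean(self):
        return sum(self.data) / len(self.data)


class FakeUnytDaskArray(FakeDaskArray):
    """Subclass whose reductions pick units back up from a hidden tracker,
    analogous in spirit to unyt_dask_array._unyt_array."""

    def __init__(self, data, units):
        super().__init__(data)
        self._tracker = _UnitTracker(units)

    # Decorate the parent-class reduction so the result keeps its units.
    mean = _attach_units(FakeDaskArray.mean)


x = FakeUnytDaskArray([1.0, 2.0, 3.0], "m")
print(x.mean())  # (2.0, 'm')
```

The key point is that the subclass only needs to decorate the methods it inherits; the numerical work is untouched, which is why the approach needs so few changes elsewhere in the codebase.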

Some remaining known issues:

  • reductions return standard dask arrays when using external functions (see note below)
  • dask is added to _on_demand_imports but haven't added it to the testing environment yet, so new tests will fail
  • haven't yet done flake8/etc checks
  • no new docs yet (update: added to the usage page)
  • new tests could use a bit more tweaking
  • squash commits? I have a lot... but would be easy to squash. let me know. (update: chose not to squash)

Note on the issue with dask reductions:

If you do:

from unyt.dask_array import unyt_from_dask
from dask import array as dask_array

x = unyt_from_dask(dask_array.random.random((10000,10000), chunks=(1000,1000)), 'm')
x.min().compute()

You get a unyt_quantity as expected: unyt_quantity(0.50002407, 'm')

But if you use the daskified equivalent of np.min(ndarray):

dask_array.min(x).compute()

You get a plain float: 0.50002407. This isn't much of an issue for simple functions like min, but many more complex functions are not implemented as attributes. Not yet sure what the best approach is here...

Update (8/24) on the dask reductions: I've played around with many approaches focused on manually wrapping all of the dask reductions, but decided that the added complexity is not worth it. Instead, I added a user-facing function, unyt.dask_array.reduce_with_units, which accepts a dask function handle, the unyt array, and any args and kwargs for the dask function; internally it wraps the dask function handle so that units are tracked.
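The wrapping idea behind that function can be sketched in plain Python. This is a simplified mimic, not the actual unyt.dask_array.reduce_with_units implementation; the helper name and the (values, units) tuple representation are made up for illustration.

```python
def reduce_with_units_sketch(func, arr_with_units, *args, **kwargs):
    """Apply a unit-naive reduction `func` to the bare values, then
    re-attach the units that `func` would otherwise drop."""
    values, units = arr_with_units  # (bare data, unit string)
    return func(values, *args, **kwargs), units


# A toy "array with units": min() alone would return a plain 1.0,
# but the wrapper hands the units back alongside the reduced value.
data = ([3.0, 1.0, 2.0], "m")
print(reduce_with_units_sketch(min, data))  # (1.0, 'm')
```

In the real API the caller passes a dask reduction (e.g. a handle like dask.array.min) instead of min, and gets a unit-aware result back instead of a bare tuple.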

standalone package?

One final note: while I've been developing this to be incorporated into unyt, the fact that there are very minimal changes to the rest of the codebase means that this could be a standalone package. Happy to go that route if it seems more appropriate!

@ngoldbaum (Member) commented

Just a heads up that the tests are currently broken and need to be migrated to GitHub Actions. So far I haven't summoned the will to figure out GitHub Actions. Any help on that front from anyone reading this is much appreciated.

@chrishavlin (Author) commented

Looks like a few more minor fixes needed after that merge... should be straightforward.

@chrishavlin chrishavlin added the enhancement New feature or request label Jul 13, 2022
@neutrinoceros neutrinoceros self-requested a review July 13, 2022 17:07
@neutrinoceros (Member) left a comment
Overall great work! Here are a bunch of comments, suggestions, and questions.

unyt/_on_demand_imports.py
@@ -176,3 +176,37 @@ def use(self):


_matplotlib = matplotlib_imports()


class dask_imports(object):
@neutrinoceros (Member) commented:

just a note that I refactored the optional imports to be lazy in #250
Even though there should be no incompatibility with the current implementation, I note that importing dask eagerly has a very noticeable impact on import time (it takes longer than importing sympy). We can resolve that at any point before the next release, but I suppose it'd be easier to keep a releasable state at all times, so it'd be preferable to deal with #250 first.

@neutrinoceros (Member) commented

@chrishavlin I checked that merging this with #200 didn't cause any of their respective tests to fail so we can push forward with this PR now.
I'm ready to approve and merge this if we can resolve my threads with a quick dialogue.

@chrishavlin (Author) commented

Oh, oops. Looks like the minimum dask version for testing caused some problems with Python 3.8. Let me relax that a little...

@chrishavlin (Author) commented

Also, I realized that pickling won't actually work quite right... I spent some time on it thinking it'd be trivial, but it's not. At this point I'd rather leave it to a follow-up (will open an issue in a moment).

@chrishavlin (Author) commented

OK, @neutrinoceros, I think the current version addresses all your comments. As you suggested, I've opened an issue to track the long comment/note on dask reductions, as well as an issue to track the pickle problem.

IMO the pickle issue isn't worth holding up a merge, but if anyone disagrees I'll of course defer. In any case, I'm planning to take a closer look in the next few days -- I'm hoping that it is actually an easy fix and I just wasn't seeing it in my initial attempt today...

@chrishavlin (Author) commented

Oh wait -- I think the one unresolved thing here is whether #250 should go in before this? I'm fine if that's the case.

@neutrinoceros (Member) left a comment

> IMO the pickle issue isn't worth holding up a merge, but if anyone disagrees I'll of course defer. In any case, I'm planning to take a closer look in the next few days

I agree; we can disclose the issue in the next release notes if necessary.

> I think the one unresolved thing here is whether #250 should go in before this?

Not necessarily, I can include the change in my PR too if this one goes in first.

I would merge immediately but I'm going to hold off for the day in case there's anything left to discuss that I missed, or in case you want to make another attempt to fix the pickling issue.
If nothing comes up in the next 24h I'll just merge :)

@chrishavlin (Author) commented

Figured out the pickling! Turns out I just needed to read the pickle docs again rather than try to do it from memory... will push up by the end of today.
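For readers curious what "reading the pickle docs" typically turns up for a case like this: a common pattern for pickling a subclass that carries extra state is to implement `__reduce__` so the extra attribute is rebuilt on unpickling. The sketch below is generic and hypothetical, not necessarily what this PR's fix does; the class name UnitArray is made up for illustration.

```python
import pickle


class UnitArray(list):
    """Toy list subclass carrying a `units` attribute that default
    list pickling would drop."""

    def __init__(self, data, units):
        super().__init__(data)
        self.units = units

    def __reduce__(self):
        # Return (callable, args): unpickling calls
        # UnitArray(data, units), rebuilding both the values and the units.
        return (self.__class__, (list(self), self.units))


x = UnitArray([1, 2, 3], "m")
y = pickle.loads(pickle.dumps(x))
print(list(y), y.units)  # [1, 2, 3] m
```

Without `__reduce__` (or an equivalent hook such as `__reduce_ex__`/`__getstate__`), round-tripping through pickle would lose the units state.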

Labels: enhancement (New feature or request)

Successfully merging this pull request may close: unyt_dask_array pickleability

5 participants