Dask executor is broken with dask upstream #259
Seems like that check was added in dask/dask#8452 (cc @gjoseph92). Maybe we need a fix similar to https://github.com/dask/dask-ml/pull/898/files#diff-db396642d72db7ff2df3330275627030696e1ae2dbd1972b86b881529cd519b0R131-R134, by providing the layer.
Yes, this code was previously silently creating invalid Delayed objects; now doing so is an error. From skimming the code, I think you could do the same here.
When I try that with dask==2021.08.1, I get:

```
delayed = Delayed(prev_token[0], hlg, layer=stage.name)
E       TypeError: __init__() got an unexpected keyword argument 'layer'
```
Correct. Matching HLG layer names to task keys would be the backwards-compatible change.
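As a minimal sketch of that backwards-compatible approach (not pangeo-forge's actual code; the stage name and function are invented): give the HLG layer the same name as the task key, so the Delayed constructor's layer lookup succeeds on newer dask without needing the `layer=` keyword, which older releases such as 2021.08 do not accept.

```python
from dask.delayed import Delayed
from dask.highlevelgraph import HighLevelGraph

def add(x, y):
    return x + y

key = "stage-0"  # hypothetical stage/task name

# One materialized layer whose name matches the task key it contains.
hlg = HighLevelGraph(
    layers={key: {key: (add, 1, 2)}},
    dependencies={key: set()},
)

# No `layer=` needed: newer dask finds a layer called "stage-0",
# and older dask never looks for one.
delayed = Delayed(key, hlg)
print(delayed.compute())  # 3
```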
If it isn't time-sensitive, I can take a look at this tomorrow or Friday.
Tom, I think I see how to fix it.
Over in #260 there is a naive attempt at that. It passed for me in my local environment with dask 2021.08.1, but it failed after upgrading to dask 2021.12.0.
@rabernat I'm just personally curious: why are you using an HLG for this, as opposed to a simple low-level graph? It's evidently adding some extra complexity, but since you're not using Blockwise or anything special like that, I'm wondering what benefit you get from it.
I don't know. I just vaguely thought it would be better somehow, since our computation certainly fits the HighLevelGraph model well. At the very least, we get a nicer repr for our graph. What would the other possible advantages be?
IMO there are no advantages (besides the repr, I guess) unless you're using non-materialized layers. https://docs.dask.org/en/stable/high-level-graphs.html makes it sound as though there are some special optimizations that are only available by using HLGs, but that's not exactly the case: there are special optimizations available when using non-materialized layers, which can only be used with HLGs. But if you're writing the graphs as dicts, HLGs don't add anything.
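To make that point concrete, here is a small illustration with an invented graph: a plain dict graph and an HLG holding a single materialized layer describe exactly the same tasks, so the scheduler does the same work for both.

```python
import dask
from dask.highlevelgraph import HighLevelGraph

def inc(x):
    return x + 1

dsk = {"a": 1, "b": (inc, "a")}                             # low-level dict graph
hlg = HighLevelGraph({"layer-0": dsk}, {"layer-0": set()})  # same tasks, wrapped in one layer

print(dask.get(dsk, "b"))  # 2
print(dask.get(hlg, "b"))  # 2, identical result; the HLG mainly buys a nicer repr here
```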
Can you say more about non-materialized layers? Would that make our HLG consume less memory? If so, that would be great. The documentation does not really explain what a non-materialized layer is nor how to construct one. Edit for clarity: we are creating the graph as a dict because that's the only way I know how to do it. In our other executors (Prefect, Beam), the mapping operation in each layer is implicit, while in dask it is explicit (every task of the map operation is explicitly represented). Perhaps non-materialized layers would allow us to do the same in dask?
Along with the from_numpy discussion (which xarray calls), you can end up materialising very large graphs for some datasets like the ones in this project. If you mean to slice this up and compute piece by piece, then a good fraction of those graph items need not have been made, which is exactly the sort of thing an HLG layer should be able to handle. Yes, completely agree: the mapping should not, in general, require you to build a dict by hand (although having the option is very useful as an escape hatch).
Put simply, a materialised layer is a wrapper around the previous dict representation, while a non-materialised layer is a prescription for generating the tasks, like the difference between a list and a generator. The latter must still be iterated over in the scheduler, but if you want to subselect from it, you can prune the ranges you are iterating over without making the items. However, the whole HLG/layer thing is complex and not well documented. I think fewer than 5 people have ever tried to make one. I tried once, and only partially succeeded.
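To illustrate the "generator rather than list" idea, here is a hedged sketch of a custom non-materialized layer. The class, function, and key scheme are all invented for illustration; this is enough for the local scheduler, and getting the distributed scheduler to actually exploit the laziness takes more work than shown here.

```python
import dask
from dask.highlevelgraph import HighLevelGraph, Layer

class MapLayer(Layer):
    """Lazily describe 'apply func to every input'; keys look like (name, i)."""

    def __init__(self, name, func, inputs):
        super().__init__()
        self.name = name
        self.func = func
        self.inputs = list(inputs)

    def __getitem__(self, key):
        # Build the task tuple only when this key is actually requested.
        name, i = key
        if name != self.name or not 0 <= i < len(self.inputs):
            raise KeyError(key)
        return (self.func, self.inputs[i])

    def __iter__(self):
        return ((self.name, i) for i in range(len(self.inputs)))

    def __len__(self):
        return len(self.inputs)

    def is_materialized(self):
        return False  # nothing is stored as a dict up front

def inc(x):
    return x + 1

layer = MapLayer("mapped", inc, range(3))
hlg = HighLevelGraph({"mapped": layer}, {"mapped": set()})
print(dask.get(hlg, [("mapped", i) for i in range(3)]))  # (1, 2, 3)
```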
This is the thing we are trying to turn into a layer: pangeo-forge-recipes/pangeo_forge_recipes/executors/base.py, lines 23 to 28 at 6dda957.
It would be great if we did not have to explicitly store each task in the dask graph. Can anyone point me to an example of creating a non-materialized layer?
That is a perfect use case, and I have also asked for a clear doc showing how you would turn a dict comprehension (which could also express that operation) into the equivalent Layer.
Agreed! Hopefully someday it will be documented (dask/dask#7709), but it's very unapproachable right now.
This is the sort of thing you could use Blockwise to represent. Blockwise is kind of like NumPy's broadcasting/vectorization logic, applied to patterns of keys in a dask graph. I'll write you an example of how to use it in a moment, but first I want to clarify something.

I am probably completely misunderstanding pangeo-forge and what we're talking about here, but when I see "pipeline", I think of a pipeline of high-level functions a user is chaining together to produce a dataset, kind of like Prefect, or Airflow, or something like that. Maybe each of those functions itself contains dask code and does all sorts of big array operations.

At the thousand-task level, graph size is just so far from being a problem that even though you could use fancy Blockwise, I don't see the point. Are the graphs of your pangeo-forge pipelines large enough that they're creating problems? Like 100,000+ tasks in the pipeline graph?
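As a hedged sketch of what such a Blockwise layer can look like, the example below uses dask's low-level `dask.blockwise.blockwise` helper. The layer names, function, and block count are invented, and the details of this API have shifted between dask versions, so treat it as an illustration of the idea rather than a drop-in pattern.

```python
import dask
from dask.blockwise import blockwise
from dask.highlevelgraph import HighLevelGraph

def process(x):
    return x + 1

n = 4

# A materialized input layer: one key per "block" of input.
input_layer = {("inputs", i): i for i in range(n)}

# A symbolic layer meaning "apply `process` to every block of `inputs`";
# no per-task entries exist until a scheduler materializes the graph.
mapped = blockwise(
    process, "mapped", "i",   # output name and index pattern
    "inputs", "i",            # input name and matching index pattern
    numblocks={"inputs": (n,)},
)

hlg = HighLevelGraph(
    {"inputs": input_layer, "mapped": mapped},
    {"inputs": set(), "mapped": {"inputs"}},
)

print(dask.get(hlg, [("mapped", i) for i in range(n)]))  # (1, 2, 3, 4)
```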
Yes, we have pipelines that process 100,000+ files (e.g. #151), creating ~5x that many tasks. In the future, we plan to deploy at orders of magnitude larger scale. We have already been struggling with the memory consumption of our Dask graphs (see e.g. #116). We have been able to get around some issues by serializing things more efficiently (e.g. #160), but the explicit representation of each task in the graph will remain a bottleneck. The dask pipeline executor is <50 lines of (hopefully pretty readable) code: pangeo-forge-recipes/pangeo_forge_recipes/executors/dask.py, lines 33 to 78 at 6dda957.
If anyone wanted to make a PR to refactor this to use blockwise, that would be super appreciated! We have a strong test suite, so it should be straightforward to know if it works or not. This is probably not something I'll spend time on myself soon, but it's great to know the option is there.
Still catching up on this thread, but I wanted to comment that I originally wanted to use just HLGs (no low-level dicts), but I ran into some issues building the HLG. We want to, and should be able to, avoid the low-level dicts.
See discussion around pangeo-forge#259 (comment)
I have been successfully nerd-sniped: #261
In our Dask executor, we create a high-level graph as follows:
pangeo-forge-recipes/pangeo_forge_recipes/executors/dask.py, lines 74 to 76 at 6194c72
Our upstream test (dask commit 725110f9367931291a3e68c9d582544cdb032f77) has revealed an error triggered by that code. It looks like Dask does not like how we are creating the Delayed object. Is this a dask regression, or do we need to change our code?
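For context, a minimal reproduction of this class of failure might look like the sketch below. The layer and key names are invented and the exact exception type and message depend on the dask version, but the gist (per the discussion above) is that newer dask checks that a Delayed's layer, which defaults to its key, is present in the HighLevelGraph's layers.

```python
from dask.delayed import Delayed
from dask.highlevelgraph import HighLevelGraph

task_key = "stage-final-0"

# The layer is named differently from the task key it contains.
hlg = HighLevelGraph(
    {"my-layer": {task_key: (sum, [1, 2, 3])}},
    {"my-layer": set()},
)

try:
    # Older dask silently builds an (invalid) Delayed here; newer dask,
    # which checks the key against hlg.layers, raises instead.
    delayed = Delayed(task_key, hlg)
except Exception as exc:  # exact exception type varies by dask version
    print(type(exc).__name__, exc)
```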
@TomAugspurger - any insights here?