dealing with very large dask graphs #266
Possibly relevant observation: watching the dashboard, it looks like all the tasks are first sent to a single worker, which then redistributes them to the other workers. This seems like a possible failure point.
I started writing this yesterday but was interrupted; here are some thoughts on the general topic: dask/dask#3514
@rabernat Along the lines of the first suggestion in the issue posted above (reduce per-task overhead), let's look at one of the errors from a previous issue:
So that single task carries 2.7 kB of data. Multiplied by 1,000,000 tasks, that alone is at least 2.7 GB of memory. I suspect that identifying and resolving these sorts of inefficiencies throughout the stack is the lowest-hanging fruit.
cc @AlexHilson
This seems like an xarray issue. How could xarray make its tasks smaller?
There is no single thing. Generally you need to construct simple examples, look at the tasks they generate, serialize those tasks with pickle or cloudpickle, and then inspect the results. I'm seeing things about Zarr, cookie policies, tokens, etc. That gives new directions to investigate.
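A minimal sketch of that inspection workflow (the dataset, variable name, and chunking below are placeholders for illustration, not taken from this issue):

```python
# Illustrative sketch: measure how large each task in a dask graph is once
# serialized. Large pickled tasks point at per-task overhead worth chasing.
import cloudpickle
import numpy as np
import xarray as xr

# Build a small, representative dataset and chunk it the way the real
# workload would be chunked (all values here are placeholders).
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.random.rand(100, 180, 360))}
).chunk({"time": 10})

# Pull the underlying dask graph and measure the serialized size of each task.
graph = dict(ds["temp"].data.__dask_graph__())
sizes = {key: len(cloudpickle.dumps(task)) for key, task in graph.items()}

# Eyeball the biggest offenders for things that should not be embedded in
# every task (store objects, credentials/tokens, large metadata, ...).
for key, nbytes in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(key, nbytes, "bytes")
```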
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date. |
Following up on today's earlier discussion with @mrocklin, here is an example of a calculation that is too "big" for our current pangeo.pydata.org cluster to handle.
It's a pretty simple case:
I ran this on a cluster with 100 workers (~600 GB of memory). It got to the scheduler and showed up on the dashboard after ~10 minutes. There were over a million tasks. Workers started to crash and then the notebook crashed.
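(The original code isn't reproduced in this extract. Purely as a rough illustration, a workload of this general shape, opening a large cloud-hosted Zarr store with xarray and reducing over time on a distributed cluster, might look like the sketch below; the scheduler address, bucket path, variable name, and grouping are assumptions, not the original example.)

```python
# Hypothetical sketch only -- the scheduler address, bucket path, and variable
# are placeholders, not the computation from this issue.
import gcsfs
import xarray as xr
from dask.distributed import Client

# On pangeo.pydata.org the scheduler address would come from the deployed
# dask cluster rather than being typed by hand.
client = Client("tcp://scheduler-address:8786")

# Open a large cloud-hosted Zarr store lazily as dask-backed arrays.
fs = gcsfs.GCSFileSystem(token="anon")
store = fs.get_mapper("some-bucket/some-large-dataset.zarr")
ds = xr.open_zarr(store)

# A "simple" reduction over time; with small chunks this can easily
# expand into a graph of over a million tasks.
climatology = ds["sst"].groupby("time.month").mean(dim="time")
result = climatology.compute()
```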
Some thoughts: