dealing with very large dask graphs #266
Possibly relevant observation: watching the dashboard, it looks like all the tasks are first sent to a single worker, which then redistributes them to the other workers. This seems like a possible failure point.
I started writing this yesterday but was interrupted; here are some thoughts on the general topic: dask/dask#3514
@rabernat Along the lines of the first suggestion in the issue posted above (reduce per-task overhead), let's look at one of the errors from a previous issue:
So that single task carries 2.7 kB of data. Multiplied by 1,000,000 tasks, that alone is at least 2.7 GB of memory. I suspect that identifying and resolving these sorts of inefficiencies throughout the stack is the lowest-hanging fruit.
cc @AlexHilson
This seems like an xarray issue. How could xarray make its tasks smaller?
There is no single thing. Generally you need to construct simple examples, look at the tasks they generate, serialize those tasks with pickle or cloudpickle, and then inspect the results. I'm seeing things about Zarr, cookie policies, tokens, etc. That gives new directions to investigate.
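A minimal sketch of that inspection workflow (the dataset, variable name, and chunking below are placeholders for illustration, not taken from this issue):

```python
# Illustrative sketch: measure how large each task in a dask graph is once
# serialized. Large pickled tasks point at per-task overhead worth chasing.
import cloudpickle
import numpy as np
import xarray as xr

# Build a small, representative dataset and chunk it the way the real
# workload would be chunked (all values here are placeholders).
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.random.rand(100, 180, 360))}
).chunk({"time": 10})

# Pull the underlying dask graph and measure the serialized size of each task.
graph = dict(ds["temp"].data.__dask_graph__())
sizes = {key: len(cloudpickle.dumps(task)) for key, task in graph.items()}

# Eyeball the biggest offenders for things that should not be embedded in
# every task (store objects, credentials/tokens, large metadata, ...).
for key, nbytes in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(key, nbytes, "bytes")
```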
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date. |
Following up on today's earlier discussion with @mrocklin, here is an example of a calculation that is too "big" for our current pangeo.pydata.org cluster to handle.
It's a pretty simple case:
I ran this on a cluster with 100 workers (~600 GB of memory). It got to the scheduler and showed up on the dashboard after ~10 minutes. There were over a million tasks. Workers started to crash and then the notebook crashed.
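(The original code isn't reproduced in this extract. Purely as a rough illustration, a workload of this general shape, opening a large cloud-hosted Zarr store with xarray and reducing over time on a distributed cluster, might look like the sketch below; the scheduler address, bucket path, variable name, and grouping are assumptions, not the original example.)

```python
# Hypothetical sketch only -- the scheduler address, bucket path, and variable
# are placeholders, not the computation from this issue.
import gcsfs
import xarray as xr
from dask.distributed import Client

# On pangeo.pydata.org the scheduler address would come from the deployed
# dask cluster rather than being typed by hand.
client = Client("tcp://scheduler-address:8786")

# Open a large cloud-hosted Zarr store lazily as dask-backed arrays.
fs = gcsfs.GCSFileSystem(token="anon")
store = fs.get_mapper("some-bucket/some-large-dataset.zarr")
ds = xr.open_zarr(store)

# A "simple" reduction over time; with small chunks this can easily
# expand into a graph of over a million tasks.
climatology = ds["sst"].groupby("time.month").mean(dim="time")
result = climatology.compute()
```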
Some thoughts: