---
I've almost exclusively used & contributed to xarray for in-memory workloads, and I've just started using it more seriously with dask, in an effort to see whether it can replace spark for some areas. It's been harder than I expected. While it's mostly working now, I'm not sure I'm doing the correct thing, and I've blown up my memory or had to restart the dask scheduler many times. I'm a bit nervous that others with more spark than dask/xarray experience may struggle with the same debugging. Part of the improvement would be to calibrate the dask settings, but part may also be that I'm not following best practices in xarray. So, here's the case:
So far my approach has been, conceptually, to a) run each function with dask distributed, writing each array to a distributed filesystem in a format that xarray+dask can read, b) read each array into a dask-backed dataset and concat them into a single dataset, and c) write this whole dataset in dask as a single object which can later be read. I've put a full example below. While it's not exactly minimal, hopefully its fullness will make up for that (and since the question here is about the overall approach, I think it does require the narrative).

### Run each function & write the result

Toy function for the example:

```python
import xarray as xr
import numpy as np


def func(v):
    return xr.Dataset(
        dict(
            a=xr.DataArray(np.ones((5000, 5000)), dims=["x", "y"]) * v,
            b=xr.DataArray(np.ones((5000, 5000)), dims=["x", "y"]) * v * -1,
        ),
        coords=dict(v=[v]),
    )
```

A function to pull an array and write it to a zarr file:

```python
DATA_PATH = "/mnt/dask-test/"


def write_zarr(v):
    ds = func(v)
    ds.chunk(dict(x=-1)).to_zarr(
        f"{DATA_PATH}/chunks/{v}.zarr", mode="w", consolidated=True
    )
```

Then run this over all 10K values:
```python
import dask
import socket
from dask.distributed import Client

client = Client(f"tcp://{socket.gethostname()}:8786")

tasks = [dask.delayed(write_zarr)(v) for v in range(10_000)]
futures = client.compute(tasks)
```

### Read the arrays into dask-backed arrays and concat them into a single dataset

Either (a) in a single call:

```python
ds = xr.open_mfdataset(f"{DATA_PATH}/chunks/*.zarr", engine="zarr", chunks="auto")
```

or (b) in a loop:

```python
dss = {}
for v in range(10_000):
    dss[v] = xr.open_zarr(
        f"{DATA_PATH}/chunks/{v}.zarr", consolidated=True, chunks="auto"
    ).persist()
ds = xr.concat(list(dss.values()), dim="v", data_vars=["a", "b"])
```
### Write this whole dataset in dask as a single object

```python
future = ds.to_zarr(
    f"{DATA_PATH}/single/X.zarr", mode="w", consolidated=True, compute=False
)
client.compute(future)
```

...and as a test, read it back out again...

```python
ds = xr.open_zarr(f"{DATA_PATH}/single/X.zarr")
```

...and it works! But:

```python
ds.persist()
```

...fails with a memory error, after attempting to bring the whole dataset into memory in the dask scheduler on my local machine. (It also sometimes failed with a …)

### Salient questions
I realize this is very long; any help with any part of it is greatly appreciated, and there's definitely no need to respond to them all.

Other links:
Versions:
---
I made some progress here which significantly narrows the problem:
The main outstanding questions are:
---
One persistent dask problem is that when read and write tasks are decoupled (as in your case), dask tends to run a lot of read tasks at once, resulting in large memory use. The only consistent solution I know is to couple them together, e.g. by sticking the `to_zarr` bit in the delayed function.
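For concreteness, here's a hypothetical sketch of one way to do that coupling for the final combined-store write, reusing `DATA_PATH` and the concatenated `ds` from the question. It writes the combined store's metadata up front, then has each delayed task read one small store and immediately write it into its own `region` of the combined store. The `region` approach and the `copy_one` helper are illustrative, not the only way to couple them, and this assumes the combined store is chunked with size 1 along `v`:

```python
import dask
import xarray as xr
from dask.distributed import Client

client = Client("tcp://localhost:8786")  # hypothetical scheduler address

# Write only the metadata (and in-memory coordinates) for the combined
# store; compute=False means no array data is computed or written yet:
ds.to_zarr(f"{DATA_PATH}/single/X.zarr", mode="w", consolidated=True, compute=False)


@dask.delayed
def copy_one(i):
    # Read one small store and write it straight into its own slice of the
    # combined store, so every read is paired with a write in the same task.
    chunk = xr.open_zarr(f"{DATA_PATH}/chunks/{i}.zarr", consolidated=True)
    # Give the data variables an explicit `v` dimension so they match the
    # combined store's layout, then write only this task's slice along `v`:
    chunk = chunk.drop_vars("v").expand_dims(v=[i])
    chunk.to_zarr(f"{DATA_PATH}/single/X.zarr", region={"v": slice(i, i + 1)})


futures = client.compute([copy_one(i) for i in range(10_000)])
```

With this shape of graph, a read task's memory is freed as soon as its paired write finishes, so memory use stays roughly proportional to the number of concurrently running tasks rather than to the whole dataset.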
On your local machine, `.persist` is basically the same as `.compute` in that everything gets loaded into local memory. On a distributed cluster, `.persist` loads into distributed RAM.
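A quick sketch of that distinction (with `ds` standing in for any dask-backed dataset):

```python
# .compute() materializes the whole dataset as numpy arrays in the calling
# process and returns a new, fully in-memory Dataset:
local_ds = ds.compute()

# .persist() returns a dataset that is still dask-backed, but computation is
# kicked off and the finished chunks are cached. On a distributed cluster the
# chunks live in the workers' memory; with the local scheduler the only place
# for them is the local process, so it behaves much like .compute():
persisted = ds.persist()
```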
Yes. It will raise on both …
Not sure what this means. Do you have a small example?
Yeah I think …