
Best approach to build a single dataset from many numpy arrays in dask #5499

Answered by dcherian
max-sixty asked this question in Q&A


Is this the best overall approach? Anything I'm missing?

One persistent dask problem is that when read and write tasks are decoupled (as in your case), dask tends to run many read tasks at once, resulting in large memory use. The only consistent solution I know of is to couple them, e.g. by putting the to_zarr call inside the delayed function.
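For concreteness, here is a minimal sketch of that coupling (not taken from the thread): the store path, sizes, and load_array helper are hypothetical stand-ins, and the store's metadata is written up front with compute=False so each task can write its own region independently.

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

store = "combined.zarr"                    # hypothetical output path
n_arrays, chunk_len, width = 8, 100, 20    # assumed sizes for illustration

def load_array(i):
    # hypothetical stand-in for however each source numpy array is produced
    return np.random.default_rng(i).random((chunk_len, width))

# Write the full dataset's metadata once, without computing any data,
# so the tasks below can each write their own region.
template = xr.Dataset(
    {"var": (("x", "y"),
             da.zeros((n_arrays * chunk_len, width), chunks=(chunk_len, width)))}
)
template.to_zarr(store, compute=False)

@dask.delayed
def read_and_write(i):
    arr = load_array(i)                                  # read one numpy array
    ds = xr.Dataset({"var": (("x", "y"), arr)})
    # write it straight into its slot in the zarr store, so the loaded array
    # never sits in memory waiting for a decoupled write task
    ds.to_zarr(store, region={"x": slice(i * chunk_len, (i + 1) * chunk_len)})

dask.compute(*[read_and_write(i) for i in range(n_arrays)])
```

Because each delayed task both reads and writes its piece, memory use stays close to one array per running task.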

ds.persist() attempts to bring the whole array back to the scheduler?

On your local machine, this is basically the same as .compute, in that everything gets loaded into local memory. On a distributed cluster, .persist loads the data into distributed RAM (the workers' memory) instead.
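A short sketch of that distinction (the input store name is just an example):

```python
import xarray as xr

ds = xr.open_zarr("combined.zarr")  # lazy, dask-backed dataset

loaded = ds.compute()   # pulls every chunk into local numpy memory

# With the default local scheduler, .persist() also materializes the data in
# local memory, much like .compute(); on a dask.distributed cluster it instead
# keeps the computed chunks in the workers' RAM and returns a dataset that is
# still backed by dask arrays.
persisted = ds.persist()
```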

Will that raise both a) if the operations trigger a remote computation and b) if operations …

Answer selected by max-sixty