What happened
I was very excited about the new annotation feature and wanted to see whether it improves the performance of my workflow.
Basically I have an array that is initially loaded in big chunks by a very memory-consuming function. Then I want to split it into smaller chunks and store them to zarr. Something like the following:
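(A minimal sketch of the workflow; the array shape, chunk sizes, and zarr path below are placeholders for my real ones.)

```python
import dask.array as da

# Create the array in big chunks -- the random sampling stands in for
# my real memory-hungry creation function.
arr = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))

# Split into much smaller chunks and stream them out to zarr.
small = arr.rechunk((1_000, 1_000))
small.to_zarr("output.zarr")
```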
where the initial random sampling stands in for my memory-consuming array-creation function.
I want this to run in a memory-limited environment while still taking advantage of as many CPUs as possible. My idea is to limit the number of array-creation tasks that can execute simultaneously on a worker, and the new resources annotation seems perfect for this job. So I have tried:
approach 1: annotating the "array creation" steps
According to the documentation, the following seems to be the best way to go and should work:
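Roughly the following; the resource name "MEM" and the counts are placeholders for my real setup (one worker with 16 threads and 4 units of "MEM", each array-creation task requesting 1 unit):

```python
import dask
import dask.array as da
from dask.distributed import Client, LocalCluster

# One worker with 16 threads but only 4 units of the "MEM" resource
# (equivalently: dask-worker <scheduler> --resources "MEM=4").
cluster = LocalCluster(n_workers=1, threads_per_worker=16,
                       resources={"MEM": 4})
client = Client(cluster)

# Annotate only the expensive array-creation tasks, so that at most
# 4 of them should be able to run on the worker at any time.
with dask.annotate(resources={"MEM": 1}):
    arr = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))

small = arr.rechunk((1_000, 1_000))
small.to_zarr("output.zarr")
```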
However, the annotations do not seem to be respected. I can see 16 concurrent "random_sample" operations on my dashboard instead of 4, and the worker quickly gets killed due to memory overload.

approach 2: annotating the storing steps
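The code is the same as above, except that the annotation is moved to the storing step (again a sketch with placeholder names):

```python
# Same cluster as above: 16 threads, 4 units of "MEM" on the worker.
arr = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
small = arr.rechunk((1_000, 1_000))

# Annotate only the storing tasks this time.
with dask.annotate(resources={"MEM": 1}):
    small.to_zarr("output.zarr")
```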
This works sort of as expected: at any given time only 4 threads are "active", regardless of the task type. However, this is not ideal since I only want the initial "random sample" tasks to be limited. Moreover, it seems to change the scheduling behavior -- all the "random sample" tasks are carried out before any "store" can happen, so the worker is constantly reading from and writing to disk to cache the random-sample results. According to the task ordering documentation this should not happen in normal cases; I expect each "store" to happen immediately after its dependencies finish.

approach 3: hard-limit the number of threads
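That is, no annotations at all, just capping the worker at 4 threads (again a sketch):

```python
from dask.distributed import Client, LocalCluster
import dask.array as da

# No resource annotations; at most 4 tasks of any kind run at once.
cluster = LocalCluster(n_workers=1, threads_per_worker=4)
client = Client(cluster)

arr = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
small = arr.rechunk((1_000, 1_000))
small.to_zarr("output.zarr")
```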
This works as expected and the task ordering looks correct as well. However, it is not ideal since I want to be specific about which tasks should be limited because of their memory footprint.
What I expect
I expect approach 1 to work, and/or approach 2 to behave "normally" in terms of task ordering.
Environment
- Dask version: 2021.2.0
- Python version: 3.8.1
- Operating System: linux-64, ubuntu/18.04.4
- Install method (conda, pip, source): conda