Poor work scheduling when cluster adapts size #4471
@fjetter I'm starting to dig through the scheduler code. I can mitigate this specific issue by assigning tasks to workers without the smarts of distributed/scheduler.py lines 2122 to 2136 (at 77a1fd1).
Looking at the "after" picture (following your scheduler modifications) I see a lot of red. I think this makes sense, since part of that first conditional branch is putting tasks on the same workers as their dependencies. But I suspect that the long-running-task part of this is playing a role, too -- the (lack of) task occupancy on the unused workers should be encouraging the scheduler to send tasks there. There may be an update from the workers missing here (or just out of place).
Yeah, I noticed that, too. I do see some red in the "before" plot, though it is sparser. I couldn't actually tell if things were worse -- but I suspect, as you do, that they should be. If I use map_blocks instead of map_overlap (removing the func2 stage to be more fair, as it gets fused in the map_blocks case), the graph looks very efficient in the "before" case. Could that argue that updates are being done correctly? Could it be that the dask.order ordering created for a map_overlap graph is not well-suited to the work-stealing/update algorithm (in this use case of a changing cluster size)?
And that does point to something in the stealing heuristics that doesn't perform ideally when there are a bunch of linked + long-running tasks. Edit: I said stealing but I meant scheduling, though stealing might also play a part here.
I tried following what the scheduler is doing (with stealing turned off), but didn't get very far. I have found a workaround, though, via a tip from @shoyer. By using the graph manipulation API, I can coerce the gnarly overlap section of the graph to complete before doing any mapping. It means breaking the computation up with wait_on, roughly as sketched below.
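A minimal sketch of that workaround (the array shape, chunking, and the smooth_func/long_func names are placeholders, not the code from my actual pipeline):

```python
import dask.array as da
from dask.graph_manipulation import wait_on  # needs a reasonably recent dask


def smooth_func(block):
    return block  # stand-in for the real per-block overlap computation


def long_func(block):
    return block  # stand-in for the real long-running per-block work


arr = da.random.random((100, 512, 512), chunks=(1, 256, 256))

# The gnarly overlap section of the graph
overlapped = arr.map_overlap(smooth_func, depth=(0, 8, 8))

# Force every chunk of the overlap section to finish before any downstream
# mapping task is scheduled (with a single collection, wait_on hands the
# rebuilt collection straight back)
overlapped = wait_on(overlapped)

result = overlapped.map_blocks(long_func)
```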
This seems to distribute work better on some moderate test cases. I only have limited time to tackle my dask-related issues this week. If I can fix a memory leak, maybe I can come back to take a deeper look at this.
FWIW, when the solution with wait_on is deployed on a cluster with auto-scaling (at least in my hands), I do not see the same improvement. I'm not sure exactly what could be different. For now, I just manually provision/deprovision and avoid autoscaling for these calculations.
I started peeking into this issue. First of all, thanks @chrisroat for the reproducer, this is incredibly helpful! My naive expectation for this workload is that work stealing would rebalance the small tasks onto the newly added workers, and that newly scheduled tasks would be assigned to the idle workers.
Neither is happening :'(

Task stealing for this workload is effectively disabled, since all tasks are incredibly small and the workstealing engine doesn't consider it worthwhile to move anything: network latencies would objectively make the computation of a single task much more costly than if it was computed on the worker it was initially assigned to. This is obviously a heuristic, but it usually works well. I think we're missing a measure of opportunity cost, i.e. "what is the cost of not moving the task", or put differently, "how much is the entire graph delayed if this task is not moved?". I don't know yet how we could compute something like this efficiently, but I'll think about it for a bit; I think that would be a valuable addition to work stealing in general if we could approximate it efficiently.

The fact that the decision after scale-up is less than optimal is caused by us strictly preferring to assign a task to the workers which hold its dependencies.

Both are examples where we make decisions which are arguably good micro optimizations but poor macro optimizations. That's a very interesting problem :)
It's interesting that I've also had to solve this problem in #4899. See this comment and distributed/scheduler.py lines 7571 to 7576 (at 0fbb75e).
I wonder if there's a way, in all cases, that we could consider candidates beyond just the workers that have the dependencies (or at least do this when the size of the dependencies is small, by some heuristic). The objective function should still select workers that have the dependencies when it's reasonable to do so (since they are penalized less for data transfer), so I feel like it would strictly be an improvement in worker selection, though it might slow the decision down.
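To illustrate the tradeoff, here is a toy sketch of the two candidate-selection strategies (simplified invented helpers, not the actual Scheduler.decide_worker code):

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    occupancy: float = 0.0                 # seconds of work already queued
    has_data: set = field(default_factory=set)


def decide_worker_locality_first(deps: set, workers: list[Worker]) -> Worker:
    # Restrict candidates to workers already holding a dependency
    # (the "keep near dependencies" behaviour discussed above).
    candidates = [w for w in workers if w.has_data & deps]
    pool = candidates or workers
    return min(pool, key=lambda w: w.occupancy)


def decide_worker_all_candidates(deps: set, workers: list[Worker],
                                 dep_nbytes: float = 0.0,
                                 bandwidth: float = 100e6) -> Worker:
    # Consider every worker, but penalize those that would need a transfer.
    def objective(w: Worker) -> float:
        transfer = 0.0 if w.has_data & deps else dep_nbytes / bandwidth
        return w.occupancy + transfer
    return min(workers, key=objective)
```

With tiny dependencies the transfer penalty is negligible, so the second strategy will happily pick an idle, freshly added worker, while the first keeps piling work onto the workers that already hold the data.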
Are there any parameters for work stealing or graph hinting that could mitigate this issue (even if it were in a fork where I add my own env var or something like that)? As an example, I have tried to hint at the long tasks, hoping that would outweigh the "keep near dependencies" effect:
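One knob along these lines is the default-task-durations config (a sketch only; the func2 prefix and the 10s value are placeholders for the actual long task):

```python
import dask

# Tell the scheduler up front that tasks with this key prefix are long-running,
# so that occupancy estimates reflect their real cost before the first one has
# completed. This likely needs to be set in the scheduler's environment
# (before the scheduler starts) to take effect.
dask.config.set({
    "distributed.scheduler.default-task-durations": {"func2": "10s"}
})
```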
OK, so there is a possible solution for this up in #6115. Comments about what I did and why are in the opening comment / commit message. I don't know yet if this should go in (there was a reason why we turned off stealing for very small tasks), but there is some reason to do it. To summarize what was happening: map-overlap computations introduce lots of tasks with very short durations. Dask's workstealing is greedy, and strictly avoids stealing any task that takes less than 5ms. It also sharply penalizes stealing by adding a 100ms network latency into its heuristic. By changing these policies, this computation starts to flow smoothly again. It's not clear, though, whether this resolves the broader problem. @chrisroat @bnaul if this is still an active topic for you then I encourage you to try out that branch.
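Roughly, the stealing heuristic being described works like this (a simplified sketch, not the actual distributed.stealing code; the bandwidth figure is an assumption):

```python
LATENCY = 0.1        # the ~100 ms latency penalty mentioned above
BANDWIDTH = 100e6    # assumed bytes/second between workers


def steal_time_ratio(dep_nbytes: int, compute_time: float):
    """Cost of moving a task relative to just computing it, or None if it
    is considered too cheap to ever be worth stealing."""
    if compute_time < 0.005:      # the ~5 ms cutoff mentioned above
        return None
    transfer_time = dep_nbytes / BANDWIDTH + LATENCY
    return transfer_time / compute_time


# A 10-second task with tiny inputs looks very stealable...
print(steal_time_ratio(dep_nbytes=10_000, compute_time=10.0))    # ~0.01
# ...but the short map-overlap tasks are excluded outright.
print(steal_time_ratio(dep_nbytes=10_000, compute_time=0.001))   # None
```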
One thing to note is that as the inputs climb in size the argument to steal becomes weaker and weaker. There is still a greedy-vs-long-term-view issue that we're not resolving here. My change in that PR fixes the reproducer here, but I'd be curious to learn if it fixes other issues in practice. |
Very interesting... will give it a shot. A couple of interesting quirks of our usage that might also be of interest:
What happened:
When a cluster is autoscaled in increments while running, as can happen in a GKE cluster, the work concentrates on a few workers and uses the cluster inefficiently. This seems to be worse when there are long-running tasks. The example below simulates this by adjusting a local cluster's size while it is processing a graph with 10-second tasks.
The image below shows the final look of the task graph, and the animated gif shows the status screen as the cluster processes the graph. Many workers do zero long tasks, and only a few workers seem to be fully utilised.
If the cluster is initially set to 20 workers with no changes, work is distributed evenly and all workers are efficiently used.
@fjetter In this example, the workers under load are green a lot.
What you expected to happen:
After the cluster scales up, the work should be evenly divided among all the workers.
Minimal Complete Verifiable Example:
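A sketch of the shape of the reproducer (the array size, chunking, sleep duration, and scaling schedule are illustrative placeholders):

```python
import time

import dask.array as da
from dask.distributed import Client, LocalCluster


def long_task(block):
    time.sleep(10)        # simulate the long-running per-block work
    return block


if __name__ == "__main__":
    # Start small, as an autoscaler would
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    arr = da.random.random((100, 256, 256), chunks=(1, 128, 128))
    graph = arr.map_overlap(lambda x: x, depth=(0, 8, 8)).map_blocks(long_task)
    future = client.compute(graph.sum())

    # Grow the cluster in increments while the graph runs,
    # mimicking GKE autoscaling
    for n_workers in (5, 10, 20):
        time.sleep(30)
        cluster.scale(n_workers)

    print(future.result())
```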
Anything else we need to know?:
In a real cluster with my real load (not the simulation above), I will also see the scheduler CPU pegged near 100% (possibly due to #3898), even when all workers are working on the long tasks. This seems odd, since nothing in the scheduling is being actively changed.
Environment:
Final task screenshot:
Movie of workload (at 10x speed):