Design and prototype for root-ish task deprioritization by withholding tasks on the scheduler #6560
We now have draft PRs for root-task withholding both without co-assignment (#6614) and with co-assignment (#6598).
Obviously, ignoring co-assignment makes the implementation much simpler. The key question is going to be: how important is co-assignment? I imagine we'll want it eventually. But perhaps fixing root task overproduction is such a big win that workloads which currently only succeed thanks to co-assignment will still succeed under a naive root-task-withholding implementation that doesn't co-assign. (Some example workloads to try are in #2602 (comment).) So I think the next step is to do some basic benchmarking between these two PRs. That's what I plan to do today.
If it turns out co-assignment isn't that important, then we can just focus on #6614, and possibly merge it soon under a feature flag (there are still a couple of bugs to fix). If it turns out co-assignment is still critical for some workloads, or offers a huge performance/memory gain, then we probably have to accept some complexity if we are going to withhold tasks at all.

Aside: the question of "maybe we shouldn't withhold tasks then, maybe we should just do STA" has come up. IMO we're probably going to want root task withholding even with STA. Or at least, I don't think we have nearly enough understanding of what the STA heuristics would be to confidently say we won't need root task withholding with STA. One big question with STA will be how to steal/rebalance tasks when workers join or leave. Stealing is already problematic enough on just runnable tasks (#6600). The idea of having nearly every task pre-assigned to a worker and needing to rebalance them all while maintaining consistency seems daunting. A great way to limit the complexity of rebalancing under STA would be to withhold root tasks, which brings the number of tasks workers actually know about, and that need to be rebalanced, down from O(n_tasks) to O(n_threads), which is probably a 10-100x reduction. So that's one reason why I don't think withholding root tasks (as an overall approach) is necessarily a stopgap we'd throw out during a major scheduling overhaul like STA.

Therefore, if benchmarking shows that co-assignment is critical, then we should think about where we'd be willing to accept the complexity (in a task-batch-identifying algorithm, or in a load-rebalancing algorithm). If it isn't critical, then we should get task withholding merged without it, and think about co-assignment as another incremental change sometime in the future.
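To make that O(n_tasks) → O(n_threads) point concrete, here's a minimal sketch of what scheduler-side withholding could look like. This is not the implementation in #6614 or #6598; every name in it (`WithholdingSchedulerSketch`, `saturation_factor`, `worker_joined`, ...) is made up for illustration.

```python
# Hypothetical sketch only: names like `saturation_factor` and `worker_joined`
# are illustrative and are not the API of #6614 / #6598.
import heapq
from collections import defaultdict


class WithholdingSchedulerSketch:
    """Keep unassigned root tasks on the scheduler; each worker only ever
    holds roughly O(n_threads) of them, so a joining worker is "rebalanced"
    simply by popping from the scheduler-side queue -- nothing already
    assigned to other workers has to move."""

    def __init__(self, saturation_factor=1.0):
        self.saturation_factor = saturation_factor
        self.queued_roots = []              # min-heap of (priority, key)
        self.processing = defaultdict(set)  # worker address -> assigned keys
        self.nthreads = {}                  # worker address -> thread count

    def add_root_task(self, priority, key):
        heapq.heappush(self.queued_roots, (priority, key))

    def _fill(self, worker):
        """Assign queued root tasks until the worker reaches its limit."""
        limit = max(1, int(self.nthreads[worker] * self.saturation_factor))
        sent = []
        while self.queued_roots and len(self.processing[worker]) < limit:
            _, key = heapq.heappop(self.queued_roots)
            self.processing[worker].add(key)
            sent.append(key)                # in reality: send the task to the worker
        return sent

    def worker_joined(self, worker, nthreads):
        # Under STA, O(n_tasks) pre-assigned tasks might need to be stolen or
        # re-validated here; with withholding, we just hand out a few queued ones.
        self.nthreads[worker] = nthreads
        return self._fill(worker)

    def task_finished(self, worker, key):
        # A freed slot is what pulls the next withheld root task onto the worker.
        self.processing[worker].discard(key)
        return self._fill(worker)


if __name__ == "__main__":
    sched = WithholdingSchedulerSketch()
    for i in range(1_000):
        sched.add_root_task(priority=i, key=f"load-{i}")
    # Only ~nthreads keys leave the queue; the other ~996 stay on the scheduler.
    print(sched.worker_joined("tcp://a:1234", nthreads=4))
```

The point is just that `worker_joined` only touches the scheduler-side queue, so "rebalancing" never has to move tasks that other workers already know about.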
Benchmarking results

I ran #6614 and #6598 on a few representative workloads. I don't think co-assignment has enough of a memory benefit for it to be an immediate requirement. Adding co-assignment to root-task withholding seems to somewhat reduce runtimes but, confusingly, tends to somewhat increase memory usage. (That's such an odd result that it makes me a little skeptical of my implementation of #6598, but I've tested it and it seems to be correct.)

Across the board, withholding root tasks improves memory usage, often significantly (5x-100x), though occasionally it's the same. It sometimes decreases runtime, and sometimes increases it, by significant factors (up to 2x) both ways.

Bottom line: fixing root task overproduction seems to make workloads feasible that are not feasible today (without absurd memory requirements). Removing co-assignment does not seem to make anything infeasible that works today, just maybe slower.
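For context, the kind of workload where overproduction dominates is a large, embarrassingly-parallel load feeding a much smaller reduction. This is just a generic stand-in for that pattern, not one of the benchmark scripts used above; the cluster and chunking choices are arbitrary.

```python
# Generic stand-in for a root-task-heavy workload; not one of the actual
# benchmarks referenced above. Deliberately sized larger than cluster memory
# (scale the array down for a quick local try).
import dask.array as da
from distributed import Client

if __name__ == "__main__":
    client = Client()  # local cluster here; swap in a remote/Coiled cluster as needed

    # ~2500 independent "root" tasks (the random chunks). Without withholding,
    # the scheduler hands essentially all of them to workers up front, so
    # chunks pile up in memory far faster than the reduction consumes them.
    x = da.random.random((100_000, 100_000), chunks=(2_000, 2_000))
    print(x.mean().compute())
```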
Thanks @gjoseph92 - those are some really interesting results! I definitely think that the improvements in memory footprint are compelling. That said, I would be really interested to see these benchmarks for different (mainly larger) numbers of workers. I haven't looked at any of the implementation in detail, but it would be good to show that these improvements are approximately independent of the number of workers.
Great benchmark results @gjoseph92! I agree with @JSKenyon that we will want a couple more benchmarks to confirm these results, and we'll likely need to iterate on the implementation and review it more thoroughly. I opened a follow-up ticket to pursue the implementation without co-assignment (see the ticket for some explanation of why). Instead of doing more manual benchmarking, I would rather we put this into automated benchmarks (using Coiled and coiled-runtime).
This looks great so far @gjoseph92! For my vorticity example from #6571, is the "something else going on here" related to the "widely-shared dependencies" issue (#5325)? (I'm not sure if you ran the full problem or a simplified version for that benchmarking case.)
First, exciting results. I'm excited. Also, kudos on the clear presentation of those results. If it's easy to share performance reports for the shuffling case, I would be curious to see them.
@gjoseph92 I'm curious, any luck digging into these and seeing what was going on?
There was a small typo in my code for the shuffle workload. So the results showing an enormous decrease in shuffle memory were inaccurate, in that they didn't apply to shuffling. They were accurate in that the workload I accidentally tested (which is also a common workload) did see ~9x lower memory usage without root-task overproduction, and a shorter runtime, but it was definitely mis-named. I've edited #6560 (comment) to correct this.
Root task overproduction is a significant problem for memory pressure and runtime
#6360 has shown that we can deal with the problem in isolation on the scheduler (i.e. without full STA) by limiting the number of tasks assigned to a worker at any given time (see the sketch after these notes for the basic idea)
Implementing this logic might have a significant impact on our scheduling policies, so we want to review the design and implementation thoroughly and perform the necessary performance benchmarks before committing to it
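A minimal sketch of that limiting rule follows; the names and the default factor are placeholders for illustration, not the proposed configuration.

```python
# Placeholder sketch of the "limit tasks per worker" rule described above;
# the name `should_withhold` and the default factor are illustrative only.
def should_withhold(n_processing_on_worker: int, nthreads: int, factor: float = 1.0) -> bool:
    """Return True if a new root task should stay queued on the scheduler
    instead of being assigned to this worker right now."""
    return n_processing_on_worker >= max(1, int(nthreads * factor))


assert should_withhold(n_processing_on_worker=8, nthreads=4)      # worker is saturated
assert not should_withhold(n_processing_on_worker=2, nthreads=4)  # still has capacity
```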
Expectations
Out of scope / Follow up