Allow very fast keys and very expensive transfers as stealing candidates #7022
Conversation
I'm not sure I'm following the intentions of the test case; I think there is something wrong with it. Please double-check.
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0, 15 suites ±0, 6h 9m 14s ⏱️ (-1m 16s)
For more details on these failures, see this check.
Results for commit 84e3d53. ± Comparison against base commit 1314ebb.
♻️ This comment has been updated with latest results.
Force-pushed from b613fe9 to 80216ba
@@ -1783,7 +1783,7 @@ def __init__(self, scheduler, **kwargs):
         self.last = 0
         self.source = ColumnDataSource(
             {
-                "time": [time() - 20, time()],
+                "time": [time() - 60, time()],
This is obviously unrelated: it slows down the stealing dashboard a bit. A 20s rolling window is very fast.
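For illustration, here is a minimal hedged sketch of the idea behind the diff above, assuming a Bokeh ColumnDataSource that seeds the plot's time axis; everything except the "time" column is a hypothetical placeholder and this is not the actual dashboard code:

```python
from time import time

from bokeh.models import ColumnDataSource

# Minimal sketch (not the actual dashboard code): seeding the "time" column
# so the plot's x-range initially spans the last 60 seconds instead of 20,
# which makes stealing events scroll out of view more slowly.
source = ColumnDataSource(
    {
        "time": [time() - 60, time()],  # previously time() - 20
        "value": [0, 0],                # hypothetical placeholder column
    }
)
```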
LGTM!
distributed/tests/test_steal.py (outdated)
    # distribution
    max_ntasks_on_worker = 0
    for w in workers:
        ntasks_on_worker = len(w.data)
Probably not important, but there's an off-by-one error since every worker should also have the root task on it.
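A hedged sketch of how the count could exclude the shared root key; the key name and the surrounding test context are assumptions, not the actual test code:

```python
# Hypothetical adjustment for the off-by-one noted above: don't count the
# replicated root task when measuring how many result keys a worker holds.
ROOT_KEY = "root"  # assumed key name of the shared root task

max_ntasks_on_worker = 0
for w in workers:
    ntasks_on_worker = sum(1 for key in w.data if key != ROOT_KEY)
    max_ntasks_on_worker = max(max_ntasks_on_worker, ntasks_on_worker)
```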
This entire assertion cannot be exact due to many timing issues. Right now, the one worker with the root task is the one that pulls almost all tasks; see #6573 (comment).
To be clear: you're not expecting this PR to prevent the initial stealing of all the keys, just that it allows them to be stolen back to other workers after the initial bad stealing decision?

Reading the title of this PR, very fast keys and very expensive transfers both sound like bad candidates to steal, so as a gut feeling it seems weird to allow them. However, I also don't think special-casing them the way the stealing code does right now is good, so I do like the way this change reads.
Exactly. The fundamental issue of not allowing them in the first place, if not necessary, is a more involved problem, but I have fixes/additional issues coming up shortly. An explanation of why this even happens can be found in #6573 (comment).
I think the title is indeed a bit misleading. What really happens for fast vs. slow keys is also a bit different.

Fast keys (i.e. the lower threshold on compute_duration)

I don't see a reason why fast keys shouldn't be stolen, assuming there is not a lot of network transfer involved. Sure, we have the latencies and all that, but an important point of work stealing is that it should only allow any tasks to be stolen if there are idle workers. This is currently sometimes violated due to #7002 causing way too aggressive decisions. Overall, if there are indeed empty workers and keys are queued up, why shouldn't we move them somewhere else to increase parallelism?

Costly keys (i.e. upper bound for cost_multiplier)

Frankly, I'm not entirely sure whether this threshold is harmful or not. It's important to point out that the actual stealing decision doesn't happen here but rather in distributed/stealing.py, lines 410 to 413 (at f02d2f9).
How to read this: even by allowing super heavy tasks into the stealable set/dict, they are typically not actually stolen, because this check will usually reject the steal. The entire check lives in distributed/stealing.py, lines 451 to 455 (at f02d2f9).
I slightly altered the PR title to reflect this.
Additionally, regarding very heavy keys, I'm actually still blocking requests if the cost multiplier is not even listed. We're just increasing the threshold.
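To illustrate the point that the decision happens elsewhere, here is a rough, hypothetical sketch of the kind of cost/benefit check described above. It is not the actual distributed/stealing.py logic; all names and the comparison itself are assumptions:

```python
# Hypothetical sketch, not the real Stealing.balance code: even if a very
# transfer-heavy task sits in the stealable collections, the final decision
# compares the benefit on the saturated worker with the cost on the thief
# and will usually reject the steal for large cost multipliers.
def worth_stealing(
    occupancy_saturated: float,  # queued compute time on the busy worker
    occupancy_idle: float,       # queued compute time on the idle worker
    compute_time: float,         # expected runtime of the candidate task
    cost_multiplier: float,      # transfer cost relative to compute time
) -> bool:
    cost_on_thief = compute_time * cost_multiplier
    # Only steal if the busy worker still comes out ahead of the thief
    # after the thief pays the transfer cost.
    return occupancy_saturated - compute_time > occupancy_idle + cost_on_thief
```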
Co-authored-by: Gabe Joseph <[email protected]>

Allow very fast keys and very expensive transfers as stealing candidates (dask#7022)
Co-authored-by: Gabe Joseph <[email protected]>
Co-authored-by: Hendrik Makait <[email protected]>
This is a partial fix for a couple of issues
These issues report a situation where one worker greedily steals all or most of the keys and the cluster is no longer able to correct course afterwards. This inability to correct the imbalance is due to the limits on compute time and cost factor.
The unit test provided ensures that the final computation is indeed distributed roughly uniformly across all workers.
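For context, a hypothetical sketch of what such a uniformity assertion could look like; this is not the test added by this PR, and the task graph, key names, and threshold are assumptions:

```python
from distributed.utils_test import gen_cluster


@gen_cluster(client=True, nthreads=[("", 1)] * 4)
async def test_roughly_uniform_distribution(c, s, *workers):
    # Pin a root task to one worker, then fan out many cheap dependents.
    root = c.submit(lambda: 1, key="root", workers=[workers[0].address])
    futures = c.map(lambda i, r: i, range(200), r=root, pure=False)
    await c.gather(futures)

    counts = [len(w.data) for w in workers]
    # Exact balance is impossible due to timing; only reject gross imbalance,
    # i.e. one worker ending up with almost every result.
    assert max(counts) < len(futures) * 0.75
```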
Unfortunately, this change can have a lot of unknown side effects impacting all kinds of workloads and therefore opens us to the risk of significant regressions for some yet unknown workflows. For instance, we had to blocklist shuffle-split keys because stealing them can cause catastrophic regressions for shuffling.
Apart from causing bad stealing decisions, this change potentially increases the size of the stealable collections significantly, such that a single Stealing.balance will consume more CPU time.

I still believe this is the right thing to do, and I hope that the coiled-runtime benchmarks would notify us if any standard workflows are significantly impaired.
We will shortly follow up with a couple of other changes that should mitigate this.
This change was part of #4920