
Allow very fast keys and very expensive transfers as stealing candidates #7022

Merged
fjetter merged 6 commits into dask:main from stealing_fixes_compute_ratio on Sep 9, 2022

Conversation

@fjetter (Member) commented Sep 8, 2022

This is a partial fix for a couple of issues.

These issues report a situation where one worker greedily steals all or most of the keys and the cluster is then unable to correct course afterwards. This inability to correct the imbalance is caused by the existing limits on compute time and cost factor.

The unit test provided ensures that the final computation is indeed spread roughly uniformly across all workers.

Unfortunately, this change can have many unknown side effects across all kinds of workloads and therefore exposes us to the risk of significant regressions in as-yet-unknown workflows. For instance, we had to blocklist shuffle-split keys because stealing them can cause catastrophic regressions for shuffling.

Apart from the risk of bad stealing decisions, this change can also significantly increase the size of the stealable collections, such that a single Stealing.balance call consumes more CPU time.

I still believe this is the right thing to do, and I hope the coiled-runtime benchmarks will notify us if any standard workflows are significantly impaired. We will follow up shortly with a couple of other changes that should mitigate this.

This change was part of #4920
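
For context, a minimal sketch of the kind of candidate filter being relaxed; the function name, cutoff values, and level formula below are illustrative approximations, not the literal diff:

import math

def steal_candidate_level(compute_time, transfer_cost):
    """Illustrative only: bucket a task by its transfer-cost/compute-time ratio,
    or return None to exclude it from the stealable collections entirely."""
    if compute_time < 0.005:       # lower limit on compute time (placeholder value)
        return None                # "very fast keys" were excluded before this change
    cost_multiplier = transfer_cost / compute_time
    if cost_multiplier > 100:      # upper limit on the cost factor (placeholder value)
        return None                # "very expensive transfers" were excluded as well
    # This PR relaxes the two exclusions above so that such keys at least become
    # candidates; whether they are actually stolen is decided later by the
    # occupancy check discussed further down in this thread.
    return max(0, int(round(math.log2(max(cost_multiplier, 1e-3)) + 6)))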

@hendrikmakait (Member) left a comment

I'm not sure I'm following the intention of the test case; I think there is something wrong with it. Please double-check.

Three review threads on distributed/tests/test_steal.py (outdated, resolved).
@github-actions bot (Contributor) commented Sep 8, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0        15 suites  ±0    6h 9m 14s ⏱️ −1m 16s
  3 101 tests  +1     3 014 ✔️ −1      85 💤 +1    2 ❌ +1
 22 954 runs   +8    22 046 ✔️ +5     906 💤 +2    2 ❌ +1

For more details on these failures, see this check.

Results for commit 84e3d53. ± Comparison against base commit 1314ebb.

♻️ This comment has been updated with latest results.

@@ -1783,7 +1783,7 @@ def __init__(self, scheduler, **kwargs):
         self.last = 0
         self.source = ColumnDataSource(
             {
-                "time": [time() - 20, time()],
+                "time": [time() - 60, time()],
@fjetter (Member, PR author) commented:

This is obviously unrelated; it slows down the stealing dashboard a bit. A 20s rolling window is very fast.

@hendrikmakait (Member) left a comment

LGTM!

Another review thread on distributed/tests/test_steal.py (outdated, resolved).
# distribution
max_ntasks_on_worker = 0
for w in workers:
    ntasks_on_worker = len(w.data)
A collaborator commented:

Probably not important, but there's an off-by-one error, since every worker should also have the root task on it.

@fjetter (Member, PR author) replied:

This entire assertion cannot be exact due to many timing issues. Right now, the one worker with the root task is the one that pulls in almost all tasks; see #6573 (comment).
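
In other words, something like the following tolerant check; the helper name and tolerance are hypothetical, not taken from the actual test:

def assert_roughly_uniform(task_counts, tolerance=2.0):
    """Hypothetical helper: the distribution cannot be exact, so only require
    that no single worker hoards far more than its fair share of tasks."""
    total = sum(task_counts)
    fair_share = total / len(task_counts)
    assert max(task_counts) <= tolerance * fair_share, (
        f"one worker holds {max(task_counts)} of {total} tasks"
    )

# Passes: 4 workers sharing 100 tasks fairly evenly (an off-by-one from the
# root task does not matter at this granularity).
assert_roughly_uniform([30, 25, 25, 20])
# Would fail: one worker kept almost everything.
# assert_roughly_uniform([94, 2, 2, 2])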

@gjoseph92 (Collaborator) commented:

> These issues do report a situation where one worker greedily steals all/most of the keys and the cluster is no longer able to correct course afterwards

To be clear: you're not expecting this PR to prevent the initial stealing of all the keys. Just that you expect it to allow them to be stolen back to other workers after the initial bad stealing decision?

Reading the title of this PR, very fast keys and very expensive transfers both sound like bad ideas to steal. So as a gut feeling, it seems weird to allow them. However, I also don't think special-casing them the way the stealing code does right now is good, so I do like the way this change reads.

@fjetter (Member, PR author) commented Sep 9, 2022

> To be clear: you're not expecting this PR to prevent the initial stealing of all the keys. Just that you expect this to allow them to be stolen again back to other workers after the initial bad stealing decision?

Exactly. The fundamental issue of not allowing them to be stolen in the first place, if it's not necessary, is a more involved problem, but I have fixes and additional issues coming shortly.

An explanation of why this even happens can be found in #6573 (comment).

> Reading the title of this PR, very fast keys and very expensive transfers both sound like bad ideas to steal. So as a gut feeling, it seems weird to allow them.

I think the title is indeed a bit misleading. What really happens for fast vs. slow keys is also a bit different.

Fast keys (i.e. the lower threshold on compute_duration)

I don't see a reason why fast keys shouldn't be stolen, assuming there is not a lot of network transfer involved. Sure, there are latencies and all that, but an important point of work stealing is that it should only allow tasks to be stolen if there are idle workers. This is currently sometimes violated due to #7002, causing far too aggressive decisions.

Overall, if there are indeed empty workers and keys are queued up, why shouldn't we move them somewhere else to increase parallelism?
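
As a back-of-the-envelope illustration with made-up numbers: spreading a long queue of fast keys over idle workers shortens the makespan as long as the per-key transfer overhead stays small:

n_tasks, task_time, transfer_overhead, n_workers = 1_000, 0.001, 0.0005, 4

# All keys queued on a single worker: nothing runs in parallel.
makespan_no_steal = n_tasks * task_time                        # 1.000 s

# Keys spread evenly over all workers; each stolen key pays a small
# (made-up) transfer overhead on top of its compute time.
per_worker = n_tasks / n_workers
makespan_steal = per_worker * (task_time + transfer_overhead)  # ~0.375 s

print(f"{makespan_no_steal:.3f}s without stealing vs ~{makespan_steal:.3f}s with stealing")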

Costly keys (i.e. upper bound for cost_multiplier)

Frankly, I'm not entirely sure whether this threshold is harmful or not. It's important to point out that the actual stealing decision doesn't happen at this threshold but rather in the following check:

occ_thief = combined_occupancy(thief)
occ_victim = combined_occupancy(victim)
if occ_thief + cost_multiplier * duration <= occ_victim - duration / 2:

How to read this: a key is allowed to be stolen iff the thief's combined occupancy (i.e. including pending stealing requests) plus the expected transfer cost (cost_multiplier * duration) is smaller than the victim's new combined occupancy after the key has been removed. (This actually relies on occupancy, here duration, including a network component; see #7003 and #7004.)
To the best of my understanding at this moment, the factor of 0.5 implements a soft hysteresis, but I'm not entirely sure about this factor yet.

Therefore, even though we now allow super-heavy tasks into the stealable set/dict, they are typically not actually stolen, because this check will usually reject the steal.
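
As a toy illustration of that rejection, with made-up occupancy numbers rather than anything from the scheduler:

def would_steal(occ_thief, occ_victim, duration, cost_multiplier):
    # The criterion quoted above, evaluated on made-up inputs.
    return occ_thief + cost_multiplier * duration <= occ_victim - duration / 2

# Cheap-to-move key: a nearly idle thief and a busy victim -> steal allowed.
print(would_steal(occ_thief=0.1, occ_victim=10.0, duration=1.0, cost_multiplier=1.5))   # True

# Same situation, but the transfer costs 50x the compute time -> steal rejected.
print(would_steal(occ_thief=0.1, occ_victim=10.0, duration=1.0, cost_multiplier=50.0))  # False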

The entire steal_time_ratio is, in the end, a performance optimization that approximately sorts keys by their attractiveness for stealing and buckets them (see cost_multiplier) such that we can start our loops at approximately sorted buckets:

for level, cost_multiplier in enumerate(self.cost_multipliers):
    if not idle:
        break
    for victim in list(saturated):
        stealable = self.stealable[victim.address][level]

I slightly altered the PR title to reflect this.

@fjetter fjetter changed the title Allow very fast keys and very expensive transfers to be stolen Allow very fast keys and very expensive transfers as stealing candidates Sep 9, 2022
@fjetter (Member, PR author) commented Sep 9, 2022

Additionally, regarding very heavy keys: I'm actually still blocking requests if the cost multiplier is not even listed. We're just increasing the threshold.
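
A hypothetical sketch of what "not even listed" means here; a list like this exists as self.cost_multipliers (see the loop quoted above), but the boundary values below are placeholders:

# Placeholder bucket boundaries, made up for illustration.
cost_multipliers = [1, 1.5, 2, 3, 4.5, 6.75, 10, 15]

def classify(cost_multiplier):
    """Return the stealable-bucket level for a key, or None if its transfer
    cost exceeds even the largest listed multiplier (still not a candidate)."""
    for level, bound in enumerate(cost_multipliers):
        if cost_multiplier <= bound:
            return level
    return None

print(classify(2.5))   # 3    -> fits into the 3x bucket
print(classify(50.0))  # None -> heavier than anything listed, still blocked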

@fjetter fjetter merged commit 77aeecb into dask:main Sep 9, 2022
@fjetter fjetter deleted the stealing_fixes_compute_ratio branch September 9, 2022 11:40
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022