Redesign worker exponential backoff on busy-gather #6173
Conversation
@@ -1930,6 +1935,8 @@ def transition_missing_fetch(
        ts.state = "fetch"
        ts.done = False
        self.data_needed.push(ts)
        for w in ts.who_has:
            self.data_needed_per_worker[w].push(ts)
Fixes bug where tasks transitioning from missing to fetch would not be picked up by select_keys_for_gather
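For context, a minimal sketch of why the extra push matters (simplified stand-ins, not the actual Worker code; only the names transition_missing_fetch, data_needed_per_worker, and select_keys_for_gather are taken from the PR): select_keys_for_gather only walks the per-worker queues, so a task registered solely in the global data_needed collection is invisible to it.

from collections import defaultdict
from heapq import heappush, heappop

# worker address -> heap of (priority, key); a simplified stand-in for the
# Worker's per-worker HeapSets
data_needed_per_worker = defaultdict(list)

def transition_missing_fetch(key, priority, who_has):
    # On missing -> fetch, register the task with every worker holding a
    # replica so that the per-worker selection below can see it
    for w in who_has:
        heappush(data_needed_per_worker[w], (priority, key))

def select_keys_for_gather(worker, total_bytes, nbytes):
    # Pop keys destined for `worker` until the byte budget is exhausted;
    # tasks absent from data_needed_per_worker[worker] are never considered
    keys, budget = [], total_bytes
    heap = data_needed_per_worker[worker]
    while heap and nbytes(heap[0][1]) <= budget:
        _, key = heappop(heap)
        keys.append(key)
        budget -= nbytes(key)
    return keys

transition_missing_fetch("x", 0, who_has={"tcp://w1:1234"})
assert select_keys_for_gather("tcp://w1:1234", 10**6, nbytes=lambda k: 100) == ["x"]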
-    while L:
-        ts = L.pop()
+    while tasks:
+        ts = tasks.peek()
Fixes bug where:
- a task would make an iteration of select_keys_for_gather exceed total_bytes
- before the fetch from that worker is complete, another task with higher priority is added to data_needed on the same worker
- at the next iteration of ensure_communicating, the task is not picked up by select_keys_for_gather
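A simplified sketch of the peek-vs-pop difference (stand-in data structures, not the Worker's real HeapSet-based loop): with pop(), a task that blows the byte budget has already been removed from the queue and is lost; with peek(), it stays queued, so the next ensure_communicating cycle reconsiders it together with any higher-priority tasks that arrived in the meantime.

import heapq

def select_keys_for_gather(heap, nbytes, total_bytes):
    keys, in_flight = [], 0
    while heap:
        priority, key = heap[0]                    # peek: do not remove yet
        if keys and in_flight + nbytes[key] > total_bytes:
            break                                  # leave it queued for the next pass
        heapq.heappop(heap)                        # commit: safe to remove now
        keys.append(key)
        in_flight += nbytes[key]
    return keys

heap = [(1, "big"), (2, "small")]
heapq.heapify(heap)
# "small" does not fit this round but remains queued for the next call
assert select_keys_for_gather(heap, {"big": 90, "small": 20}, total_bytes=100) == ["big"]
assert heap == [(2, "small")]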
distributed/worker.py (outdated)
# Avoid hammering the worker. If there are multiple replicas
# available, immediately try fetching from a different worker.
self.busy_workers.add(worker)
self.io_loop.call_later(0.15, self._readd_busy_worker, worker)
To be replaced with an async instruction within the scope of #5896
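A self-contained illustration of the backoff pattern in the diff above (not the actual Worker class; the real _readd_busy_worker in distributed/worker.py may differ in detail): mark a peer busy, re-admit it after a fixed cool-down, and re-run the fetch-scheduling logic.

import asyncio

class BusyBackoff:
    def __init__(self, loop: asyncio.AbstractEventLoop):
        self.loop = loop
        self.busy_workers: set[str] = set()

    def mark_busy(self, worker: str, cooldown: float = 0.15) -> None:
        # The peer answered "busy": stop targeting it for a short while
        self.busy_workers.add(worker)
        self.loop.call_later(cooldown, self._readd_busy_worker, worker)

    def _readd_busy_worker(self, worker: str) -> None:
        # The peer may have drained its backlog by now, so let gathers
        # target it again and re-evaluate pending fetches
        self.busy_workers.discard(worker)
        self.ensure_communicating()

    def ensure_communicating(self) -> None:
        pass  # stand-in for the Worker's fetch-scheduling routine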
distributed/worker.py (outdated)
who_has = await retry_operation(
    self.scheduler.who_has, keys=refresh_who_has
)
self.update_who_has(who_has)
Notably, this query to the scheduler does not happen if all workers that are known to hold a replica are in flight. I suppose the difference in treatment is because this worker knows that workers will eventually exit in_flight_workers, while it has no control over busy_workers.
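A hedged, simplified reading of that behaviour (a sketch, not the actual Worker code): go back to the scheduler for fresh replica locations only when every worker known to hold the key has answered "busy"; workers that are merely in flight are not counted, because their gathers will complete and free them without a scheduler round trip.

def should_refresh_who_has(who_has: set[str], busy_workers: set[str]) -> bool:
    # Refresh only when there are known holders and all of them are busy
    return bool(who_has) and who_has <= busy_workers

assert should_refresh_who_has({"tcp://w1", "tcp://w2"}, {"tcp://w1", "tcp://w2"})
assert not should_refresh_who_has({"tcp://w1", "tcp://w2"}, {"tcp://w1"})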
Unit Test Results: 16 files ±0, 16 suites ±0, 7h 27m 28s ⏱️ (−1m 18s). For more details on these errors, see this check. Results for commit 5bffa37. ± Comparison against base commit 370f456.
Generally what's here seems sensible to me. However, I'm also not going deeply into the logic. I'm mostly trusting @crusaderky and the tests. I did have a general question though. There are a couple of occasions where you identify and fix a possible bug. Should we invest more time in creating tests for these?
I think that if @fjetter has ample time on Monday for deeper review then it would be good to wait for that. If that's not the case then I'm comfortable merging.
)["f"] | ||
g = c.submit(inc, f, key="g", workers=[a.address]) | ||
assert await g == 2 | ||
assert_worker_story(a.story("f"), [("receive-dep", lw.address, {"f"})]) |
Unrelated to this PR, but a quick note to @fjetter: I'm totally fine with uses of stories like this one. I like this because it is a very focused assertion statement. It's clear that we care about this specific thing, rather than copying down the entire transition log. It's also easier to understand the intent from a reader's perspective. I get that we're expecting to receive "f" from lw. If this breaks and I have to come fix it in the future, I think I'll be able to quickly understand the point it was trying to get across. I also think that it's unlikely to break for unrelated reasons.
FWIW I think an even better way to assert this would be to assert on the incoming/outgoing transfer logs, since receive-dep is technically not a transition and is only there for 'historic reasons'. Still, I'm OK with this.
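A hedged sketch of that alternative, assuming the worker exposes an incoming_transfer_log whose entries carry a "who" (sender address) and "keys" (key -> nbytes) field; the exact attribute and field names should be checked against the distributed version in use.

# Hypothetical assertion on the transfer log instead of the story; the
# "who" and "keys" field names are assumptions about the log entry format
assert any(
    entry["who"] == lw.address and "f" in entry["keys"]
    for entry in a.incoming_transfer_log
)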
Writing tests for …
Fine by me
I just had a minor question about a test but that's not a blocker.
)["f"] | ||
g = c.submit(inc, f, key="g", workers=[a.address]) | ||
assert await g == 2 | ||
assert_worker_story(a.story("f"), [("receive-dep", lw.address, {"f"})]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think an even better way to assert this would be to assert on incoming/outgoing transfer logs since receive-dep
is technically not a transition and only there for 'historic reasons'. Still, I'm Ok with this
story = b.story("busy-gather")
# 1 busy response straight away, followed by 1 retry every 150ms for 500ms.
# The requests for b and g are clustered together in single messages.
assert 3 <= len(story) <= 7
What's the motivation for changing the "timeout h" assertion to this?
I didn't remove the h timeout? It's on line 1836. There was no count on the number of retries before.
select_keys_for_gather would return fewer keys than it could. Note that I didn't write unit tests for this, as they would be unhealthily complicated to implement now and are best left until after the state machine is broken out of Worker.
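A back-of-the-envelope check of the bounds in the assertion quoted above (a sketch, not test code): per the test comment, there is one busy response straight away, then one retry per 150 ms cool-down while the remote worker stays busy for roughly 500 ms.

initial_busy_response = 1
retry_window = 0.5            # seconds the peer keeps answering "busy"
retry_interval = 0.15         # cool-down before the busy worker is re-added
expected = initial_busy_response + int(retry_window / retry_interval)  # ~4
# Message batching for the two keys and CI timing jitter widen the band,
# hence the loose 3 <= len(story) <= 7 assertion.
assert 3 <= expected <= 7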