Add back Worker.transition_fetch_missing #6112
Conversation
@fjetter I'd like your review on this if possible.
end="2000-01-10", | ||
) | ||
s = df.shuffle("id", shuffle="tasks") | ||
await c.compute(s.size) |
Kudos to @gjoseph92 and @nils-braun for the test
@pytest.mark.slow
@gen_cluster(client=True, Worker=BreakingWorker)
async def test_broken_comm(c, s, a, b):
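(BreakingWorker itself is not shown in this excerpt. A plausible shape, purely as an assumption for readers, is a Worker subclass whose get_data handler raises OSError to simulate a dropped comm:)

# Hypothetical sketch; the real BreakingWorker in the test may differ.
from distributed import Worker

class BreakingWorker(Worker):
    broke_once = False

    async def get_data(self, comm, **kwargs):
        if not self.broke_once:
            self.broke_once = True
            raise OSError("fake error")  # simulate a broken transfer
        return await super().get_data(comm, **kwargs)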
I am somewhat ok with this test, since it does reliably trigger the behavior. But I think @fjetter was hoping to see a more minimized case.
I agree with that desire. I encourage folks to work on that. I think that this suffices.
This test does not reliably trigger the condition for me. I do hit it, but it is not deterministic.
I can increase the data volume and it will become more and more likely. I don't have a deterministic test; I think it would be good to have one. I think that this suffices, though.
@@ -1929,6 +1930,14 @@ def transition_flight_missing(
    ts.done = False
    return {}, []

def transition_fetch_missing(
    self, ts: TaskState, *, stimulus_id: str
) -> RecsInstrs:
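The method body is not part of this excerpt. A minimal sketch of what it plausibly does, mirroring transition_flight_missing above and the invariants in validate_task_missing quoted below (the exact bookkeeping is an assumption):

def transition_fetch_missing(
    self, ts: TaskState, *, stimulus_id: str
) -> RecsInstrs:
    # Sketch only: move the task into the "missing" bookkeeping.
    ts.state = "missing"
    self._missing_dep_flight.add(ts)  # validate_task_missing expects this
    ts.done = False
    return {}, []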
I'd like to see some assertions about ts in here. Likely at least:

assert not ts.done  # ??? really not sure about this. I find `done` confusingly named.
if self.validate:
    assert ts.state == "fetch"
    assert not ts.who_has
This is in validate_task_fetch already, which gets called at the end of every transitions call. I think that we're safe here.
You mean validate_task_missing?
Both validation methods assert the correct (I think) who_has state:

def validate_task_fetch(self, ts):
    assert ts.key not in self.data
    assert self.address not in ts.who_has
    assert not ts.done
    assert ts in self.data_needed
    assert ts.who_has
    for w in ts.who_has:
        assert ts.key in self.has_what[w]
        assert ts in self.pending_data_per_worker[w]

def validate_task_missing(self, ts):
    assert ts.key not in self.data
    assert not ts.who_has
    assert not ts.done
    assert not any(ts.key in has_what for has_what in self.has_what.values())
    assert ts in self._missing_dep_flight

So after this transition is called, the validate_task_missing method will be called and verify that not ts.who_has.
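For readers following along: per-state validation is dispatched from a single entry point after each transition, which is why the assertions above run automatically. A rough sketch of that dispatch (the name validate_task and the structure shown here are assumptions based on the discussion, not a quote of the implementation):

def validate_task(self, ts):
    # Sketch: route to the per-state validator once a transition completes.
    if ts.state == "fetch":
        self.validate_task_fetch(ts)
    elif ts.state == "missing":
        self.validate_task_missing(ts)
    # ... other states elided ...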
Hrm, it looks like we are asserting the incoming state by default though.
This seems a little odd to me given how few lines of code there are in between getting the state and choosing this method (below for convenience):

start = ts.state
func = self._transitions_table.get((start, cast(str, finish)))

I'd be happy to add it for now as convention if you like.
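For context, a condensed sketch of how that lookup turns into a hard failure when a (start, finish) pair is not registered. This is simplified; the real _transition does more bookkeeping, and the error message is taken from the CI traceback quoted later in this thread:

def _transition(self, ts, finish, *, stimulus_id):
    start = ts.state
    func = self._transitions_table.get((start, cast(str, finish)))
    if func is None:
        # e.g. "Impossible transition from fetch to missing for <key>"
        raise InvalidTransition(
            f"Impossible transition from {start} to {finish} for {ts.key}"
        )
    recs, instructions = func(ts, stimulus_id=stimulus_id)
    if self.validate:
        self.validate_task(ts)
    return recs, instructions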
It looks like it's inconsistently done. I'm going to pass on this one unless you feel strongly about it. If you do, speak up and I'll add it. I'm probably hesitating here just because it feels weird to keep doing something just because we've been doing it.
I strongly suggest not to rely on these validate calls. They are helpful but do not replace testing. They raise an exception. The exception is lost in tornado and the only thing we see is an error log. Sometimes that causes the tests to get stuck but it's not reliable.
I haven't seen these problems recently, but it was a big problem a few months ago.
> I strongly suggest not to rely on these validate calls. They are helpful but do not replace testing.

We should talk about this in more depth if you want to rely less on them. Validation testing has, historically, been invaluable in maintaining correctness and stability.
This is still testing. The difference now is that the tests themselves trigger certain behaviors, and assertions are checked in a more systematic way. It is, I think, a better way of verifying state than explicitly checking state in every test, which would be an inefficient way of writing our tests.

> The exception is lost in tornado and the only thing we see is an error log. Sometimes that causes the tests to get stuck but it's not reliable

If tests are passing even when these validations fail, then that's certainly an issue and we should address it quickly. You might not be saying this, though. If these aren't as ergonomic as we'd like, then let's see if we can make them more ergonomic.
Alternatively, if we have a good alternative to the validation methods then I'm happy to engage. I would be -1 on putting explicit state testing like this into all of the tests, though. I'm curious to learn what other alternatives there might be. Could I ask you to raise an issue with your thoughts and we can move the conversation there?
I agree with you, they are useful and I don't want to get rid of them. I'm just saying that we cannot blindly rely on them at this point in time. The way our concurrency model with tornado works is that AssertionErrors are just lost and logged. Sometimes that causes a worker to deadlock, which is then a good thing because the test times out and fails. However, it depends on where exactly this assert is called, and relying on this "implicit deadlock" is not great.
To counter this, I proposed #4735 a while ago, which proposes to log an exception and close the worker if a transition error occurs. I believe this would be a drastic behaviour but still a sane one, even for production. If anything goes wrong during state transitions, we should throw the worker away and rely on the scheduler to clean up the mess.
Thoughts?
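A minimal sketch of what #4735 proposes as described above (the handler placement, the close() arguments, and the log message are assumptions for illustration, not the actual patch):

try:
    recs, instructions = self._transition(ts, finish, stimulus_id=stimulus_id)
except InvalidTransition:
    # Don't let the error vanish into tornado: log loudly and take the
    # worker down so the scheduler can reschedule its tasks elsewhere.
    logger.exception("Invalid state transition on %s; closing worker", self.address)
    self.loop.add_callback(self.close, nanny=False)
    raise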
We only run in validation mode in tests anyway, so I'm totally fine with it.
distributed/worker.py (outdated)
@@ -2671,6 +2680,10 @@ def ensure_communicating(self) -> None:
    if ts.state != "fetch":
        continue

    if not ts.who_has:
        self.transition(ts, "missing", stimulus_id=stimulus_id)
I like having this safety net. I still think it would also be appropriate to add:

diff --git a/distributed/worker.py b/distributed/worker.py
index 7a062876..5e72f007 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2991,6 +2991,9 @@ class Worker(ServerNode):
             for d in has_what:
                 ts = self.tasks[d]
                 ts.who_has.remove(worker)
+                if not ts.who_has:
+                    # TODO send `missing-data` to scheduler?
+                    recommendations[ts] = "missing"
         except Exception as e:
             logger.exception(e)
In this case where we know we're making a task missing, it seems better to immediately transition it.
I'll take a look and try to incorporate this shortly. Thank you for the suggestion.
Thank you for the suggestion. I've verified that this independently fixes the problem and pushed up this change.
I don't like these "safety nets". We had "safety nets" all over the place that not only covered up actually severe problems but also made everything much more complicated than necessary, harder to debug, harder to understand, etc.
This is essentially uncovered, dead code. If we ever hit this line something went wrong. If something goes wrong, we should raise and not try to guess what might be a good resolution.
I prefer the fix in gather_dep over this
I'm happy to remove this. Should I add an if self.validate: check here? That would have caught things previously.
Yes, the validate instead of the transition would be just fine. I do believe this is the only way one could even trigger the transition_fetch_missing transition, and I believe we should get rid of it as well.
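What the validate-instead-of-transition alternative might look like in ensure_communicating (a sketch of the option being discussed, not code from the PR; the assertion message is made up):

if ts.state != "fetch":
    continue

if self.validate:
    # A fetch task should always have at least one candidate worker here.
    assert ts.who_has, f"{ts} is in state 'fetch' but has an empty who_has"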
@@ -2999,6 +3012,8 @@ async def gather_dep(
            for d in has_what:
                ts = self.tasks[d]
                ts.who_has.remove(worker)
                if not ts.who_has:
                    recommendations[ts] = "missing"
Logging here might be helpful for future debugging. Probably shouldn't call it missing-dep, so it can be differentiated from the finally case, but something in the same spirit.
It also might be worth a comment on why we don't send missing-data to the scheduler in this case, but do send it in the other case of a missing dep. (Because in this case, we don't know whether the dep is actually missing from the worker; we're just preemptively assuming it is because we're assuming the whole worker has died, but we don't want to send the scheduler wrong information based on that assumption.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added logging. I haven't added the comment. I felt that I didn't understand the reasoning here sufficiently well to argue for one way or the other.
There are two things where this code path becomes relevant:
- Worker is actually dead. The way we implement this handler here acts as if the worker was dead. We basically purge all the info we know about this worker because we don't expect it to come back. The scheduler will detect the dead worker eventually and reschedule the task. By not sending the missing-data signal we open ourselves to a minor but harmless race condition where the scheduler distributes "faulty" who_has to other workers for a brief time, such that multiple workers may run into this OSError. That's unfortunate from a performance perspective, particularly if they configure very high connect timeouts. The "correct" way would be to send the signal, but it shouldn't matter a lot in the end.
- There was a network blip. We should actually not do anything in case of a network blip but just retry (w/ backoff). Sending a missing-data in this case might actually be very harmful since the scheduler then removes this worker from its who_has and the worker will never receive the "free-keys" signal, i.e. we'd acquire zombie tasks.

We currently cannot distinguish 1. and 2., so we need to find a middle ground. Purging data is safe because we can reacquire this information. Sending missing-data in the wrong situation has the potential for being unsafe, so we should not do it and live with the very minor race condition mentioned above.
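Put together, the handler with the comment being asked for might read roughly like this (a sketch assuming this lives in the OSError handler of gather_dep; the surrounding code is not shown in this excerpt):

except OSError:
    # We cannot tell whether the remote worker is dead or we just hit a
    # network blip. Purging our who_has info for it is safe (it can be
    # re-acquired), but we deliberately do NOT send missing-data to the
    # scheduler: after a mere blip that signal would strip this worker from
    # the scheduler's who_has and leave zombie tasks behind.
    for d in has_what:
        ts = self.tasks[d]
        ts.who_has.remove(worker)
        if not ts.who_has:
            recommendations[ts] = "missing"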
Now that the fetch->missing transition has been removed, I don't understand how this will work. Not all tasks in has_what are guaranteed to be in flight.
> We currently cannot distinguish 1. and 2. so we need to find a middle ground. Purging data is safe because we can reacquire this information. Sending missing-data in the wrong situation has the potential for being unsafe so we should not do it and live with this very minor race condition mentioned above

This explanation is basically what I was looking for in the comment.
(this relies on internal state, I'm not too worried about it)
@fjetter if you have time tomorrow this could use your review. If you're ok with it (or even ok with mild reservations) I encourage you to hit the green button.
I would like to not introduce the fetch->missing transition since it is not supposed to be there. Whenever this transition would be hit, our state was already corrupted.
The proper fix is the exception handler in gather_dep. What bothers me is that I'm pretty sure I already fixed this exact problem before (the finally clause only iterates over fetched tasks, not all tasks). I'll try to find the PR and want to figure out why this is not tested.
I spoke with @fjetter and made his requested changes. I plan to merge after tests pass.
> I would like to not introduce the fetch->missing transition since it is not supposed to be there.

@fjetter last time we talked, I thought you were in favor of doing this. What changed?
FWIW I also don't like having the transition, because I don't think we should be doing what we currently do in the OSError handler. (I don't think OSError is the appropriate signal to do what we're doing.) But if we're not going to remove that behavior, then we do need the transition, because the behavior there is that we're making tasks missing which may be in fetch.
distributed/worker.py (outdated)

if not ts.who_has:
    recommendations[ts] = "missing"
    logger.info(
Ah, please not logger.info, but self.log.append. I want to see this in stories.
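For illustration, the story entry could look something like this (the tuple fields are an assumption; worker stories are just tuples appended to Worker.log):

self.log.append(
    (d, "missing-who-has", worker, stimulus_id, time())
)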
CI seems to agree with @gjoseph92:

Traceback (most recent call last):
  File "d:\a\distributed\distributed\distributed\utils.py", line 693, in log_errors
    yield
  File "d:\a\distributed\distributed\distributed\worker.py", line 3065, in gather_dep
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "d:\a\distributed\distributed\distributed\worker.py", line 2582, in transitions
    a_recs, a_instructions = self._transition(
  File "d:\a\distributed\distributed\distributed\worker.py", line 2518, in _transition
    raise InvalidTransition(
distributed.worker_state_machine.InvalidTransition: Impossible transition from fetch to missing for ('split-simple-shuffle-9a64a54870435a4b5560de9df9d2576e', 2, 6)
2022-04-13 15:06:03,744 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x000001ADCE761850>>, <Task finished name='Task-114657' coro=<Worker.gather_dep() done, defined at d:\a\distributed\distributed\distributed\worker.py:2932> exception=InvalidTransition("Impossible transition from fetch to missing for ('split-simple-shuffle-9a64a54870435a4b5560de9df9d2576e', 2, 6)")>)

@fjetter as discussed, I plan to revert and keep the transition in.
Both of you were right. Of course, the tasks we forgot to iterate over before are typically in state fetch and we need this transition.
I believe my reasoning about not having a missing-data message is still correct.
Planning to merge once CI passes.
Fixes #5951

In #5653 we removed the fetch -> missing transition. This caused deadlocks. Now we add it back in.

- pre-commit run --all-files