Deadlock stealing a resumed task #6159
Comments
The code you are linking to looks just fine. This code path should be hit if the following happens: …
Is there a comm retry configured? If so, yes, it will retry X times for 300s each if the remote is dead:
distributed/distributed/worker.py, line 4342 in c9dcbe7
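For reference, a minimal sketch of how the comm retry and timeout knobs can be inspected and set from Python via `dask.config`. The keys shown (`distributed.comm.retry.count`, `distributed.comm.timeouts.connect`, `distributed.comm.timeouts.tcp`) are the standard comm settings; the concrete values below are illustrative, not recommendations:

```python
import dask

# Inspect the current comm retry/timeout configuration (actual values depend
# on your distributed.yaml and environment variables).
print(dask.config.get("distributed.comm.retry.count"))
print(dask.config.get("distributed.comm.timeouts.connect"))
print(dask.config.get("distributed.comm.timeouts.tcp"))

# Illustrative override: allow a couple of retries and keep the TCP timeout
# short so a dead peer is given up on sooner.
dask.config.set(
    {
        "distributed.comm.retry.count": 2,
        "distributed.comm.timeouts.tcp": "30s",
    }
)
```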
This behaviour should be captured by xref #6112, which introduced a subtle change to the behaviour connected to this edge case, although I think the change is OK. Most importantly, the change that PR introduces happens in the finally clause and/or exception handler of gather_dep, but if you are right about the story being done, this code is never reached.
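To illustrate why that matters: an exception handler or finally clause wrapped around an await only runs once the awaited call returns, raises, or is cancelled. A minimal, generic asyncio sketch (not distributed code) of a fetch whose cleanup never runs while the peer silently hangs:

```python
import asyncio


async def fake_fetch_from_peer():
    # Stands in for a comm read to a peer that is alive at the TCP level
    # but never answers.
    await asyncio.Event().wait()
    return b"data"


async def gather_like():
    try:
        return await fake_fetch_from_peer()
    finally:
        # Only runs when the await above finishes, raises, or is cancelled --
        # never while the peer simply hangs.
        print("cleanup ran")


async def main():
    task = asyncio.create_task(gather_like())
    await asyncio.sleep(1)  # fetch still pending: the finally block has not run
    task.cancel()           # only cancellation (or a broken comm) unblocks it
    try:
        await task
    except asyncio.CancelledError:
        pass


asyncio.run(main())
```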
This looks like the comm is never broken, i.e. the worker never detects that the remote is dead. Are there any network proxies involved that would keep the TCP connection alive?
Are you using the default tornado TCP backend? I've seen such a condition already with a user who implemented their own comm backend, which did not properly close a comm if the peer died.
I think that's the core design problem, and what I meant by "I find it odd". Yes, the code is working as intended, because it's the best we could do with the
This is just your "shuffle in a while loop on Coiled" example, so no special networking, retries, or non-tornado TCP backend. All I did to cause it was take that example and set
This is exactly what we're seeing on the scheduler side (see #6110 (comment) and #6148), so it seems reasonable to assume the same thing is happening in worker<->worker comms. On the scheduler, the comm isn't even timing out after 300s (and the deadlock doesn't resolve itself after 300s), so I'm not sure about that 300s limit (where does it even come from? are you thinking of …?). So that's what seems to be causing the deadlock. If …
Agreed. I've thought about this a few times and considered introducing a way to cancel requests, but it is not trivial. At the same time, we have a similar problem with execute, and I chose to go for the "boring" approach of just waiting it out. I'm happy to revisit this at some point.
My knowledge of low-level Linux is very limited, but to the best of my knowledge the TCP connection would never be broken if the kernel is still alive and kicking while the Python process is frozen for whatever reason (be it the GIL, an actual process suspension, or any other Linux wizardry), because the kernel itself handles TCP keepalive probes (see below).
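For context, a minimal sketch of the kernel-level keepalive knobs involved, using the standard Python socket module on Linux. This is generic socket code, not a claim about how the tornado comm configures its sockets, and not necessarily what the "see below" above originally pointed to:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask the kernel to probe this connection while it is idle. The *remote*
# kernel ACKs these probes itself, so a peer whose Python process is frozen
# (GIL, SIGSTOP, ...) still looks perfectly alive at the TCP level; keepalive
# only detects a dead host/kernel or a broken network path.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning: seconds of idle time before the first probe, seconds
# between probes, and how many unanswered probes declare the peer dead.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
```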
I don't know where the 300s is coming from; you mentioned this in the OP. Note that most of the timeouts we configure are at the application layer, not at the network/TCP layer, with the exception of what's configured in …
FYI #6169, but I don't think it's the same problem, since for the above condition to trigger we'd need to see a …
I also once floated the idea of a circuit-breaker pattern, where the scheduler would be the one entity responsible for detecting dead remotes. If a worker were flagged as dead, a signal would be broadcast to all workers notifying them about the dead peer, and they'd be instructed to abort all comms to that address. For instance, coupled with a mandatory worker-ttl, this would remove our dependency on TCP keepalives. There are obviously pros and cons.
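A rough sketch of what that circuit-breaker idea could look like. Everything here is hypothetical: the class and method names (`SchedulerCircuitBreaker`, `broadcast_to_workers`, `open_comms_to`, and so on) are invented for illustration and are not existing distributed APIs:

```python
# Hypothetical sketch: the scheduler is the single authority on dead peers and
# tells every other worker to stop talking to them.

class SchedulerCircuitBreaker:
    def __init__(self, scheduler, worker_ttl: float):
        self.scheduler = scheduler
        self.worker_ttl = worker_ttl  # no heartbeat within this window => dead

    async def check_heartbeats(self, now: float) -> None:
        for addr, last_seen in self.scheduler.last_heartbeat.items():
            if now - last_seen > self.worker_ttl:
                # Broadcast to all remaining workers: abort every comm to `addr`.
                await self.scheduler.broadcast_to_workers(
                    {"op": "peer-dead", "address": addr}
                )


class WorkerCircuitBreaker:
    def __init__(self, worker):
        self.worker = worker

    def handle_peer_dead(self, address: str) -> None:
        # Abort in-flight gathers to the dead peer immediately instead of
        # waiting on TCP keepalives or application-level timeouts.
        for comm in self.worker.open_comms_to(address):
            comm.abort()
```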
With …
Maybe we're thinking about different things. But what if, instead of an interface like …, we had something like this:

```python
from typing import Awaitable


def request_gather(key: str, who_has: list[str]) -> "GatherRequest":
    """
    Request that a key be fetched from peer workers.

    Multiple requests for the same key will return different `GatherRequest` objects,
    though each will be backed by the same underlying request.
    When the underlying request completes, all `GatherRequest`s will complete.
    If the underlying request is cancelled (because no `GatherRequest`s need it anymore,
    or it is explicitly cancelled), all `GatherRequest`s will be cancelled.
    """


class GatherRequest:
    key: str

    def release(self) -> None:
        """
        Note that this gather request is no longer needed. Idempotent.

        If this is the last `GatherRequest` needing the key, the underlying
        request is cancelled.
        """

    def cancel(self) -> None:
        """
        Explicitly cancel the underlying request for all `GatherRequest`s needing
        this key. Idempotent.

        Use when we know the key should not be fetched anymore (the task has instead
        been stolen to this worker to compute, for example).
        """

    def __del__(self):
        self.release()

    def __await__(self) -> Awaitable[None]:
        """
        Wait until `key` is available in `worker.data`. If the underlying request
        is cancelled, raises CancelledError.
        """
```

This might be easier to reason about and test.
I think this would be a good idea. It's otherwise hard for a worker to know if the peer it's talking to is unresponsive. Setting a timeout on the …
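One way such a timeout could be bolted on, purely as a sketch: wrap whatever coroutine performs the peer request in `asyncio.wait_for`. The `fetch_coro` below is a placeholder for the real gather_dep network call, not an actual distributed API:

```python
import asyncio


async def fetch_with_timeout(fetch_coro, timeout: float):
    # Guard against a peer that is connected but unresponsive: give up after
    # `timeout` seconds instead of waiting for the comm to break.
    try:
        return await asyncio.wait_for(fetch_coro, timeout=timeout)
    except asyncio.TimeoutError:
        # Signal the caller to retry from another replica or re-ask the
        # scheduler who holds the key.
        return None
```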
This issue has become quite stale, and the code has changed drastically since the report. I'm inclined to close this issue; @gjoseph92, any objections?
Since there's no reproducer, sure. |
Here's the (annotated) worker story for the key in question, abcd, which was being fetched from a peer:
There's definitely something weird about the processing-released message arriving right before the compute-task message. I can't find an obvious reason in scheduler code why that would happen. But let's ignore that oddity for a second. Pretend it was just a normal work-stealing request that caused the task to be cancelled.
I find it odd that if a worker is told to compute a task it was previously fetching, it'll resume the fetch:
distributed/distributed/worker.py
Lines 2269 to 2271 in c9dcbe7
If previously we were fetching a key, but now we're being asked to compute it, it seems almost certain that the fetch is going to fail. The compute request should probably take precedence.
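A purely hypothetical sketch of the precedence rule being argued for here; the function and state names below are invented for illustration and do not correspond to distributed's actual transition machinery:

```python
def next_action(previous_intent: str, new_instruction: str) -> str:
    # Hypothetical: what should a cancelled task do when new instructions arrive?
    if new_instruction == "compute-task":
        # Even if we were previously fetching this key, the peer fetch is now
        # likely to fail or hang, so computing locally should win.
        return "transition to waiting and compute"
    if new_instruction == "fetch" and previous_intent == "flight":
        # Resuming the in-flight gather_dep is fine when we still just need the data.
        return "resume gather_dep"
    return "release"
```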
I imagine here that we're assuming the gather_dep will error out sometime in the future, and when it does, the key will go from resumed to waiting?

Also, this is coming from the #6110 scenario. That's an unusual one in that the TCP connection to the stuck worker doesn't get broken, it's just unresponsive. So I'm also wondering if perhaps gather_dep to the stuck worker will hang forever? For 300s (it seems to go much longer than that)? For 300s * some retries? Basically, could it be that this isn't quite a deadlock, but a very, very, very long wait for a dependency fetch that might never return until the other worker properly dies? If we don't have any explicit timeouts on gather_dep already, maybe we should.

(All that said, I still think the proper fix would be to not have transition_cancelled_waiting try to resume the fetch, but instead go down the compute path. The timeout might be something in addition.)

cc @fjetter