Tell workers when their peers have left (so they don't hang fetching data from them) #6790
Comments
I've talked about this in various other issues and called this a "circuit breaker" pattern. I think the "remove-worker" message (I would probably use a slightly different name) should then abort or close all open connections to that worker. This would allow all code waiting for a connection to get an OSError, similar to what would happen if the timeout were actually hit. That should be a straightforward implementation, and we could use a feature toggle to roll this behavior out carefully.
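A minimal sketch of that circuit-breaker idea, assuming the worker's `ConnectionPool` keeps per-address `available` and `occupied` sets of comms (as distributed's pool currently does) and that aborting a comm makes anything awaiting it fail with an OSError-like `CommClosedError`. The helper name and the feature-toggle config key are illustrative, not existing API:

```python
import dask.config
from distributed.core import ConnectionPool


def abort_comms_to(pool: ConnectionPool, address: str) -> None:
    """Abort every pooled comm to `address` so waiters fail fast with OSError."""
    # Hypothetical feature toggle so the behavior can be rolled out carefully.
    if not dask.config.get("distributed.worker.abort-comms-on-remove", False):
        return
    # Idle comms to the dead peer: just tear them down.
    for comm in list(pool.available.get(address, ())):
        comm.abort()
    # Checked-out comms: aborting them causes the code currently awaiting a
    # response to fail immediately, the same failure mode as a connect timeout.
    for comm in list(pool.occupied.get(address, ())):
        comm.abort()
```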
This is not that simple for pending AMM requests, though.
For this we want to be able to cause all checked-out comm uses for a particular address (the removed worker) to raise OSError. Additionally, since the problem here is exacerbated by the `retry_operation` in `get_data_from_worker` (distributed/distributed/worker.py, line 2872 in d74f500), we could change `_handle_gather_dep_network_failure` so that instead of assuming a network error is always fatal, it treats a network error like a busy worker, so that retries are round-robin rather than just hammering the same worker.
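A self-contained sketch of that "treat a network error like a busy worker" idea. The class and method names are hypothetical; the real worker state machine is more involved, but the policy is the same: sideline the failing peer briefly and spread subsequent attempts across the remaining replicas.

```python
from __future__ import annotations

import random
import time


class PeerFetchPolicy:
    """Sideline peers that just failed, the same way "busy" peers are sidelined."""

    def __init__(self, retry_after: float = 5.0):
        # address -> monotonic time after which it may be tried again
        self.busy_until: dict[str, float] = {}
        self.retry_after = retry_after

    def mark_network_failure(self, address: str) -> None:
        # Instead of treating the OSError as fatal, treat the peer as busy for
        # a while, exactly like an explicit "busy" response from it.
        self.busy_until[address] = time.monotonic() + self.retry_after

    def select_replica(self, who_has: set[str]) -> str | None:
        now = time.monotonic()
        usable = [w for w in who_has if self.busy_until.get(w, 0.0) <= now]
        # Spread retries across the remaining replicas rather than hammering
        # the same address on every attempt.
        return random.choice(usable) if usable else None
```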
I think we can safely remove the retry for `get_data_from_worker`. I believe the worker will handle this gracefully: it will remove the peer from its internal bookkeeping and, if necessary, ask the scheduler for help.
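Roughly, what that call site could look like without the retry wrapper. `get_data_from_worker` is distributed's existing helper; the surrounding function and `handle_network_failure` are hypothetical stand-ins for the worker's existing failure path:

```python
from distributed.worker import get_data_from_worker


async def fetch_once(rpc, keys, peer_address, local_address, worker):
    """Fetch `keys` from one peer, without wrapping the call in retry_operation."""
    try:
        return await get_data_from_worker(rpc, keys, peer_address, who=local_address)
    except OSError:
        # No retry: let the worker's normal failure handling remove the peer
        # from its bookkeeping and, if needed, ask the scheduler for a fresh
        # who_has. `handle_network_failure` is a hypothetical name for that path.
        await worker.handle_network_failure(peer_address)
        return None
```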
Yes, that makes sense. I suggest doing this in a dedicated PR; it should be easy to isolate from the other changes necessary for this issue.
There are three tasks that can be worked on separately.
When a worker breaks its connection to the scheduler, or goes too long without heartbeating, the scheduler removes it and reschedules its tasks to run elsewhere. However, other workers may continue trying to fetch data from it far beyond that point. This can lead to something that feels like a deadlock, but isn't quite (if you wait long enough, it will resolve): workers sit waiting for their timeouts to expire while requesting data from peers that no longer exist.
This is significantly exacerbated by increasing connection timeouts or retries. For example, if you set `distributed.comm.retry.count = 5` (default is 0) and `distributed.comm.timeouts.connect = 60s` (default is 30s), you might experience a 5-minute deadlock as a worker keeps trying to connect to its dead peer, until it finally gives up and asks the scheduler for a new peer to try. If you set `distributed.comm.retry.count = 5` and `distributed.comm.timeouts.tcp = 5m` (default is 30s), and a peer worker is almost out of memory and freezes up but doesn't close its TCP connection (#6110 #6208 #6177), you could have a 25-minute deadlock.

Rather than relying solely on connection timeouts on the worker side, the scheduler should probably inform workers when one of their peers has left. The scheduler is a good source of truth, because workers already have to heartbeat to it and maintain a long-lived connection (the batched stream).
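For concreteness, the two configuration combinations described above could be set from Python as follows, with the worst-case wait each implies. The arithmetic mirrors the 5-minute and 25-minute figures in the text; exactly how the retry count multiplies depends on how `retry_operation` counts attempts.

```python
import dask

# First combination: more connect retries with a longer connect timeout.
dask.config.set({
    "distributed.comm.retry.count": 5,           # default is 0
    "distributed.comm.timeouts.connect": "60s",  # default is 30s
})
# Worst case while a peer is dead but not yet known to be dead:
# ~5 retries x 60s connect timeout ~= 5 minutes per fetch attempt.

# Second combination: a frozen peer that never closes its TCP connection.
dask.config.set({
    "distributed.comm.retry.count": 5,      # default is 0
    "distributed.comm.timeouts.tcp": "5m",  # default is 30s
})
# Worst case: ~5 retries x 5-minute TCP timeout ~= 25 minutes.
```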
But naively broadcasting a `remove-worker` message to every worker, every time any worker leaves, probably wouldn't scale well to large clusters. Ideally, we would only inform the workers that are fetching data from the departed worker. We could determine this from `set(wts.processing_on for ts in removed_ws.has_what for wts in ts.waiters if wts.processing_on)` (but something more specific, since not every worker needs to hear the updates about every task).

In that message, we should also include the new `who_has` for the task, with the dead worker removed. (And ensure this actually cancels the `gather_dep` on workers, and doesn't deadlock them if the task has already transitioned to `cancelled` in the meantime.)

cc @crusaderky @fjetter
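Putting those pieces together, a rough scheduler-side sketch might look like the following. The state attributes (`has_what`, `waiters`, `processing_on`, `who_has`, `stream_comms`) match the scheduler's existing bookkeeping, but the op name and the exact payload are assumptions for illustration, not a proposed final design:

```python
def notify_peers_of_removed_worker(scheduler, removed_ws) -> None:
    # Only workers that are waiting on data held by the departed worker need
    # to hear about it, not the whole cluster.
    interested = {
        wts.processing_on
        for ts in removed_ws.has_what
        for wts in ts.waiters
        if wts.processing_on is not None
    }
    # Updated replica locations with the dead worker removed, so fetching
    # workers can retarget gather_dep instead of waiting for a timeout.
    new_who_has = {
        ts.key: [w.address for w in ts.who_has if w is not removed_ws]
        for ts in removed_ws.has_what
    }
    for ws in interested:
        scheduler.stream_comms[ws.address].send(
            {
                "op": "remove-peer-worker",  # hypothetical op name
                "worker": removed_ws.address,
                "who_has": new_who_has,
            }
        )
```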