Redesign worker exponential backoff on busy-gather #6169
Comments
Quick historical note: one of the objectives of the busy signal was to avoid swarming a single worker, yes. A possibly larger objective, and one that's not immediately obvious from the implementation, is that we should spread data out using a tree of workers. The who_has check with the scheduler and the random worker selection help to ensure this behavior in aggregate. I don't see any concern with the new design with regard to this constraint; I just wanted to make sure that it was in your mind as you were working on this.
Correction: the problem of waiting potentially indefinitely on a local worker was raised, with a proposed fix, in #5206.
When a worker receives a gather_dep request from another worker, but it is too busy with other transfers already, it will reply {"status": "busy"}. When this happens, there are three possible use cases, depending on which other workers, if any, hold a replica of the same data and whether the requesting worker knows about them (the scheduler can be asked through query_who_has).
worker.py implements an exponential backoff system, added in 2018 and never revisited (#2092), which in theory prevents the requesting worker from hammering the busy peer with retries and the scheduler with query_who_has calls.
It is implemented as follows (pseudocode):
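(Reconstructed, self-contained sketch only: the 0.1 s base, the 1.5 factor, and the stubbed-out helpers below are assumptions, and the real worker.py code differs in many details.)

```python
import asyncio


class OldBackoffSketch:
    """Illustrative reconstruction, not the actual distributed.worker.Worker."""

    def __init__(self) -> None:
        # A single counter for the whole worker: any busy reply from any peer bumps it
        self.repetitively_busy = 0

    async def gather_dep(self, worker: str, tasks: list) -> None:
        response = await self.get_data_from_worker(worker, tasks)
        if response["status"] == "busy":
            # Put the tasks back into fetch state right away, so that the next
            # ensure_communicating may retry them (possibly from the same worker)
            for ts in tasks:
                self.transition_flight_fetch(ts)
            self.repetitively_busy += 1
            # Exponential backoff, then refresh replica locations from the scheduler
            await asyncio.sleep(0.1 * 1.5 ** self.repetitively_busy)
            await self.query_who_has(tasks)
        else:
            self.repetitively_busy = 0
        self.ensure_communicating()

    # --- stubs standing in for the real worker machinery --------------------
    async def get_data_from_worker(self, worker: str, tasks: list) -> dict:
        return {"status": "busy"}  # or a successful reply carrying the payload

    def transition_flight_fetch(self, ts) -> None: ...
    async def query_who_has(self, tasks: list) -> None: ...
    def ensure_communicating(self) -> None: ...
```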
If any other unrelated gather_dep completes or any message arrives from the scheduler, ensure_communicating will try fetching the task again. On a busy worker, this will happen almost instantly.
As there is no way to flag which worker was busy, if the busy worker was either the only local worker or the only worker on the whole cluster holding the data, it will be targeted again with get_data_from_worker.
This means that, almost instantly, the busy worker will reply the same thing and repetitively_busy will be bumped up another notch, unless a gather_dep from another worker recently completed and that worker was not busy.
It also means that there will be multiple identical gather_dep calls running in parallel, all waiting on the sleep.
query_who_has will kick in potentially several seconds later, possibly when it's no longer needed, e.g. because another iteration of ensure_communicating has successfully fetched the task from another worker.
On the flip side, the worker may experience a burst of heavy chatter with the scheduler (but not with other workers), followed by silence. For example, a bunch of tasks are submitted in a short timeframe, with just enough time between them that they land in different messages, and now we need to wait for their execution to complete. This may cause repetitively_busy to rocket up to a very high number and the worker to sleep, potentially for centuries, before anything kicks off ensure_communicating again. DEADLOCK!
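To put rough numbers on "centuries", using the assumed 0.1 s base and 1.5 factor from the sketch above:

```python
# Rough magnitude of the backoff sleep: 0.1 * 1.5**repetitively_busy seconds
base, factor = 0.1, 1.5

for n in (10, 20, 30, 40, 50, 60):
    print(f"repetitively_busy={n:2d}: sleep ~ {base * factor ** n:.3g} s")

# n=10: ~5.8 s              n=20: ~330 s (5.5 min)      n=30: ~1.9e4 s (5 hours)
# n=40: ~1.1e6 s (13 days)  n=50: ~6.4e7 s (2 years)    n=60: ~3.7e9 s (over a century)
```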
Finally, this interacts with select_keys_for_gather, which coagulates individually small fetches into the same message. If there are multiple pending tasks in fetch status and they're fairly large (target_message_size = 50 MB), then a single ensure_communicating will spawn multiple contemporaneous gather_dep requests to the same worker, up to total_out_connections. They may all return busy, which in turn will increase repetitively_busy by several notches in a single go. Reducing target_message_size may result in a substantial slowdown, not because of the select_keys_for_gather system itself, but because repetitively_busy reaches a higher number in a single iteration of ensure_communicating.
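As a toy illustration of that batching effect (the constants and the helper below are made up for this example; the real select_keys_for_gather in worker.py is more involved):

```python
TARGET_MESSAGE_SIZE = 50 * 2**20  # ~50 MiB per gather_dep message (assumed)
TOTAL_OUT_CONNECTIONS = 50        # cap on concurrent outgoing transfers (assumed)


def batch_keys_for_gather(pending: dict[str, int]) -> list[list[str]]:
    """Greedily group pending keys (key -> nbytes) into ~TARGET_MESSAGE_SIZE messages."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for key, nbytes in pending.items():
        if current and current_size + nbytes > TARGET_MESSAGE_SIZE:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += nbytes
    if current:
        batches.append(current)
    return batches


# Six 20 MiB tasks all held by the same peer -> three simultaneous gather_dep
# calls to that peer from a single ensure_communicating pass. If the peer
# replies "busy" to all of them, repetitively_busy jumps by three notches at once.
pending = {f"x{i}": 20 * 2**20 for i in range(6)}
batches = batch_keys_for_gather(pending)
n_parallel = min(len(batches), TOTAL_OUT_CONNECTIONS)
print(n_parallel, [len(b) for b in batches])  # 3 [2, 2, 2]
```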
Besides its own shortcomings, this design is a blocker to #5896. The refactor removes the periodic kick to ensure_communicating from handle_scheduler, and instead kicks it off from transition_*_fetch. This means that the above will enter an infinite loop where the requesting worker starts accumulating more and more gather_dep tasks, and the busy worker is constantly hammered with get_data_from_worker. The sleep time quickly rises into the centuries.

A naive solution would be to move the sleep before self.transition_flight_fetch(ts). However, this would mean losing an unplanned, but highly desirable, feature of the current design: if there are multiple workers holding replicas of the same task and the requesting worker is currently chatty, then another random worker will be tried straight away, without waiting for the sleep.

Proposed design
The core of the new design is a per-peer busy: bool flag, marking workers that recently replied with a busy signal. The new algorithm works as follows:
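In outline (a sketch only: it assumes a busy peer is simply skipped for a short, constant delay while any other holder of a replica can be tried immediately; all names and the 0.15 s delay are illustrative, not the actual proposal):

```python
import asyncio
import random


class NewDesignSketch:
    """Illustrative only: per-peer busy tracking instead of one global backoff."""

    def __init__(self) -> None:
        self.busy_workers: set[str] = set()     # peers that recently replied "busy"
        self.who_has: dict[str, set[str]] = {}  # key -> workers holding a replica

    def pick_worker_for(self, key: str) -> str | None:
        # Skip peers currently flagged as busy; the local-vs-remote preference
        # discussed in the note below is omitted for brevity.
        candidates = [w for w in self.who_has.get(key, ()) if w not in self.busy_workers]
        return random.choice(candidates) if candidates else None

    async def on_busy_reply(self, worker: str) -> None:
        # Flag the peer, let everything else be retried immediately, and clear
        # the flag after a short constant delay instead of an unbounded
        # exponential one.
        self.busy_workers.add(worker)
        self.ensure_communicating()   # other replica holders are tried right away
        await asyncio.sleep(0.15)     # illustrative delay
        self.busy_workers.discard(worker)
        self.ensure_communicating()   # the peer may now be targeted again

    def ensure_communicating(self) -> None:
        ...  # stub: schedule gather_dep calls using pick_worker_for()
```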
Note
If at least one local worker holds a replica, the old algorithm waits indefinitely for a local worker to become non-busy and completely ignores remote workers, thus avoiding the cost of network transfer but incurring a delay. The new algorithm instead immediately falls back to remote workers.