The current implementation of the worker's `ensure_communicating` will continue to fetch dependencies for as long as there are dependencies to fetch. This can push an already overloaded worker over the edge and cause it to fail.

A paused worker should not be allowed to fetch more data.

There are two possible ways to achieve this:

1. Add another guard to `ensure_communicating` to stop scheduling additional `gather_dep` coroutines.
2. Remove all tasks from a paused worker that aren't in memory. This would indirectly empty the `data_needed` heap and cause the worker to stabilize. This could be achieved either by aggressively stealing or by implementing a custom scheduler handler.

I think both options have a certain appeal. I'm wondering which one is the best to choose, specifically in the context of the latest changes to AMM / retirement / pause.
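To make the first option concrete, here is a minimal sketch of such a guard on a toy model. `ToyWorker`, `Status`, `add_dependency`, and `in_flight` are hypothetical names invented for illustration; this is not the real `distributed.Worker`, and appending to `in_flight` only stands in for scheduling a `gather_dep` coroutine.

```python
import heapq
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical two-state model; the real worker has more states.
    RUNNING = "running"
    PAUSED = "paused"


@dataclass
class ToyWorker:
    """Simplified stand-in for a worker, for illustration only."""

    status: Status = Status.RUNNING
    data_needed: list = field(default_factory=list)  # heap of (priority, key)
    in_flight: list = field(default_factory=list)  # keys with a fetch scheduled

    def add_dependency(self, priority, key):
        heapq.heappush(self.data_needed, (priority, key))

    def ensure_communicating(self):
        # Proposed guard: a paused worker schedules no new fetches,
        # so data_needed is left untouched until the worker resumes.
        if self.status is Status.PAUSED:
            return
        while self.data_needed:
            _, key = heapq.heappop(self.data_needed)
            # Stand-in for scheduling a gather_dep coroutine.
            self.in_flight.append(key)
```

With this guard, dependencies queued while the worker is paused stay in `data_needed` and are only picked up by the next `ensure_communicating` call after the worker returns to running. Note that the guard alone does not move those tasks elsewhere, which is what the second option addresses.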
This is my preferred choice, as it's most likely the one that adds the least complexity. However it also means that if anybody were to disable stealing, they would also face both this issue and #3761. I think #3761 is fairly intuitive to correlate to stealing; this issue less so. So the question is who, if anybody, ever disables stealing.
Code reference for `ensure_communicating`: distributed/distributed/worker.py, lines 2684 to 2687 in 8734c9d.
cc @crusaderky
Note: right now, network traffic is only restricted for egress, i.e. incoming `get_data` requests from other workers; see distributed/distributed/worker.py, lines 1712 to 1716 in 8734c9d.