Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Paused workers should not be able to fetch any more data #5702

Closed
fjetter opened this issue Jan 25, 2022 · 1 comment · Fixed by #6195
Closed

Paused workers should not be able to fetch any more data #5702

fjetter opened this issue Jan 25, 2022 · 1 comment · Fixed by #6195
Assignees
Labels

Comments

@fjetter
Copy link
Member

fjetter commented Jan 25, 2022

The current implementation of the worker ensure_communicating will continue to fetch dependencies for as long as there are dependencies to fetch. This can push an already overloaded worker over the edge and cause it to fail.

while self.data_needed and (
len(self.in_flight_workers) < self.total_out_connections
or self.comm_nbytes < self.comm_threshold_bytes
):

A paused worker should not be allowed to fetch more data.

There are two possible ways to achieve this

  1. Add another guard to ensure_communicating to stop scheduling additional gather_dep coroutines
  2. Remove all tasks from a paused worker that aren't in memory. This would indirectly empty the data_needed heap and cause a worker to stabilize. This could be achieved by either aggressively stealing or by implementing a custom scheduler handler.

I think both options have a certain appeal. I'm wondering which one is the best to choose, specifically in context of the latest changes to AMM / retirement / pause.

cc @crusaderky

Note: right now, network traffic is only restricted for egress, i.e. incoming get_data requests from other workers, see

if self.status == Status.paused:
max_connections = 1
throttle_msg = " Throttling outgoing connections because worker is paused."
else:
throttle_msg = ""

@crusaderky
Copy link
Collaborator

Remove all tasks from a paused worker that aren't in memory.

This approach would fix at the same time this issue and

aggressively stealing

This is my preferred choice, as it's most likely the one that adds the least complexity. However it also means that if anybody were to disable stealing, they would also face both this issue and #3761. I think #3761 is fairly intuitive to correlate to stealing; this issue less so. So the question is who, if anybody, ever disables stealing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants