Deadlock: tasks stolen to old WorkerState instance of a reconnected worker #6356

Closed
gjoseph92 opened this issue May 17, 2022 · 2 comments · Fixed by #6585
Labels: bug (Something is broken) · deadlock (The cluster appears to not make any progress) · stability (Issue or feature related to cluster stability, e.g. deadlock)

Comments

@gjoseph92 (Collaborator)

Task stealing keeps references to WorkerState objects:

self.in_flight[ts] = {
"victim": victim, # guaranteed to be processing_on
"thief": thief,
"victim_duration": victim_duration,
"thief_duration": thief_duration,
"stimulus_id": stimulus_id,
}

If a worker disconnects, then reconnects from the same address (because it restarts, or because the shutdown-reconnect bug #6354 happens), task stealing can hold a reference to the old WorkerState object for that address while the scheduler is working with the new WorkerState object for that address.

If tasks are assigned to this old, stale WorkerState and the worker then leaves, the tasks will be stuck in processing forever (because they're not recognized as being on the worker that just left).
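
Below is a minimal sketch of that failure mode, using simplified, hypothetical stand-ins for Scheduler.workers and WorkerState (illustrative only, not the actual scheduler code):

# Simplified model: work stealing keeps the object, the scheduler registers a
# fresh object per (re)connection, and the two silently diverge.

class WorkerState:
    def __init__(self, address):
        self.address = address
        self.processing = set()            # tasks the scheduler thinks run here

workers = {}                               # stands in for Scheduler.workers

old = WorkerState("tcp://10.0.0.1:40000")
workers[old.address] = old
in_flight = {"thief": old}                 # stealing stores the *object*, not the address

# Worker disconnects, then reconnects from the same address:
del workers[old.address]                   # remove_worker
new = WorkerState("tcp://10.0.0.1:40000")  # a brand-new WorkerState is registered
workers[new.address] = new

thief = in_flight["thief"]
assert thief.address in workers            # the address still looks valid...
assert workers[thief.address] is not thief # ...but it maps to a different object

thief.processing.add("task-x")             # task assigned to the stale instance
# When the worker finally leaves, only workers[address].processing (empty) is
# rescheduled, so "task-x" stays in processing forever.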

Full trace-through

Copied from #6263 (comment)

  1. Stealing decides to move a task to worker X.
       1. It queues a steal-request to worker Y (where the task is currently queued), asking it to cancel the task.
       2. It stores a reference to the victim and thief WorkerStates (not their addresses) in WorkStealing.in_flight.
  2. Worker X gets removed by the scheduler.
  3. Its WorkerState instance (the one currently referenced in WorkStealing.in_flight) is removed from Scheduler.workers.
  4. Worker X heartbeats to the scheduler, reconnecting (the bug described above).
  5. A new WorkerState instance for it is added to Scheduler.workers, at the same address. The scheduler thinks nothing is processing on it.
  6. Worker Y finally replies, "hey yeah, it's all cool if you steal that task".
  7. move_task_confirm handles this, and pops info about the stealing operation from WorkStealing.in_flight.
  8. This info contains a reference to the thief WorkerState object. This is the old WorkerState instance, which is no longer in Scheduler.workers.
  9. The thief's address is in Scheduler.workers, even though the thief object isn't.
  10. The task gets assigned to a worker that, to the scheduler, no longer exists.
  11. When worker X actually shuts itself down, Scheduler.remove_worker goes to reschedule any tasks it's processing. But it's looking at the new WorkerState instance, and the task was assigned to the old one, so the task is never rescheduled. (The sketch after this list spells out the invariant that breaks here.)
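
The invariant broken in steps 9–11 can be written as a validation check. This is a hypothetical helper (not existing scheduler code), assuming the usual Scheduler.tasks, TaskState.processing_on, and Scheduler.workers attributes:

def validate_processing_on(scheduler):
    # Every processing task must point at the WorkerState *object* that the
    # scheduler currently has registered for that address.
    for ts in scheduler.tasks.values():
        ws = ts.processing_on
        if ws is None:
            continue
        current = scheduler.workers.get(ws.address)
        # For a task assigned to the stale instance, ws.address is still a key
        # in scheduler.workers (step 9), but `current is ws` is False.
        assert current is ws, (ts.key, ws.address)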

I think work stealing should either store worker addresses instead of WorkerState references (and look up the current instance in Scheduler.workers when the steal is confirmed), or check that the thief it recorded is still the WorkerState instance registered in Scheduler.workers before assigning the task.

In combination with #6354, this causes #6263 and #6198.
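
As a sketch of the second option, a guard along these lines could run in the steal-confirmation path (hypothetical helper, not necessarily what #6585 implements):

def thief_is_current(scheduler, thief):
    # An address match alone is not enough after a reconnect; compare object
    # identity against the WorkerState currently registered for that address.
    return thief is not None and scheduler.workers.get(thief.address) is thief

# In the confirmation handler, something like:
#
#     info = self.in_flight.pop(ts)
#     thief = info["thief"]
#     if not thief_is_current(self.scheduler, thief):
#         # The worker re-registered (or left) since the steal was requested;
#         # fall back to normal scheduling instead of assigning the task to a
#         # stale WorkerState that Scheduler.remove_worker will never see.
#         ...  # requeue via the usual transition machinery
#         return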

gjoseph92 added the bug, stability, and deadlock labels on May 17, 2022
@fjetter (Member) commented May 25, 2022

Can this still be a thing after the removal of the worker reconnect?

fjetter added the needs info label on May 25, 2022
@gjoseph92 (Collaborator, Author) commented May 25, 2022

Yes. A new worker could connect to the cluster with the same address as one that left: #6398 (comment), #6392.

This issue is now much less likely, but still possible, especially with a nanny that might restart the worker at the same address.
