"victim": victim, # guaranteed to be processing_on
"thief": thief,
"victim_duration": victim_duration,
"thief_duration": thief_duration,
"stimulus_id": stimulus_id,
}
If a worker disconnects, then reconnects from the address (it restarts, or the shutdown-reconnect bug happens #6354), task stealing can hold a reference to the old WorkerState object for that address, while the scheduler is working with the new WorkerState object for that address.
If tasks are assigned to this old, stale WorkerState, and then the worker leaves, the tasks will be forever stuck in processing (because they're not recognized as being on the worker that just left).
The task gets assigned to a worker that, to the scheduler, no longer exists.
When worker X actually shuts itself down, Scheduler.remove_workergoes to reschedule any tasks it's processing. But it's looking at the new WorkerState instance, and the task was assigned to the old one, so the task is never rescheduled.
Task stealing keeps references to
WorkerState
objects:distributed/distributed/stealing.py
Lines 245 to 251 in b8b45c6
If a worker disconnects, then reconnects from the address (it restarts, or the shutdown-reconnect bug happens #6354), task stealing can hold a reference to the old
WorkerState
object for that address, while the scheduler is working with the newWorkerState
object for that address.If tasks are assigned to this old, stale
WorkerState
, and then the worker leaves, the tasks will be forever stuck in processing (because they're not recognized as being on the worker that just left).Full trace-through
Copied from #6263 (comment)
1. Work stealing sends a steal-request to worker Y (where the task is currently queued), asking it to cancel the task.
2. It stores references to the `victim` and `thief` `WorkerState`s (not addresses) in `WorkStealing.in_flight`.
3. The thief, worker X, disconnects. Its `WorkerState` instance (the one currently referenced by `WorkStealing.in_flight`) is removed from `Scheduler.workers`.
4. Worker X reconnects. A new `WorkerState` instance for it is added to `Scheduler.workers` at the same address. The scheduler thinks nothing is processing on it.
5. Worker Y confirms the steal. `move_task_confirm` handles this and pops info about the stealing operation from `WorkStealing.in_flight`.
6. The task is assigned to the stored `thief` `WorkerState` object. This is the old `WorkerState` instance, which is no longer in `Scheduler.workers`.
7. The task is now assigned to a worker that, to the scheduler, no longer exists. Nothing notices, because the `thief`'s address is in `Scheduler.workers`, even though the `thief` object itself isn't.
8. When worker X actually shuts itself down, `Scheduler.remove_worker` goes to reschedule any tasks it's processing. But it's looking at the new `WorkerState` instance, and the task was assigned to the old one, so the task is never rescheduled.

I think work stealing should either:
- Store the address and `id` of the `WorkerState` instance, instead of a direct reference, and verify that `id(scheduler.workers[addr]) == expected_id` before using it. An advantage is that this avoids reference leaks of `WorkerState` objects (though they should eventually be cleaned up when the task completes; see #6250).
- Or check that `d["thief"] is self.scheduler.workers[thief.address]` before assigning the task.
In combination with #6354, this causes #6263 and #6198.
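For reference, the failure mode itself reduces to a stale-reference-in-a-dict hazard. A minimal sketch, using toy classes rather than the real `Scheduler`/`WorkerState` API:

```python
class WorkerState:
    def __init__(self, address):
        self.address = address
        self.processing = set()

workers = {}  # stands in for Scheduler.workers: address -> WorkerState

old = WorkerState("tcp://10.0.0.1:1234")
workers[old.address] = old

# The worker disconnects and reconnects: a fresh WorkerState replaces the
# old one under the same address key, but `old` is still referenced
# elsewhere (e.g. by WorkStealing.in_flight).
new = WorkerState("tcp://10.0.0.1:1234")
workers[new.address] = new

# The steal confirmation assigns the task to the stale instance:
old.processing.add("task-1")

# remove_worker looks up the *current* WorkerState for the address and
# reschedules only its tasks; the stale instance's tasks are never seen.
ws = workers.pop("tcp://10.0.0.1:1234")
rescheduled = set(ws.processing)  # empty: "task-1" is stuck forever
```

The task is stuck precisely because dict membership is keyed by address while assignment went through an object reference, which is why both proposed fixes re-anchor the steal on the address before committing.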