The scheduler forcefully disconnects a worker:

- 30 seconds (`distributed.comm.timeouts.tcp`) after the Linux kernel of the host running the worker became unresponsive, or
- 300 seconds (`distributed.scheduler.worker-ttl`) after the worker's event loop became unresponsive,

whichever happens first.
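For reference, both thresholds are ordinary dask configuration values and can be inspected or tuned. A minimal sketch; the values shown in the comments are the upstream defaults matching the 30 s / 300 s figures above, and a particular deployment may override them:

```python
import dask
import distributed  # noqa: F401  # importing distributed registers its config defaults

# Inspect the two thresholds referenced above.
print(dask.config.get("distributed.comm.timeouts.tcp"))     # "30s"
print(dask.config.get("distributed.scheduler.worker-ttl"))  # "5 minutes"

# Raising worker-ttl only delays the problem described below: once it fires,
# the worker is still closed permanently rather than restarted.
dask.config.set({"distributed.scheduler.worker-ttl": "10 minutes"})
```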
In both cases, if the nanny is still responsive, the worker is shut down permanently and the nanny does not restart it.
In the second case, where the Worker process is KO but the underlying network and kernel are healthy, we're missing a trick: we should instead ask the nanny to restart the worker.
This is particularly important on, e.g., static clusters, where there is no additional self-healing system available.
Additionally, when the worker process is so borked that it won't even notice that the batched comms have been shut down (e.g. a GIL-holding C package went into an infinite loop, or a poorly written `WorkerPlugin` or async task did so), the nanny won't restart it.
Again, this could be fixed by asking the nanny to restart the worker.
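To make that failure mode concrete, here is a hypothetical reproducer: a worker plugin whose `setup` blocks the event loop forever. The plugin name, the busy loop, and the scheduler address are illustrative only; any GIL-holding C loop has the same effect. On older versions the registration call is `register_worker_plugin` rather than `register_plugin`.

```python
from distributed import Client, WorkerPlugin


class BusyLoopPlugin(WorkerPlugin):
    """Illustrative only: simulates a GIL-holding infinite loop on the worker."""

    def setup(self, worker):
        # Runs in the worker's event loop thread. Never returning means the
        # worker stops answering heartbeats, so worker-ttl eventually fires;
        # the process stays alive, the batched comms shutdown goes unnoticed,
        # and the nanny never restarts it.
        while True:
            pass


if __name__ == "__main__":
    client = Client("tcp://scheduler:8786")  # address is an assumption
    client.register_plugin(BusyLoopPlugin())
```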
### An actual example - post mortem
`@coiled.function` clusters, notably, feature a single "static" worker that runs on the same host as the scheduler. If that worker dies, nothing will bring it back up. For example, here the worker was struggling with TCP connections, which caused the `worker-ttl` to lapse. Twice.
The first time, it was resurrected by the nanny. The second time it wasn't:
The process:

- when the `worker-ttl` expires, the scheduler calls `Scheduler.remove_worker(close=True)`,
- which in turn sends a batched comms message to the worker immediately before the batched comms are shut down,
- which, when and if it arrives, triggers `Worker.close(nanny=True)`,
- which triggers an RPC call from the worker to the nanny,
- which, when and if it arrives, sets the nanny to `status=closing_gracefully`,
- which, when the worker eventually shuts itself down afterwards, prevents the nanny from restarting it.
What happened in the first case, where the nanny did restart the worker? I think the RPC call from the worker to the nanny failed, so the worker gave up and shut itself down without informing the nanny first.
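The fix proposed above (ask the nanny to restart the worker) could bypass the fragile worker-side relay entirely. Below is only a rough sketch, assuming the scheduler talks to the nanny address it already tracks (`ws.nanny`) and uses the nanny's restart RPC handler; the helper name and address are hypothetical, and this is not how `remove_worker` behaves today:

```python
import asyncio

from distributed.core import rpc


async def restart_worker_via_nanny(nanny_address: str) -> None:
    """Hypothetical helper: ask a nanny to recycle its worker process."""
    async with rpc(nanny_address) as nanny:
        # Goes straight to the nanny instead of relying on the (possibly hung)
        # worker to relay Worker.close(nanny=True), i.e. the chain that breaks
        # in the scenarios described in this issue.
        await nanny.restart(timeout=30)


if __name__ == "__main__":
    # The address is illustrative; on the scheduler it is available as ws.nanny.
    asyncio.run(restart_worker_via_nanny("tcp://10.0.0.1:41234"))
```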