worker-ttl timeout should attempt a nanny restart #8537

Closed
crusaderky opened this issue Feb 27, 2024 · 1 comment · Fixed by #8538

Comments

@crusaderky (Collaborator) commented Feb 27, 2024

The scheduler forcefully disconnects a worker

  • 30 seconds (distributed.comm.timeouts.tcp) after the Linux kernel of the host running the worker becomes unresponsive, or
  • 300 seconds (distributed.scheduler.worker-ttl) after the worker's event loop becomes unresponsive,

whichever happens first.
In both cases, if the nanny is still responsive, it shuts itself down permanently instead of restarting the worker.
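
Both intervals come from the dask configuration and can be tuned; a minimal sketch (the values shown simply mirror the numbers quoted above):

import dask

# The two intervals referenced above; the values mirror the quoted numbers.
dask.config.set(
    {
        "distributed.comm.timeouts.tcp": "30s",
        "distributed.scheduler.worker-ttl": "300s",
    }
)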

In the second case, where the Worker process is unresponsive but the underlying network and kernel are healthy, we're missing a trick: we should instead ask the nanny to restart the worker.
This is particularly important e.g. on static clusters, where there isn't any additional healing system available.

Additionally, when the worker process is so borked that it won't even notice that the batched comms have been shut down (e.g. a GIL-holding C package went into an infinite loop, or a poorly written WorkerPlugin or async task did so), the nanny won't restart it.
Again, this could be fixed by asking the nanny to restart the worker.
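
A rough sketch of what "ask the nanny to restart the worker" could look like on the scheduler side. The helper name, the rpc call shape, and the restart handler are assumptions for illustration only, not the actual change in #8538:

async def restart_unresponsive_worker(scheduler, ws):
    # Hypothetical helper, not the actual fix in #8538.
    nanny_addr = ws.nanny  # the scheduler tracks each worker's nanny address
    if nanny_addr is None:
        # No nanny (e.g. started with --no-nanny): all we can do is drop the worker.
        await scheduler.remove_worker(address=ws.address, close=True)
        return
    # Talk to the nanny directly, bypassing the worker's (possibly dead) event loop;
    # assumes the nanny exposes a restart handler over RPC.
    await scheduler.rpc(nanny_addr).restart()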

An actual example: post-mortem

@coiled.function clusters, notably, feature a single "static" worker that runs on the same host as the scheduler. If that worker dies, nothing will bring it back up. For example, here the worker was struggling with TCP connections, which caused the worker-ttl to lapse. Twice.

The first time, it was resurrected by the nanny. The second time it wasn't:

[image]

The process:

  1. when the worker-ttl expires, the scheduler calls Scheduler.remove_worker(close=True),
  2. which in turn sends a message to the worker over the batched comms immediately before those comms are shut down,
  3. which, when and if it arrives, triggers Worker.close(nanny=True),
  4. which triggers an RPC call from worker to the nanny,
  5. which, when and if it arrives, sets the nanny to status=closing_gracefully,
  6. which, when the worker shuts itself down eventually afterwards, prevents the nanny from restarting it.

What happened in the first case, where the nanny restarted the worker? I think the RPC call from the worker to the nanny failed, so the worker gave up and shut itself down without informing the nanny first.
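
A toy model of the decision in steps 5-6, just to make the two post-mortem outcomes concrete (ToyNanny is an illustration, not the real distributed.Nanny):

from enum import Enum

class Status(Enum):
    running = "running"
    closing_gracefully = "closing_gracefully"

class ToyNanny:
    """Toy stand-in for the restart decision described in steps 5-6."""

    def __init__(self):
        self.status = Status.running

    def close_gracefully(self):
        # Step 5: the worker's RPC call flips this flag on the nanny ...
        self.status = Status.closing_gracefully

    def on_worker_exit(self):
        # Step 6: ... which prevents the restart once the worker process exits.
        if self.status is Status.closing_gracefully:
            return "shut down permanently"
        return "restart worker"

# First post-mortem case: the RPC never reached the nanny, so it restarts the worker.
assert ToyNanny().on_worker_exit() == "restart worker"

# Second case: the RPC arrived first, so the nanny shuts down for good.
nanny = ToyNanny()
nanny.close_gracefully()
assert nanny.on_worker_exit() == "shut down permanently"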

@crusaderky (Collaborator, Author) commented Feb 29, 2024

> Additionally, when the worker process is so borked that it won't even notice that the batched comms have been shut down (e.g. a GIL-holding C package went into an infinite loop, or a poorly written WorkerPlugin or async task did so), the nanny won't restart it.

Reproducer:

# `c` is a distributed.Client connected to a cluster whose workers run under nannies.
async def kill_event_loop():
    # Never awaits, so the worker's event loop is blocked forever and cannot
    # even notice that the scheduler has closed the batched comms.
    while True:
        pass

fut = c.submit(kill_event_loop)
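
With this task running, the worker's event loop never yields again, so the scheduler's close request can never be processed. The worker process stays alive but unresponsive, and the nanny, seeing a live process, never restarts it.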
