worker-ttl timeout should attempt a nanny restart #8537

Closed
crusaderky opened this issue Feb 27, 2024 · 1 comment · Fixed by #8538

Comments

@crusaderky (Collaborator) commented Feb 27, 2024

The scheduler forcefully disconnects a worker

  • 30 seconds (distributed.comm.timeouts.tcp) after the Linux kernel of the host running the worker becomes unresponsive, or
  • 300 seconds (distributed.scheduler.worker-ttl) after the worker's event loop becomes unresponsive,

whichever happens first.
In both cases, if the nanny is still responsive, it shuts itself down permanently instead of restarting the worker.
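
Both intervals come from the dask configuration and can be tuned; a minimal sketch (the values shown simply mirror the numbers quoted above):

import dask

# The two intervals referenced above; the values mirror the quoted numbers.
dask.config.set(
    {
        "distributed.comm.timeouts.tcp": "30s",
        "distributed.scheduler.worker-ttl": "300s",
    }
)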

In the second case, where the Worker process is unresponsive but the underlying network and kernel are healthy, we're missing a trick: we should instead ask the nanny to restart the worker.
This is particularly important e.g. on static clusters, where there isn't any additional healing system available.

Additionally, when the worker process is so borked that it won't even notice that the batched comms have been shut down (e.g. a GIL-holding C package went into an infinite loop, or a poorly written WorkerPlugin or async task did so), the nanny won't restart it.
Again, this could be fixed by asking the nanny to restart the worker.
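
A rough sketch of what "ask the nanny to restart the worker" could look like on the scheduler side. The helper name, the rpc call shape, and the restart handler are assumptions for illustration only, not the actual change in #8538:

async def restart_unresponsive_worker(scheduler, ws):
    # Hypothetical helper, not the actual fix in #8538.
    nanny_addr = ws.nanny  # the scheduler tracks each worker's nanny address
    if nanny_addr is None:
        # No nanny (e.g. started with --no-nanny): all we can do is drop the worker.
        await scheduler.remove_worker(address=ws.address, close=True)
        return
    # Talk to the nanny directly, bypassing the worker's (possibly dead) event loop;
    # assumes the nanny exposes a restart handler over RPC.
    await scheduler.rpc(nanny_addr).restart()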

An actual example: post-mortem

@coiled.function clusters, notably, feature a single "static" worker that runs on the same host as the scheduler. If that worker dies, nothing will bring it back up. For example, here the worker was struggling with TCP connections, which caused the worker-ttl to lapse. Twice.

The first time, it was resurrected by the nanny. The second time it wasn't:

[image]

The process:

  1. when the worker-ttl expires, the scheduler calls Scheduler.remove_worker(close=True),
  2. which in turn sends a message to the worker over the batched comms immediately before those comms are shut down,
  3. which, when and if it arrives, triggers Worker.close(nanny=True),
  4. which triggers an RPC call from worker to the nanny,
  5. which, when and if it arrives, sets the nanny to status=closing_gracefully,
  6. which, when the worker shuts itself down eventually afterwards, prevents the nanny from restarting it.

What happened in the first case, where the nanny restarted the worker? I think the RPC call from the worker to the nanny failed, so the worker gave up and shut itself down without informing the nanny first.
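
A toy model of the decision in steps 5-6, just to make the two post-mortem outcomes concrete (ToyNanny is an illustration, not the real distributed.Nanny):

from enum import Enum

class Status(Enum):
    running = "running"
    closing_gracefully = "closing_gracefully"

class ToyNanny:
    """Toy stand-in for the restart decision described in steps 5-6."""

    def __init__(self):
        self.status = Status.running

    def close_gracefully(self):
        # Step 5: the worker's RPC call flips this flag on the nanny ...
        self.status = Status.closing_gracefully

    def on_worker_exit(self):
        # Step 6: ... which prevents the restart once the worker process exits.
        if self.status is Status.closing_gracefully:
            return "shut down permanently"
        return "restart worker"

# First post-mortem case: the RPC never reached the nanny, so it restarts the worker.
assert ToyNanny().on_worker_exit() == "restart worker"

# Second case: the RPC arrived first, so the nanny shuts down for good.
nanny = ToyNanny()
nanny.close_gracefully()
assert nanny.on_worker_exit() == "shut down permanently"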

@crusaderky (Collaborator, Author) commented Feb 29, 2024

> Additionally, when the worker process is so borked that it won't even notice that the batched comms have been shut down (e.g. a GIL-holding C package went into an infinite loop, or a poorly written WorkerPlugin or async task did so), the nanny won't restart it.

Reproducer:

# `c` is a distributed.Client connected to a cluster whose workers run under nannies.
async def kill_event_loop():
    # Never awaits, so the worker's event loop is blocked forever and cannot
    # even notice that the scheduler has closed the batched comms.
    while True:
        pass

fut = c.submit(kill_event_loop)
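
With this task running, the worker's event loop never yields again, so the scheduler's close request can never be processed. The worker process stays alive but unresponsive, and the nanny, seeing a live process, never restarts it.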
