
Refactor restart() and restart_workers() #8550

Merged: 2 commits into dask:main from restart_refactor, Mar 21, 2024

Conversation

@crusaderky (Collaborator) commented Mar 1, 2024

Refactor Client.restart() and Client.restart_workers() to use the same machinery for worker restart.

This is preparatory work for #8537, where the scheduler will need to call restart_workers() internally.

Changed behaviour

  • restart() will no longer return self. It's weird and it was never advertised in the documentation anyway.
  • in restart_workers(), the scheduler will now unilaterally remove workers where the nanny failed to terminate the worker within timeout seconds. This aligns its behaviour to restart().
  • default timeout for restart_workers() has changed from infinity to distributed.comm.timeouts.connect * 4 (2 minutes by default). This aligns its behaviour to restart(). The method no longer accepts timeout=None (see the usage sketch after this list).
  • restart_workers() no longer forwards the exception to the client if the worker fails to restart. While this is, strictly speaking, a loss of functionality, I can think of very few use cases where this can happen, none of which seem particularly interesting. Note that this is specifically the case where the nanny is healthy and in the middle of an RPC call with the scheduler, while the worker is so busted that it fails to restart within the timeout. The only semi-realistic use case I can think of is a WorkerPlugin that fails to acquire an external resource (e.g. connect to a database) on startup.
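
A minimal usage sketch of the changed behaviour, assuming hypothetical scheduler and worker addresses (keyword names follow the public Client API):

    from distributed import Client

    client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

    # restart() now returns None rather than self, so it can no longer be chained.
    client.restart()

    # restart_workers() now defaults to distributed.comm.timeouts.connect * 4
    # (2 minutes by default); timeout=None is rejected. Workers whose nanny
    # fails to terminate the worker within the timeout are removed by the
    # scheduler instead of raising an exception in the client.
    client.restart_workers(
        ["tcp://10.0.0.1:40000"],  # hypothetical worker address
        timeout=120,
    )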

@@ -851,6 +851,7 @@ async def kill(
 assert self.status in (
     Status.running,
     Status.failed,  # process failed to start, but hasn't been joined yet
+    Status.closing_gracefully,
@crusaderky (Collaborator, Author) commented Mar 1, 2024

This covers a race condition where Worker.close(nanny=True) ran successfully up to and including sending the close message to the nanny, and then hung or crashed - for example because of a bad plugin or extension teardown.
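
A hedged sketch of the race being described (VALID_KILL_STATUSES is a hypothetical name; the real code asserts the tuple inline, as in the diff above):

    from distributed.core import Status

    # By the time the nanny is asked to kill the worker process, the worker
    # may already be in closing_gracefully: Worker.close(nanny=True) delivered
    # its message to the nanny and then hung or crashed during teardown.
    VALID_KILL_STATUSES = (
        Status.running,
        Status.failed,              # process failed to start, not yet joined
        Status.closing_gracefully,  # close() reached the nanny, then stalled
    )

    assert Status.closing_gracefully in VALID_KILL_STATUSES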

github-actions bot (Contributor) commented Mar 1, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

29 files ±0   29 suites ±0   11h 28m 31s ⏱️ +16m 19s
4 056 tests +1   3 934 ✅ −1   109 💤 ±0   13 ❌ +2
54 902 runs +32   52 408 ✅ +36   2 409 💤 −6   85 ❌ +2

For more details on these failures, see this check.

Results for commit 3d3140f. ± Comparison against base commit 8927bfd.

This pull request removes 4 tests and adds 5. Note that renamed tests count towards both.

Removed:
distributed.tests.test_client ‑ test_restart_workers_exception[False]
distributed.tests.test_client ‑ test_restart_workers_exception[True]
distributed.tests.test_client ‑ test_restart_workers_timeout[False]
distributed.tests.test_client ‑ test_restart_workers_timeout[True]

Added:
distributed.tests.test_client ‑ test_restart_workers_exception
distributed.tests.test_client ‑ test_restart_workers_kill_timeout[False]
distributed.tests.test_client ‑ test_restart_workers_kill_timeout[True]
distributed.tests.test_client ‑ test_restart_workers_restart_timeout[False]
distributed.tests.test_client ‑ test_restart_workers_restart_timeout[True]

♻️ This comment has been updated with latest results.

 timeout = parse_timedelta(timeout, "s")
 if timeout is None:
     raise ValueError("None is an invalid value for global client timeout")
@crusaderky (Collaborator, Author) commented:

There are so many places in the client code that blindly assume it's not None. The JSON schema does not allow None either.
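
For context, dask.utils.parse_timedelta normalises numbers and strings to seconds but passes None through unchanged, so the explicit guard above is what actually rejects None. A small illustration (behaviour as of current dask):

    from dask.utils import parse_timedelta

    assert parse_timedelta(3, "s") == 3              # numbers pass through
    assert parse_timedelta("2 minutes", "s") == 120  # strings are parsed
    assert parse_timedelta(None, "s") is None        # None is NOT rejected here,
                                                     # hence the ValueError guard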

@crusaderky force-pushed the restart_refactor branch 6 times, most recently from 44e66f2 to 132c8bb on March 4, 2024 18:33
@crusaderky changed the title from "[WIP] Refactor restart() and restart_workers()" to "Refactor restart() and restart_workers()" on Mar 4, 2024
@crusaderky marked this pull request as ready for review March 4, 2024 18:37
@crusaderky requested a review from fjetter as a code owner March 4, 2024 18:37
@crusaderky (Collaborator, Author) commented:

All test failures are unrelated.

@crusaderky marked this pull request as draft March 5, 2024 15:09
@crusaderky marked this pull request as ready for review March 5, 2024 15:41
@crusaderky force-pushed the restart_refactor branch 2 times, most recently from 3f50a2c to 30dab21 on March 5, 2024 16:21
@crusaderky mentioned this pull request Mar 8, 2024
@crusaderky force-pushed the restart_refactor branch 4 times, most recently from 0b49fbd to 3010f86 on March 12, 2024 13:00
@hendrikmakait self-requested a review March 19, 2024 13:12
@hendrikmakait (Member) commented Mar 20, 2024

Quoting the PR description:

> restart_workers() no longer forwards the exception to the client if the worker fails to restart. [...] The only semi-realistic use case I can think of is a WorkerPlugin that fails to acquire an external resource (e.g. connect to a database) on startup.

I suppose another realistic scenario would be a task running on the worker that repeatedly blocks the GIL for extended periods? Granted, there's probably not much value added over a simple "timed out" message in that case.

@hendrikmakait (Member) left a review:

Thanks, @crusaderky! I love seeing their behavior finally aligned. A few nits around code style/typing; feel free to ignore.

     raise_for_error: bool = True,
-) -> dict[str, str]:
+):
@hendrikmakait (Member) commented:

nit:

Suggested change:
-):
+) -> dict[str, Literal["OK", "removed", "timed out"]]:

@crusaderky (Collaborator, Author) replied:

No. We return Awaitable[dict] | dict depending on whether self.sync is true or false, and functions should never return a Union (python/typing#566).
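
A toy model of the pattern being described (simplified; the real Client routes through its sync machinery, and MiniClient is purely illustrative):

    import asyncio

    class MiniClient:
        """Illustrates why the return annotation is omitted: the same method
        returns either a dict (synchronous client) or a coroutine
        (asynchronous client), so annotating it would require a Union."""

        def __init__(self, asynchronous: bool) -> None:
            self.asynchronous = asynchronous

        async def _restart_workers(self, workers: list[str]) -> dict[str, str]:
            return {w: "OK" for w in workers}  # placeholder result

        def restart_workers(self, workers: list[str]):  # annotation deliberately omitted
            coro = self._restart_workers(workers)
            if self.asynchronous:
                return coro           # Awaitable[dict[str, str]]
            return asyncio.run(coro)  # dict[str, str]

    print(MiniClient(asynchronous=False).restart_workers(["tcp://10.0.0.1:40000"]))
    # {'tcp://10.0.0.1:40000': 'OK'}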

@hendrikmakait (Member) replied:

Right, thanks for the clarification.

Resolved (outdated) review threads: distributed/utils_test.py, distributed/client.py (×3)
     timeout: float = 30,
     wait_for_workers: bool = True,
     stimulus_id: str,
 ) -> None:
@hendrikmakait (Member) commented:
nit: While we align functionality, would it make sense to also return {worker address: "OK" | "no nanny" | "timed out" | error message} here?

@crusaderky (Collaborator, Author) replied:

It makes sense, but I'd rather treat it as material for a follow-up PR.
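
Purely illustrative of the shape being proposed (not implemented in this PR; the addresses are hypothetical, and in this PR Scheduler.restart still returns None):

    # Hypothetical per-worker status mapping that Scheduler.restart could
    # return in a follow-up PR.
    results: dict[str, str] = {
        "tcp://10.0.0.1:40000": "OK",
        "tcp://10.0.0.2:40000": "no nanny",
        "tcp://10.0.0.3:40000": "timed out",
    }
    failed = [addr for addr, status in results.items() if status != "OK"]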

Resolved (outdated) review thread: distributed/scheduler.py
@crusaderky merged commit 84213ac into dask:main on Mar 21, 2024 (20 of 36 checks passed)
@crusaderky deleted the restart_refactor branch March 21, 2024 17:30