-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor restart() and restart_workers() #8550
Conversation
@@ -851,6 +851,7 @@ async def kill( | |||
assert self.status in ( | |||
Status.running, | |||
Status.failed, # process failed to start, but hasn't been joined yet | |||
Status.closing_gracefully, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This covers a race condition where Worker.close(nanny=True)
ran successfully up until and including the message to the nanny, and then hanged or crashed - for example because of a bad plugin or extension teardown.
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 29 files ± 0 29 suites ±0 11h 28m 31s ⏱️ + 16m 19s For more details on these failures, see this check. Results for commit 3d3140f. ± Comparison against base commit 8927bfd. This pull request removes 4 and adds 5 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
6f461dd
to
dd738bd
Compare
timeout = parse_timedelta(timeout, "s") | ||
timeout = parse_timedelta(timeout, "s") | ||
if timeout is None: | ||
raise ValueError("None is an invalid value for global client timeout") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are so many places in the client code that just blindly assume it's not None. The json schema does not allow for None either.
44e66f2
to
132c8bb
Compare
All test failures are unrelated |
132c8bb
to
7680b85
Compare
7680b85
to
8dd42e1
Compare
3f50a2c
to
30dab21
Compare
0b49fbd
to
3010f86
Compare
I suppose another realistic scenario would be a task running on the worker that repeatedly blocks the GIL for an extended period of time? Granted, there's probably not much added value to a simple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @crusaderky! I love to see their behavior finally aligned. A few nits around code style/typing; feel free to ignore.
raise_for_error: bool = True, | ||
) -> dict[str, str]: | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
): | |
) -> dict[str, Literal["OK" , "removed", "timed out"]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. We return Awaitable[dict] | dict
depending if self.sync is true or false, and functions should never return Union (python/typing#566)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thanks for the clarification.
timeout: float = 30, | ||
wait_for_workers: bool = True, | ||
stimulus_id: str, | ||
) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: While we align functionality, would it make sense to return {worker address: "OK", "no nanny", or "timed out" or error message}
here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it makes sense but I'd rather treat it as material for a follow-up PR
3010f86
to
0395cfd
Compare
Co-authored-by: Hendrik Makait <[email protected]>
7ad6443
to
3d3140f
Compare
Refactor
Client.restart()
andClient.restart_workers()
to use the same machinery for worker restart.This is propaedeutic to #8537, where the scheduler will need to call restart_workers() internally.
Changed behaviour
restart()
will no longer return self. It's weird and it was never advertised in the documentation anyway.restart_workers()
, the scheduler will now unilaterally remove workers where the nanny failed to terminate the worker within timeout seconds. This aligns its behaviour torestart()
.restart_workers()
has changed from infinity todistributed.comm.timeouts.connect
* 4 (2 minutes by default). This aligns its behaviour torestart()
. The method no longer accepts timeout=None.restart_workers()
no longer forwards the exception to the client if the worker fails to restart. While this is, strictly speaking, a loss of functionality, IMHO I can think of very few of use cases when this can happen, none of which sound particularly interesting to me. Note that this is specifically a use case where the nanny is healthy and in the middle of an RPC call with the scheduler, while the worker is so busted that it fails to restart within the timeout. The only semi-realistic use case I can think of is a WorkerPlugin that fails to acquire an external resource (e.g. connect to a database) on startup.