Better logging for worker removal #8517
Conversation
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files +4   27 suites +4   9h 58m 0s ⏱️ +2h 18m 2s

For more details on these failures, see this check.

Results for commit 7db2c8b. ± Comparison against base commit cbf939c.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.

♻️ This comment has been updated with latest results.
await f
self.sync(_)
This whole block is redundant with the same inside Client._close.
This changes the close() and __exit__() behaviour of a synchronous Client that started its own LocalCluster (e.g. client = Client()), so that the client first closes itself, thus releasing all tasks on the scheduler, and only afterwards closes the cluster, instead of the other way around. Note that asynchronous clients already behave this way.

Without this, as the workers close, the scheduler would observe the sudden loss of all tasks the client had in who_wants.
See test: test_quiet_client_close
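For illustration, a minimal sketch of that ordering (not the actual Client.close() implementation; it assumes a synchronous client that implicitly started a LocalCluster):

from distributed import Client

client = Client()            # implicitly starts a LocalCluster
cluster = client.cluster     # the cluster this client started

client.close()               # first: the client releases its tasks (who_wants) on the scheduler
if cluster is not None:
    cluster.close()          # only afterwards: the cluster's workers and scheduler shut down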
@@ -7222,7 +7260,7 @@ async def _track_retire_worker(
     close: bool,
     remove: bool,
     stimulus_id: str,
-) -> tuple[str | None, dict]:
+) -> tuple[str, Literal["OK", "no-recipients"], dict]:
Could have used True/False, but this is IMHO more readable.
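For illustration only, a sketch of that readability argument using a hypothetical helper, not the actual _track_retire_worker code:

from typing import Literal

def retire_status(has_recipients: bool) -> Literal["OK", "no-recipients"]:
    # a named status string documents itself at the call site, unlike a bare bool
    return "OK" if has_recipients else "no-recipients"

status = retire_status(False)
if status == "no-recipients":
    print("worker had nowhere to offload its data")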
Thanks, @crusaderky! Consider the nits non-blocking.
"action": "remove-worker", | ||
"processing-tasks": processing_keys, | ||
"lost-computed-tasks": recompute_keys, | ||
"lost-scattered-tasks": lost_keys, |
Should we also log the erred tasks here?
You don't lose the exception of erred tasks when you lose the workers where they ran.
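For example (illustrative only; boom is a hypothetical task and a running cluster is assumed):

from distributed import Client

def boom(x):
    raise ValueError(x)

client = Client()
fut = client.submit(boom, 1)
# The scheduler keeps the task's error state, so even after the worker that
# ran boom is retired, fut.exception() still returns the ValueError without
# any recomputation.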
distributed/tests/test_client.py (Outdated)
@@ -5196,7 +5196,7 @@ def test_quiet_client_close(loop):
         threads_per_worker=4,
     ) as c:
         futures = c.map(slowinc, range(1000), delay=0.01)
-        sleep(0.200)  # stop part-way
+        sleep(0.2)  # stop part-way
I know that this is only cosmetic, but is there a better stop condition here? E.g., n tasks being in memory already? Consider this non-blocking.
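One possible shape for such a stop condition (a sketch only; it uses distributed.wait with FIRST_COMPLETED instead of the fixed sleep):

from distributed import wait

futures = c.map(slowinc, range(1000), delay=0.01)
# proceed once at least one task has finished, rather than sleeping a fixed 0.2 s
wait(futures, return_when="FIRST_COMPLETED")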
Done
It looks like the rather verbose
Improve all regular logger output as well as Scheduler.events around worker shutdown. Scheduler.events now gives a more complete picture of graceful worker retirement as well as worker failure.

@jrbourbeau best effort for 2024.2.1
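A hedged example of how one might inspect the enriched events from a client (the topic name and payload fields below are illustrative, not guaranteed by this PR):

from distributed import Client

client = Client()
events = client.get_events()                 # mirrors Scheduler.events, keyed by topic
for timestamp, event in events.get("all", ()):
    if isinstance(event, dict) and event.get("action") == "remove-worker":
        print(event)                         # e.g. processing-tasks, lost-computed-tasks, lost-scattered-tasks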