Remove worker reconnect #6361
Conversation
Unit Test Results: 15 files ±0, 15 suites ±0, 7h 7m 11s ⏱️ -5m 36s. For more details on these failures, see this check. Results for commit 305a23f. ± Comparison against base commit ff94776. ♻️ This comment has been updated with latest results.
Force-pushed from 3ce0c96 to c708ef0 (Compare)
The Nanny can still restart the worker process.
…mid_compute_multiple_states_on_scheduler
…sk_memory_with_resources`
Force-pushed from 821b99e to 7c769cd (Compare)
Everything is green except one flaky
That feels flaky to me, but I guess it's possible this is related?
@@ -5058,43 +5026,21 @@ async def gather(self, keys, serializers=None):
             )
             result = {"status": "error", "keys": missing_keys}
         with log_errors():
-            # Remove suspicious workers from the scheduler but allow them to
-            # reconnect.
+            # Remove suspicious workers from the scheduler and shut them down.
             await asyncio.gather(
                 *(
                     self.remove_worker(
This is maybe heavy-handed. `retire_workers` would probably be less disruptive, but slower and a little more complex. I don't love losing all the other keys on a worker just because it's missing one.
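For illustration, a small, runnable local-cluster example of the distinction being drawn here (this is not code from the PR; the cluster setup and names are assumptions). `retire_workers` first replicates a worker's unique in-memory results to the surviving workers before closing it, whereas removing a worker outright also loses every key that worker held:

```python
from dask.distributed import Client, LocalCluster

# Minimal sketch: two in-process workers holding some results.
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

futures = client.map(lambda x: x + 1, range(10))
client.gather(futures)

print(client.has_what())  # keys are spread across both workers

# Gentle path: the victim's keys are moved elsewhere before it closes.
victim = list(cluster.workers.values())[0].address
client.retire_workers(workers=[victim])

print(client.has_what())  # the victim's keys now live on the remaining worker

client.close()
cluster.close()
```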
I don't mind being heavy-handed, especially if things are in a wonky state.
Incomplete review, but I'm stuck in meetings for the next bit
In general I'm ok with what's here.
One thing we discussed offline was whether the Nanny should restart the worker in cases when the connection to the scheduler is broken unexpectedly. The answer is probably yes, but as this PR stands, `Worker.close` will always tell the Nanny it's closing gracefully, preventing restart:
distributed/distributed/worker.py
Lines 1475 to 1477 in ff94776

        if nanny and self.nanny:
            with self.rpc(self.nanny) as r:
                await r.close_gracefully()
As a follow-up PR, we should add an option to `Worker.close` to control this behavior.
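A minimal, self-contained sketch of that follow-up idea, assuming a hypothetical `suppress_restart` flag; the class, the RPC stub, and the parameter name are invented here for illustration and are not distributed's actual API:

```python
import asyncio


class FakeNannyRPC:
    # Stand-in for the worker's RPC handle to its Nanny.
    async def close_gracefully(self):
        print("nanny told: graceful close, worker will not be restarted")


class SketchWorker:
    def __init__(self, nanny_rpc=None):
        self.nanny = nanny_rpc

    async def close(self, nanny: bool = True, *, suppress_restart: bool = True):
        # Today (per the snippet above) the Nanny is always told the close is
        # graceful. The hypothetical flag skips that call when the scheduler
        # connection broke unexpectedly, leaving the Nanny free to restart
        # the worker process.
        if nanny and self.nanny and suppress_restart:
            await self.nanny.close_gracefully()
        # ... release data, stop listeners, etc. ...


async def main():
    worker = SketchWorker(FakeNannyRPC())
    await worker.close()                        # intentional shutdown: no restart
    await worker.close(suppress_restart=False)  # broken connection: Nanny may restart

asyncio.run(main())
```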
I may not be understanding the statement above, but we already have the …
Also make worker log whatever error the scheduler gives it upon connection
Great point, I just missed that. Yes, we can set …
@mrocklin actually, I think maybe we should do this in a follow-up PR. It's pretty straightforward, but involves a little bit of thinking and a few more tests (the nanny is going to try to report the worker closure to the scheduler at the same time the worker's connection is breaking because it is shutting down). Easy enough, but I'd rather focus on the core change here and not increase the scope.
`reconnect=True` (previous default) is now the only option. This is not a necessary change to make; it just simplifies things not to have it. See discussion in dask#6361 (comment).
Unnecessary after dask#6361 is merged
was leaking file descriptors, and needed an event loop
Failures (both seem flaky):
Thanks @gjoseph92. This is in.
When a worker disconnects from the scheduler, close it immediately instead of trying to reconnect.
Also prohibit workers from joining if they have data in memory, as an alternative to #6341.
Just seeing what CI does for now, to figure out which tests to change/remove.
Closes #6350
`pre-commit run --all-files`
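For illustration, a hedged, self-contained sketch of the second behaviour described above: a scheduler that no longer supports reconnection could refuse registration from any worker that already holds keys in memory, and the worker would then log that error and close (see the commit above about logging the scheduler's error). The function name and message below are invented for this sketch and are not the PR's actual code:

```python
from typing import Iterable


def validate_joining_worker(address: str, keys_in_memory: Iterable[str]) -> dict:
    """Return the handshake response a scheduler might send to a joining worker."""
    keys = list(keys_in_memory)
    if keys:
        # Without reconnection support, the scheduler has no record of these
        # keys and cannot trust them, so the worker is turned away.
        return {
            "status": "error",
            "message": f"Worker {address} holds {len(keys)} keys in memory; "
            "workers with data may not join the scheduler.",
        }
    return {"status": "OK"}


print(validate_joining_worker("tcp://10.0.0.5:40331", []))       # accepted
print(validate_joining_worker("tcp://10.0.0.5:40331", ["x-1"]))  # rejected
```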