Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close workers if an exception occurs during transition #4735

Closed
wants to merge 1 commit into from

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Apr 22, 2021

I'm still not sure why things like #4721 do not pop up in our tests. These things are merely logged but we somehow loose the exception. Maybe that's because workers restart or other workers pick up and we're lucky. Anyhow, I realised that we should probably not even try to recover anything in these cases and just draw a very hard line. Most of our transition methods are not atomic and if something during the transaction breaks, we cannot really guarantee the validity of the state. Easiest way to start from scratch is to close gracefully in these situations.

Opinions?

cc @gforsyth

xref #4413

@gforsyth
Copy link
Contributor

During worker refactoring, I've just stared at the stdout of the tests to catch transition errors, or inserted breakpoints into troublesome spots (I mention this only in support of tests catching these errors because the manual method is obviously prone to failures).

My assumption was that the nanny was restarting worker processes that errored out and that allowed tests to succeed even if 20% of the time a given task entered an invalid state.

Maybe we should parametrize some of the worker tests to explicitly disable the nanny?

@fjetter
Copy link
Member Author

fjetter commented Apr 22, 2021

Maybe we should parametrize some of the worker tests to explicitly disable the nanny?

Everything using gen_cluster by default isn't using Nanny since it is much easier to work with it all if there is no extra process in the way. Default is also False for the cluster ctx manager and these two are mostly used for tests. If anyhing, I would've argued that we're not covering Nanny enough :)

So far, I do not believe that corrupt workers are actually closing. Maybe there are code paths where this is is true but I am not aware of any. I think there must be another reason why we do not spot this. My suspicion is at the moment that we're abusing loop.add_callback too often without handling exception properly, see also #4734


Regardless of why tests do not catch this properly, I believe closing like this is also a valid behaviour in real world situations to avoid deadlocks since the "drop exception (almost) silently" just masks the problem. As I stated above, I think we can no longer guarantee a valid state if a transition fails and must therefore close (or otherwise reconstruct the proper state but that's also just prone to failure)

@fjetter
Copy link
Member Author

fjetter commented Apr 29, 2022

Was implemented in #6210

@fjetter fjetter closed this Apr 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants