Close worker when we discover an internal error #6206
Comments
+1 I like failing hard and loud. I think that even after stabilizing, this is a good way to go. If an internal exception occurs, we do not have any way to resolve it other than restarting the worker. We should make sure to leave enough logging information, if possible.
Yeah, I'll probably send a packet of information to the scheduler using the eventing system, like what we do with invalid transitions and task states. We should probably consider shipping those to the client by default as well.
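A minimal sketch of what shipping that packet could look like with the existing event machinery (assuming `Worker.log_event` / `Client.get_events` as the hooks); the topic name, payload fields, and helper function are illustrative assumptions, not the actual implementation:

```python
from distributed import Worker


def report_internal_error(worker: Worker, exc: BaseException) -> None:
    """Hypothetical helper: record an internal error on the scheduler's event log.

    ``Worker.log_event`` forwards the record to the scheduler, so the context
    survives the worker being closed. The topic and payload shape here are
    assumptions for illustration.
    """
    worker.log_event(
        "worker-internal-error",
        {
            "worker": worker.address,
            "error": repr(exc),
        },
    )


# A client could then inspect the records after the fact, e.g.:
#     client.get_events("worker-internal-error")
```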
POC raised here for broad conceptual feedback: #6210
xref #6201 for why this is a little tricky, and why it isn't happening already. It's awkward that:
But getting closer to the more structured behavior is still a good step.
I agree that this is not optimal. I also think that it's fine, though. I don't think it's critical that we'll be leaking running coroutines. My guess is that calling worker.close will cause the worker to shut down fairly reliably.
Yeah, to be clear, I don't mean to say "this will solve all of our problems all of the time". I'm saying "this seems like a really easy and possibly large win for us". I'm happy that folks are thinking long-term, but I also want to make sure that we get to a happy place relatively quickly.
There have been a few instances where a worker's internal state has been discovered to be inconsistent. This might be a validation error, an invalid transition, or some other internal error. We should resolve these and determine what is going on. In the meantime, it's also critical that we provide a stable experience to users. These situations are rare enough that it probably makes sense to just close down the worker and have it be restarted, and hope that the scheduler logic (which appears to be in a better state today) cleans things up properly.
This doesn't solve the underlying problem, but does give us a little space and gives users a better experience while we discover and resolve that problem.
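As a rough sketch of the proposed behavior (the actual POC is in #6210): wrap the worker's internal state handling in a guard that logs what we know and then closes the worker, so the nanny or deployment system can restart it. The handler name, exception handling, and wiring below are assumptions for illustration only, not the implementation in the PR:

```python
import logging

from distributed import Worker

logger = logging.getLogger("distributed.worker")


async def shut_down_on_internal_error(worker: Worker, exc: Exception) -> None:
    """Hypothetical handler for "the worker state is inconsistent".

    Fail loudly: record the error locally and on the scheduler, then close the
    worker rather than limping along with corrupt state. The expectation is
    that the nanny or cluster manager restarts the worker, and the scheduler
    reschedules the lost tasks elsewhere.
    """
    logger.exception("Internal worker error, closing worker", exc_info=exc)
    worker.log_event(
        "worker-internal-error",
        {"worker": worker.address, "error": repr(exc)},
    )
    await worker.close()


# Usage sketch: guard the state-machine entry points so any unexpected error
# triggers the shutdown path instead of leaving the worker half-broken, e.g.:
#     try:
#         worker.validate_state()
#     except Exception as exc:
#         await shut_down_on_internal_error(worker, exc)
```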