fix(servstate): use WaitDelay to avoid Command.Wait blocking on stdin/out/err #275
Add WaitDelay to ensure cmd.Wait() returns in a reasonable timeframe if the goroutines that cmd.Start() uses to copy Stdin/Stdout/Stderr are blocked when copying due to a sub-subprocess holding onto them.

Read more details in these issues:
- golang/go#23019
- golang/go#50436

This isn't the original intent of kill-delay, but it seems reasonable to reuse it in this context.

Fixes canonical#149
It's unlikely to be needed here, but it won't hurt either. Use a smaller timeout (1s) as these are intended for short-running commands.
Our style is to start with an uppercase letter (for logs, not errors) and to use "Cannot X" rather than "Error Xing".
We might need special handling of exec.ErrWaitDelay
@hpidcock No, I don't think we need to handle
Also factor out setupEmptyServiceManager as the same code is used in 7 places.
Looks great. I have one concern about reusing killDelay; please check whether my understanding makes sense.
There is one subtlety which this new feature may create in the future. Since a service restart was never able to restart child processes that promoted themselves outside the process group (or session), in the past that may have forced people to somehow find ways to kill everything or restart. What we are now enabling is leaving a child behind running, and closing the file descriptors previously used to capture its output (is my understanding correct?). If this is true, a service restart will cause the new primary process to log output without the stdout of the previous children included. This assumes the service knows how to deal with reusing the running child; otherwise it may even start another one.
@flotter Yeah, that understanding is correct. However, I think the behaviour in this PR is strictly better than what we have now, which is causing a deadlock, and causing the charm (or whatever's running Pebble) to not see that the service stopped at all. People who were working around this may have to change (or remove!) their workaround, but I think that's okay. Also, I think it was only the data team using Patroni that this affected (at least, that we knew about) -- and they know about this fix.
Adding my 2 cents
Use os/exec's Cmd.WaitDelay to ensure cmd.Wait() returns in a reasonable timeframe if the goroutines that cmd.Start() uses to copy stdin/out/err are blocked when copying due to a sub-subprocess holding onto them. Read more details about the issue in golang/go#23019 and the proposed solution (that was added in Go 1.20) in golang/go#50436.

This solves issue 149, where Patroni wasn't restarting properly even after a KILL signal was sent to it. I had originally mis-diagnosed this problem as an issue with Pebble not tracking the process tree of processes that daemonise and change their process group (which is still an issue, but is not causing this problem). The Patroni process wasn't being marked as finished at all because it was blocked on cmd.Wait(). Patroni starts sub-processes and "forwards" stdin/out/err, so the copy goroutines block. Thankfully Go 1.20 introduced WaitDelay to allow you to easily work around this exact problem.

The fix itself is this one-liner:
This will really only be a problem for services, but we make the same change for exec and exec health checks as it won't hurt there either.
As a drive-by, this PR also canonicalises some log messages: our style is to start with an uppercase letter (for logs, not errors) and to use "Cannot X" rather than "Error Xing".
Fixes #149.