-
Notifications
You must be signed in to change notification settings - Fork 349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gracefully handle TERM signals #206
Conversation
} | ||
|
||
// Exit cleanly if there are no active connections when we exit | ||
if proxyClient.ConnectionsCounter == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this and the read in the for loop use atomic.LoadUint64() instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a good idea to use atomics on a field in a type from another package - it's breaking encapsulation. I would rather an exported method on type Client which appropriately deals with locking/atomics/whatever so that users of the type can't create race conditions.
Second thought: maybe the entire shutdown sequence should just be a method on type Client. E.g., Client.Shutdown(termTimeout time.Duration) error
, which returns an error if there were still connections in use after the timeout. That way the updating of MaxConnections and the reading of ConnectionsCounter can be done in a way that users of the type don't quite need to understand on their own. Plus, this sort of shutdown could be useful to those using this code as a Go library, so we should expose this functionality there.
if proxyClient.ConnectionsCounter == 0 { | ||
os.Exit(0) | ||
} else { | ||
os.Exit(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There need not be an else block. This exit can be on it's own.
@@ -76,6 +78,7 @@ can be removed automatically by this program.`) | |||
// Settings for limits | |||
maxConnections = flag.Uint64("max_connections", 0, `If provided, the maximum number of connections to establish before refusing new connections. Defaults to 0 (no limit)`) | |||
fdRlimit = flag.Uint64("fd_rlimit", limits.ExpectedFDs, `Sets the rlimit on the number of open file descriptors for the proxy to the provided value. If set to zero, disables attempts to set the rlimit. Defaults to a value which can support 4K connections to one instance`) | |||
termTimeout = flag.Duration("term_timeout", 0, "When set, the proxy will stop accepting new connections and wait for existing connections to close before terminating. Any connections that haven't closed after the timeout will be dropped") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you prevent the proxy from accepting new connections once a sigterm is received? I don't see any code for that. Am I missing something?
As an aside, I wanted to mention that I prefer this implementation having a flag for how long to wait, as opposed to what unicorn does, where it has several different signals, SIGINT/SIGTERM kill it immediately, while SIGQUIT is a graceful shutdown that stops accepting connections, and lets the server drain before terminating. What I've found with unicorn is that you can end up writing awkward scripts to call SIGQUIT and have a timeout to send a SIGTERM if it hasn't stopped in a reasonable amount of time. And in my experience, for deployment I generally know my use case pretty well: I will always be killing immediately or doing graceful shutdowns, but rarely do I have a server I want both behaviors for. |
} | ||
|
||
signals := make(chan os.Signal, 1) | ||
signal.Notify(signals, syscall.SIGTERM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider also listening for SIGINT, that way this will also work when run interactively and the user hits ctrl-c
|
||
go func() { | ||
<-signals | ||
logging.Infof("Received TERM signal. Waiting up to %s seconds before terminating.", *termTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
echoing easwars comment, it looks like you want to have a line like proxyClient.MaxConnections = 0
here in order to stop accepting new connections
I believe that covers all of the comments except for @Carrotman42's suggestion on moving everything in to the client. If you decide that all of this belongs in the client, I think I'll have some time the week after next to work on it. My 2 cents: I agree that this code belongs in client from an encapsulation perspective. I've never used the proxy as a client library so I can't comment on the usefulness of a |
Yeah, it was a mistake to have that field exported. I won't suggest deleting it in this PR or anything, but that doesn't preclude you from adding good quality methods for safely accessing the fields and using them in this PR to prevent things from getting worse. One thing that you have added in this PR, though, is modifying the MaxConnections variable from a different goroutine, whereas previously the value never changed during execution (because it is concurrently read by other goroutines). Currently this is a race condition because you're not using atomics; but I highly suggest we don't use atomics on the caller side to do that though because forcing atomics on users of the Client type makes it harder to use correctly. That is not a good long term plan and I would very much like to push against that. |
Friendly ping for another review on this. |
proxy/proxy/client.go
Outdated
return nil | ||
} | ||
|
||
return errors.New("Active connections still exist") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error strings should start with lowercase letters (as per https://github.com/golang/go/wiki/CodeReviewComments#error-strings). Also, I think we might as well include the number of existing connections to help people debug any problems.
Something like:
active := atomic.LoadUint64(&c.ConnectionsCounter)
if active == 0 {
return nil
}
return fmt.Errorf("%d active connections still exist after waiting for %v", active, termTimeout)
proxy/proxy/client_test.go
Outdated
}() | ||
|
||
time.Sleep(100 * time.Millisecond) | ||
if !<-shutdown { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will just block forever if Shutdown
blocks forever. The test will eventually time out but that's not a great experience IMO. wdyt about exiting early if it takes too long? Something like:
shutdownFinished := false
select{
case <-time.After(100*time.Milliseconds):
case shutdownFinished = <-shutdown:
}
if !shutdownFinished {
t.Errorf("shutdown should have completed quickly because there are no active connections")
}
proxy/proxy/client.go
Outdated
conn.Conn.Close() | ||
return | ||
} | ||
if active > c.MaxConnections && c.MaxConnections > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are storing to MaxConnections using the atomic package, we need to use the atomic package to read the variable or else it's not guaranteed to be data-race safe. Also, a little nit-picky, but I suggest reversing the order of the checks so that the MaxConnections is checked for nonzero before checking the other comparison. Something like:
if max := atomic.LoadUint64(&c.MaxConnections); max > 0 && active > max {
} | ||
|
||
signals := make(chan os.Signal, 1) | ||
signal.Notify(signals, syscall.SIGTERM, syscall.SIGINT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need SIGTERM too, or is SIGINT just enough? I believe the more platform-independent way is to not use the syscall package but instead lean on the os package's os.Interrupt
(https://golang.org/pkg/os/#Signal) which is equal to syscall.SIGINT.
I checked the windows versions and syscall.SIGINT
and syscall.SIGTERM
both exist, but who knows what SIGTERM means there and in other places. os.Interrupt == syscall.SIGINT
for all platforms except plan9 anyway (https://golang.org/src/os/exec_posix.go?h=Interrupt#L18) so if SIGINT is enough I say we stick with it by using os.Interrupt
here; if you know you (or others) will want SIGTERM for sure then I think it's OK to commit and fix things later if it causes a real problem for anyone.
|
||
go func() { | ||
<-signals | ||
logging.Infof("Received TERM signal. Waiting up to %s before terminating.", *termTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we include the information that we are no longer accepting new connections? That way people can correlate logs in their application about not being able to connect with this log indicating that they should expect new connections to fail.
proxy/proxy/client.go
Outdated
// whole length of the timeout. | ||
func (c *Client) Shutdown(termTimeout time.Duration) error { | ||
// Don't allow new connections while terminating. | ||
atomic.StoreUint64(&c.MaxConnections, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait but... MaxConnections == 0
implies there should be no limit on the number of connections, so this doesn't actually stop new connections. We can't use -1 because the type is Uint :( Maybe we should create a private sentinel value of the max uint64 value and use that to indicate "block new connections"; I doubt someone is going to actually set the max number of connections to a value that high so I don't think it'll cause any real problems. (famous last words...)
return | ||
} | ||
if active > c.MaxConnections && c.MaxConnections > 0 { | ||
logging.Errorf("too many open connections (max %d)", c.MaxConnections) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a different message in the case that we blocked the connection because we are shutting down? Otherwise these logs during shutdown might be a little confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved because the "block connections" feature was dropped.
Thanks for making the requested changes! Sorry for the delay, I was on my honeymoon :) |
Why does it make sense to stop accepting new connections after SIGTERM? We don't know what the client process will do while it's shutting down - it might be trying to establish a connection to the database? |
@Carrotman42 ready for another review.
|
proxy/proxy/client.go
Outdated
conn.Conn.Close() | ||
return | ||
} | ||
if active > c.MaxConnections && c.MaxConnections > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: order of operation for these booleans:
if c.MaxConnections > 0 && active > c.MaxConnections {
Hey, This doesn't work for us on kubernetes, as our applications gracefully shutting down may well create new connections to do so (we have some that do some DB flushing for example). Could anyone advise if or how this is going to be addressed? |
@Stono - Are you using the |
This PR supersedes #178 and takes in to account (most) of @hfwang's comments. By default, the proxy will automatically terminate when receiving a TERM signal (just like it currently does). People can also specify a timeout (
-term_timeout=30s
) to let the proxy wait for up to a certain amount of time before exiting.If there are no active connections, the proxy will exit early (with a status code of 0). If it waits for the entire timeout period and there are still active connections, it will exit with a status code of 2. I picked 2 because I was seeing exit codes of 2 with the existing version but that could be because I have 2 open connections per proxy.