Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully handle TERM signals #206

Merged
merged 9 commits into from
Oct 9, 2018
Merged

Gracefully handle TERM signals #206

merged 9 commits into from
Oct 9, 2018

Conversation

supersam654
Copy link
Contributor

This PR supersedes #178 and takes in to account (most) of @hfwang's comments. By default, the proxy will automatically terminate when receiving a TERM signal (just like it currently does). People can also specify a timeout (-term_timeout=30s) to let the proxy wait for up to a certain amount of time before exiting.

If there are no active connections, the proxy will exit early (with a status code of 0). If it waits for the entire timeout period and there are still active connections, it will exit with a status code of 2. I picked 2 because I was seeing exit codes of 2 with the existing version but that could be because I have 2 open connections per proxy.

@kurtisvg kurtisvg requested a review from hfwang September 6, 2018 19:36
@kurtisvg kurtisvg added Status: Review Needed type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Sep 6, 2018
}

// Exit cleanly if there are no active connections when we exit
if proxyClient.ConnectionsCounter == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this and the read in the for loop use atomic.LoadUint64() instead?

Copy link
Contributor

@Carrotman42 Carrotman42 Sep 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a good idea to use atomics on a field in a type from another package - it's breaking encapsulation. I would rather an exported method on type Client which appropriately deals with locking/atomics/whatever so that users of the type can't create race conditions.

Second thought: maybe the entire shutdown sequence should just be a method on type Client. E.g., Client.Shutdown(termTimeout time.Duration) error, which returns an error if there were still connections in use after the timeout. That way the updating of MaxConnections and the reading of ConnectionsCounter can be done in a way that users of the type don't quite need to understand on their own. Plus, this sort of shutdown could be useful to those using this code as a Go library, so we should expose this functionality there.

if proxyClient.ConnectionsCounter == 0 {
os.Exit(0)
} else {
os.Exit(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There need not be an else block. This exit can be on it's own.

@@ -76,6 +78,7 @@ can be removed automatically by this program.`)
// Settings for limits
maxConnections = flag.Uint64("max_connections", 0, `If provided, the maximum number of connections to establish before refusing new connections. Defaults to 0 (no limit)`)
fdRlimit = flag.Uint64("fd_rlimit", limits.ExpectedFDs, `Sets the rlimit on the number of open file descriptors for the proxy to the provided value. If set to zero, disables attempts to set the rlimit. Defaults to a value which can support 4K connections to one instance`)
termTimeout = flag.Duration("term_timeout", 0, "When set, the proxy will stop accepting new connections and wait for existing connections to close before terminating. Any connections that haven't closed after the timeout will be dropped")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you prevent the proxy from accepting new connections once a sigterm is received? I don't see any code for that. Am I missing something?

@hfwang
Copy link
Contributor

hfwang commented Sep 7, 2018

As an aside, I wanted to mention that I prefer this implementation having a flag for how long to wait, as opposed to what unicorn does, where it has several different signals, SIGINT/SIGTERM kill it immediately, while SIGQUIT is a graceful shutdown that stops accepting connections, and lets the server drain before terminating.

What I've found with unicorn is that you can end up writing awkward scripts to call SIGQUIT and have a timeout to send a SIGTERM if it hasn't stopped in a reasonable amount of time. And in my experience, for deployment I generally know my use case pretty well: I will always be killing immediately or doing graceful shutdowns, but rarely do I have a server I want both behaviors for.

}

signals := make(chan os.Signal, 1)
signal.Notify(signals, syscall.SIGTERM)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider also listening for SIGINT, that way this will also work when run interactively and the user hits ctrl-c


go func() {
<-signals
logging.Infof("Received TERM signal. Waiting up to %s seconds before terminating.", *termTimeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

echoing easwars comment, it looks like you want to have a line like proxyClient.MaxConnections = 0 here in order to stop accepting new connections

@supersam654
Copy link
Contributor Author

I believe that covers all of the comments except for @Carrotman42's suggestion on moving everything in to the client. If you decide that all of this belongs in the client, I think I'll have some time the week after next to work on it.

My 2 cents: I agree that this code belongs in client from an encapsulation perspective. I've never used the proxy as a client library so I can't comment on the usefulness of a client.Shutdown method. However, ConnectionsCounter is already exported field so ensuring that other people don't make the same mistake I did would require a breaking change.

@Carrotman42
Copy link
Contributor

Yeah, it was a mistake to have that field exported. I won't suggest deleting it in this PR or anything, but that doesn't preclude you from adding good quality methods for safely accessing the fields and using them in this PR to prevent things from getting worse.

One thing that you have added in this PR, though, is modifying the MaxConnections variable from a different goroutine, whereas previously the value never changed during execution (because it is concurrently read by other goroutines). Currently this is a race condition because you're not using atomics; but I highly suggest we don't use atomics on the caller side to do that though because forcing atomics on users of the Client type makes it harder to use correctly. That is not a good long term plan and I would very much like to push against that.

@supersam654
Copy link
Contributor Author

Friendly ping for another review on this.

return nil
}

return errors.New("Active connections still exist")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error strings should start with lowercase letters (as per https://github.com/golang/go/wiki/CodeReviewComments#error-strings). Also, I think we might as well include the number of existing connections to help people debug any problems.

Something like:

active := atomic.LoadUint64(&c.ConnectionsCounter)
if active == 0 {
   return nil
}
return fmt.Errorf("%d active connections still exist after waiting for %v", active, termTimeout)

}()

time.Sleep(100 * time.Millisecond)
if !<-shutdown {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just block forever if Shutdown blocks forever. The test will eventually time out but that's not a great experience IMO. wdyt about exiting early if it takes too long? Something like:

shutdownFinished := false
select{
case <-time.After(100*time.Milliseconds):
case shutdownFinished = <-shutdown:
}
if !shutdownFinished {
    t.Errorf("shutdown should have completed quickly because there are no active connections")
}

conn.Conn.Close()
return
}
if active > c.MaxConnections && c.MaxConnections > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are storing to MaxConnections using the atomic package, we need to use the atomic package to read the variable or else it's not guaranteed to be data-race safe. Also, a little nit-picky, but I suggest reversing the order of the checks so that the MaxConnections is checked for nonzero before checking the other comparison. Something like:

if max := atomic.LoadUint64(&c.MaxConnections); max > 0 && active > max {

}

signals := make(chan os.Signal, 1)
signal.Notify(signals, syscall.SIGTERM, syscall.SIGINT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need SIGTERM too, or is SIGINT just enough? I believe the more platform-independent way is to not use the syscall package but instead lean on the os package's os.Interrupt (https://golang.org/pkg/os/#Signal) which is equal to syscall.SIGINT.

I checked the windows versions and syscall.SIGINT and syscall.SIGTERM both exist, but who knows what SIGTERM means there and in other places. os.Interrupt == syscall.SIGINT for all platforms except plan9 anyway (https://golang.org/src/os/exec_posix.go?h=Interrupt#L18) so if SIGINT is enough I say we stick with it by using os.Interrupt here; if you know you (or others) will want SIGTERM for sure then I think it's OK to commit and fix things later if it causes a real problem for anyone.


go func() {
<-signals
logging.Infof("Received TERM signal. Waiting up to %s before terminating.", *termTimeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include the information that we are no longer accepting new connections? That way people can correlate logs in their application about not being able to connect with this log indicating that they should expect new connections to fail.

// whole length of the timeout.
func (c *Client) Shutdown(termTimeout time.Duration) error {
// Don't allow new connections while terminating.
atomic.StoreUint64(&c.MaxConnections, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait but... MaxConnections == 0 implies there should be no limit on the number of connections, so this doesn't actually stop new connections. We can't use -1 because the type is Uint :( Maybe we should create a private sentinel value of the max uint64 value and use that to indicate "block new connections"; I doubt someone is going to actually set the max number of connections to a value that high so I don't think it'll cause any real problems. (famous last words...)

return
}
if active > c.MaxConnections && c.MaxConnections > 0 {
logging.Errorf("too many open connections (max %d)", c.MaxConnections)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a different message in the case that we blocked the connection because we are shutting down? Otherwise these logs during shutdown might be a little confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved because the "block connections" feature was dropped.

@Carrotman42
Copy link
Contributor

Thanks for making the requested changes! Sorry for the delay, I was on my honeymoon :)

@jacobat
Copy link

jacobat commented Sep 25, 2018

Why does it make sense to stop accepting new connections after SIGTERM? We don't know what the client process will do while it's shutting down - it might be trying to establish a connection to the database?

@supersam654
Copy link
Contributor Author

@Carrotman42 ready for another review.

  • I made the test and error message changes
  • The original purpose of this whole PR was to make the proxy gracefully handle the SIGTERM signal so I absolutely want to leave that in :)
  • I completely removed any attempt at preventing new connections after the proxy starts shutting down. There doesn't appear to be a consensus that that's even the right direction to go in and I feel like the effort to make it happen is beyond the scope of this original PR. I believe the PR (without anything to do with blocking connections) stands on its own and people should discuss and implement a separate PR if they want to block connections when shutting down.

conn.Conn.Close()
return
}
if active > c.MaxConnections && c.MaxConnections > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: order of operation for these booleans:

if c.MaxConnections > 0 && active > c.MaxConnections {

@Stono
Copy link

Stono commented Oct 16, 2019

Hey,
Sorry to comment on a merged request, but we're observing the situation where new connections are not allowed after receiving the SIGTERM.

This doesn't work for us on kubernetes, as our applications gracefully shutting down may well create new connections to do so (we have some that do some DB flushing for example).

Could anyone advise if or how this is going to be addressed?

@kurtisvg
Copy link
Contributor

@Stono - Are you using the term_timeout flag? IIUC, the proxy shouldn't refuse new connections during the graceful shutdown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants