Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: remove listener closing race #253

Merged
merged 1 commit into from
Aug 1, 2023

Conversation

flotter
Copy link
Contributor

@flotter flotter commented Jul 4, 2023

We use http.Server with a number of custom net.Listener instances.

This patch makes some changes to the daemon.Stop(...) function.

In order for the server instance to Shutdown() gracefully, the listener
termination must be performed by the shutdown process itself. The
server shutdown process is thread-safe, and tracks the listeners in
use. Thread safety is important because the http.Server has two places
where listeners can be terminated:

  1. The thread hosting the Serve() method, using a deferred mutex
    protected closing function for all registered listeners.

  2. The Shutdown() method (with mutex protection), which will close
    listeners async resulting in an error from the listener Accept(),
    but this special case is handled in Serve() and returns the
    ErrServerClosed error which means shutdown was done cleanly.

However, in daemon.go we also pre-close all the listeners before the
call to Shutdown(), which causes two kinds of errors:

  1. accept tcp [::]:: use of closed network connection

This error is actually masked because of tomb.Kill() (when state is dying)
for the termination of the daemon. See the daemon.Start() exit code.

  1. close tcp [::]:: use of closed network connection

This is the actual race condition symptom that can sometimes be seen in
CI logs. Since the listeners are pre-closed, this error is expected. However,
errors discovered during deferred closure in the Serve() thread are discarded
while errors in Shutdown() as reported.

The race is between the two closing paths. If the Shutdown path wins the
sync.Once race, we see the error. If the deferred Serve() path wins, we see
nothing.

This race is only triggered because we manually close the listener first.

A stand-alone simulation of the problem can be seen here:

https://gist.github.com/flotter/483d233d38f0961a52f7ef7405887012

The following changes are introduced:

  • Remove the early manual listener closing code so its properly done by
    the Shutdown() call.

@flotter flotter requested review from benhoyt and anpep July 4, 2023 15:45
internals/daemon/daemon.go Outdated Show resolved Hide resolved
Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic sounds reasonable to me. Thanks again for the detailed analysis, @flotter!

@benhoyt
Copy link
Contributor

benhoyt commented Jul 5, 2023

I also pinged @mvo5 on Mattermost, as from a quick glance at snapd's daemon.go, this same issue applies there.

We use http.Server with a number of custom net.Listener instances.

This patch makes some changes to the daemon.Stop(...) function.

In order for the server instance to Shutdown() gracefully, the listener
termination must be performed by the shutdown process itself. The
server shutdown process is thread-safe, and tracks the listeners in
use. Thread safety is important because the http.Server has two places
where listeners can be terminated:

1. The thread hosting the Serve() method, using a deferred mutex
   protected closing function for all registered listeners.

2. The Shutdown() method (with mutex protection), which will close
   listeners async resuling in an error from the listener Accept(),
   but this special case is handled in Serve() and returns the
   ErrServerClosed error which means shutdown was done cleanly.

However, in daemon.go we also pre-close all the listeners before the
call to Shutdown(), which causes two kinds of errors:

1. accept tcp [::]:<port>: use of closed network connection

This error is actually masked because of tomb.Kill() (when state is dying)
for the termination of the daemon. See the daemon.Start() exit code.

2. close tcp [::]:<port>: use of closed network connection

This is the actual race condition symptom that can sometimes be seen in
CI logs. Since the listeners are pre-closed, this error is expected. However,
errors discovered during deferred closure in the Serve() thread are discarded
while errors in Shutdown() as reported.

The race is between the two closing paths. If the Shutdown path wins the
sync.Once race, we see the error. If the deferred Serve() path wins, we see
nothing.

This race is only triggered because we manually close the listener first.

A standalone simulation of the problem can be seen here:

https://gist.github.com/flotter/483d233d38f0961a52f7ef7405887012

The following changes are introduced:

- Remove the early manual listener closing code so its properly done by
  the Shutdown() call.
Copy link
Contributor

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thread hosting the Serve() method, using a deferred mutex protected closing function for all registered listeners.

Where are these Close calls? Both Serve goroutines are fired in the Start method, and there are no Close calls there?

@benhoyt
Copy link
Contributor

benhoyt commented Jul 12, 2023

@niemeyer The close calls Fred's referring to are the defer l.Close() in the net/http Server.Serve method. Shutdown closes all the listeners and gracefully waits for things to finish, so we don't need to call Close manually at all.

@benhoyt
Copy link
Contributor

benhoyt commented Jul 13, 2023

@niemeyer's comments from Mattermost for the record:

I see, thanks for the details. It smells slightly strange as the initialization of these listeners is done separately in Init, and the goroutines are only started inside Start, so the disconnection feels brittle. It's also unclear to me how this logic I think is unchanged from snaps, and yet we don't have issues there. Wonder what happened in between.

Easiest path is probably to just organize things a bit better. I don't see, for instance, what's the value of keeping these listeners around if Serve is going to kill them.

Copy link
Contributor

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just merge this for now. The interlock between the Init of the Daemon, and the Start/Stop pair, don't feel ideal. For example, if we Init the daemon, we already get open listeners, but to get rid of the open listeners, we need to Stop it. Except, if we run Init/Stop without Start, it will crash because d.serve is not there yet. So it does need polishing.

With that said, given the above it also feels like dropping the explicit closing won't make it any worse, since Shutdown is called unconditionally on that sequence. So +1 on just merging this now and polishing it later.

@jnsgruk jnsgruk merged commit e214581 into canonical:master Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants