net/http: investigate Transport's use of cached connections upon resume from sleep #29308
See also #24595.

@bradfitz Any idea if this is being worked on / investigated?

@chris-vest thanks for the ping! Brad is currently very busy and is also currently on leave, but I am going to start looking at this issue next week and hopefully have something for Go 1.14.
I think perhaps one of these might be reversed? After a SIGCONT is sent, the monotonic clock usually accumulates more nanoseconds than the wall clock -- I might be mistaken though.
So firstly, when we go to sleep, there are 2 main cases (or 3 sub-cases):

I spent quite some time digging through the standard library as well as the google-cloud-go work, and basically these scenarios can be reproduced just by simulating how to stop processes by sending signals, as in https://github.com/odeke-em/bugs/tree/master/golang/29308

Way forward

Perhaps a diff:

diff --git a/src/net/http/transport.go b/src/net/http/transport.go
index 26f642aa7a..138df57765 100644
--- a/src/net/http/transport.go
+++ b/src/net/http/transport.go
@@ -29,6 +29,7 @@ import (
"sync"
"sync/atomic"
"time"
+ _ "unsafe"
"golang.org/x/net/http/httpguts"
"golang.org/x/net/http/httpproxy"
@@ -603,7 +604,10 @@ func (pc *persistConn) shouldRetryRequest(req *Request, err error) bool {
// creating new connections and retrying if the server
// is just hanging up on us because it doesn't like
// our request (as opposed to sending an error).
- return false
+
+ // As per golang.org/issue/29308, let's only return true if time
+ // drifted.
+ return pc.timeDrifted()
}
if _, ok := err.(nothingWrittenError); ok {
// We never wrote anything, so it's safe to retry, if there's no body or we
@@ -628,6 +632,44 @@ func (pc *persistConn) shouldRetryRequest(req *Request, err error) bool {
return false // conservatively
}
+// timeDrifted reports whether we can detect that our wall and monotonic clocks
+// have drifted apart. There are 2 scenarios in which we'll return true:
+// 1. If pconn.t.IdleConnTimeout is non-zero and
+// pconn.idleAt+pconn.t.IdleConnTimeout is before time.Now()
+// 2. If the monotonic clock elapsed time and the wall clock elapsed time
+// differ by more than a heuristic percentage.
+func (pc *persistConn) timeDrifted() bool {
+ pc.mu.Lock()
+ idleAt := pc.idleAt
+ var idleConnTimeout time.Duration
+ if pc.t != nil {
+ idleConnTimeout = pc.t.IdleConnTimeout
+ }
+ pc.mu.Unlock()
+
+ if idleAt.IsZero() {
+ // We need to have a non-zero idleAt in order
+ // to try to guess if we've drifted.
+ return false
+ }
+
+ now := time.Now()
+
+ if idleConnTimeout > 0 {
+ // Since we have a Transport.IdleConnTimeout, if we were put in limbo
+ // and our idle conn cleanup routine didn't fire, manually check here.
+ return now.After(idleAt.Add(idleConnTimeout))
+ }
+
+ diffWallSecs := float64(now.UnixNano()-idleAt.UnixNano()) / 1e9
+ diffMonoSecs := float64(monotonicNs(now)-monotonicNs(idleAt)) / 1e9
+ driftPercentage := 100 * (diffWallSecs - diffMonoSecs) / diffWallSecs
+
+ // As a heuristic, if the wall clock and the monotonic clock elapsed times
+ // have drifted apart by say over 10%, then report this as a drifted time.
+ return driftPercentage >= 10
+}
+
// ErrSkipAltProtocol is a sentinel error value defined by Transport.RegisterProtocol.
var ErrSkipAltProtocol = errors.New("net/http: skip alternate protocol")
@@ -878,6 +920,9 @@ func (t *Transport) tryPutIdleConn(pconn *persistConn) error {
return nil
}
+//go:linkname monotonicNs time.mono
+func monotonicNs(t time.Time) int64
+
// getIdleConnCh returns a channel to receive and return idle
// persistent connection for the given connectMethod.
// It may return nil, if persistent connections are not being used.
@@ -1429,6 +1474,10 @@ func (t *Transport) dialConn(ctx context.Context, cm connectMethod) (*persistCon
pconn.br = bufio.NewReaderSize(pconn, t.readBufferSize())
pconn.bw = bufio.NewWriterSize(persistConnWriter{pconn}, t.writeBufferSize())
+ // Perhaps we are abusing idleAt but really we want to
+ // capture the time that the connection was created.
+ pconn.idleAt = time.Now()
+
go pconn.readLoop()
go pconn.writeLoop()
return pconn, nil
diff --git a/src/time/time.go b/src/time/time.go
index c8116a74f4..06da8009f3 100644
--- a/src/time/time.go
+++ b/src/time/time.go
@@ -1091,6 +1091,10 @@ func daysIn(m Month, year int) int {
return int(daysBefore[m] - daysBefore[m-1])
}
+func mono(t Time) int64 {
+ return t.ext
+}
+
// Provided by package runtime.
func now() (sec int64, nsec int32, mono int64)

Results

Before patch

server: 2019/06/21 19:12:54 Listening at "[::]:49373"
client: 2019/06/21 19:12:55 URL: http://[::]:49373/hello
server: 2019/06/21 19:12:55 Request from: [::1]:49376
client: 2019/06/21 19:12:55 Blob: Server responding ASAP
client: 2019/06/21 19:12:55 Pausing for 1.5s
client: 2019/06/21 19:12:57 URL: http://[::]:49373/hello
server: 2019/06/21 19:12:57 Request from: [::1]:49376
server: 2019/06/21 19:12:57 Now going to lag for 3s
server: 2019/06/21 19:12:57 Pausing for 2s before reviving client with pid: 21447
server: 2019/06/21 19:13:27 Request from: [::1]:49377
server: 2019/06/21 19:13:27 Now going to lag for 3s
client: 2019/06/21 19:13:33 Failed to make request("http://[::]:49373/hello"): Get http://[::]:49373/hello: read tcp [::1]:49377->[::1]:49373: read: connection reset by peer

After proposed patch

server: 2019/06/21 19:11:00 Listening at "[::]:49346"
client: 2019/06/21 19:11:01 URL: http://[::]:49346/hello
server: 2019/06/21 19:11:01 Request from: [::1]:49349
client: 2019/06/21 19:11:01 Blob: Server responding ASAP
client: 2019/06/21 19:11:01 Pausing for 1.5s
client: 2019/06/21 19:11:02 URL: http://[::]:49346/hello
server: 2019/06/21 19:11:02 Request from: [::1]:49349
server: 2019/06/21 19:11:02 Now going to lag for 3s
server: 2019/06/21 19:11:02 Pausing for 2s before reviving client with pid: 21387
server: 2019/06/21 19:11:32 Request from: [::1]:49360
server: 2019/06/21 19:11:32 Now going to lag for 3s
server: 2019/06/21 19:11:39 Request from: [::1]:49362

Please let me know what y'all think.
Change https://golang.org/cl/183557 mentions this issue:
One other thing I discovered today while thinking out loud about this issue is that perhaps the runtime on getting a SIGCONT can refetch the current time and go update all the previous timers that may have drifted. This might even be the simpler and more correct solution instead of the addition to get time.mono. |
@odeke-em, what about https://go-review.googlesource.com/c/go/+/204797? Seems a bit simpler. Could you try it out?

Roger that and nice work @bradfitz! Let me try it out right now.

Change https://golang.org/cl/204797 mentions this issue:

@bradfitz this gist might help in automation of the code for easier feedback loops when making the change: https://gist.github.com/odeke-em/639f947edded2f86ae34d286fb12f875#file-main-go
…) time

Both laptops closing their lids and cloud container runtimes suspending VMs faced the problem where an idle HTTP connection used by the Transport could be cached for later reuse before the machine is frozen, only to wake up many minutes later and think that the HTTP connection was still good (because only a second or two of monotonic time passed), and then find out that the peer hung up on them when they went to write.

HTTP/1 connection reuse is inherently racy like this, but there's no need for us to step into a trap if we can avoid it. Also, not everybody sets Request.GetBody to enable re-tryable POSTs. And we can only safely retry requests in some cases.

So with this CL, before reusing an old connection, double check the walltime.

Testing was done both with a laptop (closing the lid for a bit) and with QEMU, running "stop" and "cont" commands in the monitor and sending QMP guest agent commands to update its wall clock after the "cont":

echo '{"execute":"guest-set-time"}' | socat STDIN UNIX-CONNECT:/var/run/qemu-server/108.qga

In both cases, I was running https://gist.github.com/bradfitz/260851776f08e4bc4dacedd82afa7aea and watching that the RemoteAddr changed after resume.

It's kinda difficult to write an automated test for. I gave a lightning talk on using pure emulation user mode qemu for such tests:

https://www.youtube.com/watch?v=69Zy77O-BUM
https://docs.google.com/presentation/d/1rAAyOTCsB8GLbMgI0CAbn69r6EVWL8j3DPl4qc0sSlc/edit?usp=sharing
https://github.com/google/embiggen-disk/blob/master/integration_test.go

... that would probably be a good direction if we want an automated test here. But I don't have time to do that now.

Updates #29308 (HTTP/2 remains)

Change-Id: I03997e00491f861629d67a0292da000bd94ed5ca
Reviewed-on: https://go-review.googlesource.com/c/go/+/204797
Reviewed-by: Bryan C. Mills <[email protected]>
This seems like another example where using BOOTTIME would help? (#24595)

Change https://golang.org/cl/208798 mentions this issue:
This is the http2 version of CL 204797.

Updates golang/go#29308 (fixes once bundled into std)

Change-Id: I7cd97d38c941e9a8a62808e23b6533c72760f003
Reviewed-on: https://go-review.googlesource.com/c/net/+/208798
Run-TryBot: Brad Fitzpatrick <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Bryan C. Mills <[email protected]>
Change https://golang.org/cl/209077 mentions this issue:
When the http.Transport has cached persistent connections to a server and the machine is suspended (e.g. laptop lid closing, serverless environments), the monotonic clock might not advance but the wall clock will.
This is a tracking bug to verify how Transport behaves in such cases.
It would be bad if during the machine's time asleep we got a TCP packet to close a connection but we never saw it (due to being asleep), and then upon resume we try to re-use that dead TCP connection, get a write error, and then are unable to retry for whatever reason (non-idempotent or non-replayable request).
Before using a connection we should look at the current wall time and compare it to the wall time of when it was last idle. We might already be doing that (by accident?) but we might also be accidentally using monotonic time, in which case we wouldn't notice the missing chunks of time.
Investigate.
/cc @jadekler