net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed #36700

josharian · 2020-01-22T19:41:04Z

What version of Go are you using (`go version`)?

Go 1.14 beta 1

Does this issue reproduce with the latest release?

Not sure; I can't compile my program with 1.13.

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/josh/Library/Caches/go-build"
GOENV="/Users/josh/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/josh"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/josh/go/1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/josh/go/1.14/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/josh/x/skiff/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/1t/n61cbvls5bl293bbb0zyypqw0000gn/T/go-build473522411=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Made a series of http requests from a Go client to a Go server. The requests were vanilla HTTP requests, using http.Get.

I lost connectivity at some point. When I regained connectivity, all subsequent http requests failed with error read: connection reset by peer. I waited quite a long time, and it never recovered.

It has happened a couple of times, but doesn't reproduce reliably (which is unsurprising, since me closing my laptop lid is not exactly a precision affair).

This looks very similar to #34978, although I don't know when exactly the connectivity failed.

Tentatively marking as Go 1.14.

cc @bradfitz @fraenkel @tombergan

The text was updated successfully, but these errors were encountered:

fraenkel · 2020-01-23T01:12:50Z

You were probably affected by #24138

josharian · 2020-01-23T20:17:41Z

You were probably affected by #24138

Ugh. Thanks. This is about (not) having timeouts on the client side, right? It's unfortunate that it affects brand new requests and that it permanently breaks all new requests.

Could we add a (when-broken-only?) connection re-use timeout without breaking the compatibility promise? It seems like connection re-use decisions are entirely up to net/http.

Any suggestions for workarounds? It's tricky since I am actually doing long polling requests, where I actively want to keep connections open for hours, so I don't want to set aggressive timeouts. Would it help to call http.DefaultClient.CloseIdleConnections() whenever an http request fails? Or start using a new http.Client whenever an http request fails?

networkimprov · 2020-01-23T21:16:06Z

A different culprit than lack of timeout might be lack of resume-after-suspend notification. That would let us ping the server on resume and reconnect if nec. Also discussed in #36141 (comment)

Maybe unrelated, but the CL for #23459 (net.Dial TCP keepalive on by default) didn't touch net/http. Does http.Client not use net.Dial, or is this disabled for http.Client by a different CL?

bcmills · 2020-01-23T21:21:13Z

Given that we don't know whether 1.13 was affected, I suspect this should not be milestoned to 1.14 at this point.

josharian · 2020-01-23T21:35:01Z

@bcmills I milestoned based on superficial similarity to #34978, which has outstanding backports to 1.13. Feel free to re-milestone, although it might be better to understand a bit more first.

fraenkel · 2020-01-24T01:10:53Z

@josharian Did you enable http2 on the Server? If so, curious what happens with http/1.1.

josharian · 2020-01-24T01:56:18Z

I used net/http out of the box. I will run for a bit with http2 disabled and see what happens.

networkimprov · 2020-01-24T22:56:15Z

In 1.13 (and presumably earlier; not sure), http.DefaultTransport uses TCP keepalive. On client suspend/resume where the server drops the link (e.g. due to no traffic), subsequent client Write() disables TCP keepalive, preventing a timeout error from the last client Read() -- see #31490. I don't know whether this is preventing your http.DefaultClient from recovering, but thought I should mention it.

Setting a short keepalive helped in my case (a plain TCP/TLS context, not HTTP). Eventually I plan to disable keepalive and use a server pulse to detect a dead link.

From net/http/transport.go:

// DefaultTransport is the default implementation of Transport and is
// used by DefaultClient. It establishes network connections as needed
// and caches them for reuse by subsequent calls. It uses HTTP proxies
// as directed by the $HTTP_PROXY and $NO_PROXY (or $http_proxy and
// $no_proxy) environment variables.
var DefaultTransport RoundTripper = &Transport{
        Proxy: ProxyFromEnvironment,
        DialContext: (&net.Dialer{
                Timeout:   30 * time.Second,
                KeepAlive: 30 * time.Second,
                DualStack: true,
        }).DialContext,
        ForceAttemptHTTP2:     true,
        MaxIdleConns:          100,
        IdleConnTimeout:       90 * time.Second,
        TLSHandshakeTimeout:   10 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
}

bradfitz · 2020-01-25T02:22:55Z

@networkimprov, that seems unrelated.

josharian · 2020-01-26T04:46:19Z

I finally tracked this down.

It turns out that there was a code path in which in I didn't close resp.Body. Timeouts didn't help.

Although this was user error (sorry and thanks!), I do wonder whether there's a way to detect this, since it was not easy to find.

One option is a vet check that, similar to the lostcancel check, checks that resp.Body.Close gets called on all exits from a function. Another option would be some kind of check at runtime in the http package, although I don't know net/http well enough to know whether that is feasible.

josharian · 2020-01-26T05:09:32Z

Correction: timeouts were also necessary (in addition to properly closing resp.Body).

josharian · 2020-01-26T05:31:09Z

Closing as user error.

bcmills · 2020-01-27T17:20:16Z

@josharian, could you provide some more detail about how this resulting from a missing resp.Body.Close?

The documentation for the Body field says:

The default HTTP client's Transport may not
reuse HTTP/1.x "keep-alive" TCP connections if the Body is
not read to completion and closed.

It does not say anything about what happens for HTTP/2 connections, so at the very least there seems to be a documentation issue here.

bcmills · 2020-01-27T17:23:23Z

This also seems like something we could detect and report, using a finalizer-based approach similar to the one described in #24739: if the Body becomes unreachable without being closed, that always indicates a caller error, and could be reported as such (perhaps as a panic, or perhaps by some other means).

bradfitz · 2020-01-27T17:48:41Z

Yeah, I don't think forgetting to close resp.Body should cause anything worse than a TCP connection leak. It certainly shouldn't affect unrelated HTTP fetches.

josharian · 2020-01-27T17:57:15Z

I'm honestly not sure. I tried to reproduce using multiple simpler ssetup and failed. There was clearly something else going on as well that closing the Body helped with. I went looking to try to understand better, but I've already spent over a full day on this, and I'm not really inclined to spend more time digging.

I do think a finalizer-based unclosed unreachable Body detection would be a good idea, since that is definitely a bug, regardless of what precise consequences it has. I suspect it'd be better to log instead of panicking--net/http already logs when it notices an ineffectual WriteHeader call, so there's some precedent for it.

Not closing the response body presumably claused flakyness in the registry replacer when it tries to read a response body[0], see[1] for why this can happen. As this seems to be a common mistake in our codebase, this change activates a linter for it. [0] https://issues.redhat.com/browse/DPTP-1692 [1] golang/go#36700

rogpeppe · 2020-12-04T15:33:13Z

Possibly related: I just saw this error message in our CI system. I couldn't reproduce it, but I can't see why closing idle connections should cause this kind of error:

Delete "http://127.0.0.1:35617/api/v2/variables/020f755c3c082000": net/http: HTTP/1.x transport connection broken: http: CloseIdleConnections called

fraenkel · 2020-12-04T17:29:17Z

The above might have just been fixed via #41600 and #42026.

siredwin · 2021-10-17T03:17:48Z

This makes so much sense.
I have two API calls, one has
c.Response().Header().Set("Connection", "keep-alive")
defer c.Request().Body.Close() // Added this one
and the other one
c.Response().Header().Set("Connection", "keep-alive")

Guess which one keeps giving me write: connection reset by peer?

#599 went in a wrong direction, so I revert it here: it makes all `TestMultipleClients` tests flaky since not all members are necessarily joined after the updates Instead, to fix the `TestMultipleClientsWithMixedLabelsAndExpectFailure` test, I add a couple more labeled members, that way, even if we reach 3 updates (which happened some times), we'll never get to 5 with a single member Also, try to fix the `TestTLSServerWithLocalhostCertWithClientCertificateEnforcementUsingClientCA1` test by closing GRPC connections. Maybe this (golang/go#36700) is the issue?

* Really fix flaky `TestMultipleClients` tests #599 went in a wrong direction, so I revert it here: it makes all `TestMultipleClients` tests flaky since not all members are necessarily joined after the updates Instead, to fix the `TestMultipleClientsWithMixedLabelsAndExpectFailure` test, I add a couple more labeled members, that way, even if we reach 3 updates (which happened some times), we'll never get to 5 with a single member Also, try to fix the `TestTLSServerWithLocalhostCertWithClientCertificateEnforcementUsingClientCA1` test by closing GRPC connections. Maybe this (golang/go#36700) is the issue? * Set `MaxIdleConnsPerHost` * New attempt to fix this test. Could it be aborted connections? * Retry RSTs

seankhliao · 2024-12-03T18:59:50Z

i suppose someone can open a new issue if they can still reproduce it.

josharian added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 22, 2020

josharian added this to the Go1.14 milestone Jan 22, 2020

josharian modified the milestones: Go1.14, Backlog Jan 25, 2020

josharian closed this as completed Jan 26, 2020

bcmills reopened this Jan 27, 2020

bcmills added the Documentation label Jan 27, 2020

bcmills changed the title ~~net/http: permanently broken connection with error "read: connection reset by peer"~~ net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed Jan 27, 2020

alvaroaleman mentioned this issue Nov 17, 2020

Close http bodies and add linter for it openshift/ci-tools#1423

Merged

nikhil-bhat mentioned this issue Nov 3, 2023

Prometheus scrape not able to recover from tcp write broken pipe prometheus/prometheus#13094

Closed

julienduchesne mentioned this issue Oct 9, 2024

Really fix flaky TestMultipleClients tests grafana/dskit#603

Merged

gabyhelp mentioned this issue Oct 31, 2024

net/http: http.Get fails with "connection reset by peer" after update to Go 1.23 but works flawlessly up to 1.22.8 #70139

Closed

seankhliao closed this as not planned Won't fix, can't repro, duplicate, stale Dec 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed #36700

net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed #36700

josharian commented Jan 22, 2020

fraenkel commented Jan 23, 2020

josharian commented Jan 23, 2020

networkimprov commented Jan 23, 2020

bcmills commented Jan 23, 2020

josharian commented Jan 23, 2020 •

edited

Loading

fraenkel commented Jan 24, 2020 •

edited

Loading

josharian commented Jan 24, 2020

networkimprov commented Jan 24, 2020

bradfitz commented Jan 25, 2020

josharian commented Jan 26, 2020

josharian commented Jan 26, 2020

josharian commented Jan 26, 2020

bcmills commented Jan 27, 2020

bcmills commented Jan 27, 2020

bradfitz commented Jan 27, 2020

josharian commented Jan 27, 2020

rogpeppe commented Dec 4, 2020

fraenkel commented Dec 4, 2020

siredwin commented Oct 17, 2021

seankhliao commented Dec 3, 2024

net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed #36700

net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed #36700

Comments

josharian commented Jan 22, 2020

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

fraenkel commented Jan 23, 2020

josharian commented Jan 23, 2020

networkimprov commented Jan 23, 2020

bcmills commented Jan 23, 2020

josharian commented Jan 23, 2020 • edited Loading

fraenkel commented Jan 24, 2020 • edited Loading

josharian commented Jan 24, 2020

networkimprov commented Jan 24, 2020

bradfitz commented Jan 25, 2020

josharian commented Jan 26, 2020

josharian commented Jan 26, 2020

josharian commented Jan 26, 2020

bcmills commented Jan 27, 2020

bcmills commented Jan 27, 2020

bradfitz commented Jan 27, 2020

josharian commented Jan 27, 2020

rogpeppe commented Dec 4, 2020

fraenkel commented Dec 4, 2020

siredwin commented Oct 17, 2021

seankhliao commented Dec 3, 2024

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

josharian commented Jan 23, 2020 •

edited

Loading

fraenkel commented Jan 24, 2020 •

edited

Loading