Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: Re-connect with upgraded HTTP2 connection fails to send Request.body #32441

Closed
brianfife opened this issue Jun 5, 2019 · 23 comments
Closed
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@brianfife
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.12.5 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/bfife/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/bfife/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/x7/yphtc2sj4ml2lrflpwdblvrxsmcq9d/T/go-build801671760=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

This is a follow up of issues #25009 #28206

We are sending requests as follows:

proto.transport = &http.Transport{}
proto.client = &http.Client{Transport: proto.transport}
req, err := http.NewRequest(request.Method, url, bytes.NewBuffer(request.Body))
resp, err := proto.client.Do(req)

What did you expect to see?

Good behaviour

With the above client configuration, I was able to determine the following code was sending out the request using the RoundTrip function noted here:

https://github.com/golang/go/blob/release-branch.go1.12/src/net/http/transport.go#L427

	if t.useRegisteredProtocol(req) {
		altProto, _ := t.altProto.Load().(map[string]RoundTripper)
		if altRT := altProto[scheme]; altRT != nil {
			if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
				return resp, err
			}
		}
	}

In our case, for the first and following requests t.useRegisteredProtocol(req) returns true, so altRT.RoundTrip(req) tries to send via h2_bundle's RoundTrip. As no connection exists at this point in time, this will fail due to error "http2: no cached connection was available", bubbling up ErrSkipAltProtocol "net/http: skip alternate protocol", and the code continues on. n.b. subsequent requests will be sent via altRT.RoundTrip(req) if there are cached connections available, however a GoAway received from the server will trigger the same behaviour of advancing to the next section of code.

https://github.com/golang/go/blob/release-branch.go1.12/src/net/http/transport.go#L447

	for {
		select {
		case <-ctx.Done():
			req.closeBody()
			return nil, ctx.Err()
		default:
		}

		// treq gets modified by roundTrip, so we need to recreate for each retry.
		treq := &transportRequest{Request: req, trace: trace}
		cm, err := t.connectMethodForRequest(treq)
		if err != nil {
			req.closeBody()
			return nil, err
		}

		// Get the cached or newly-created connection to either the
		// host (for http or https), the http proxy, or the http proxy
		// pre-CONNECTed to https server. In any case, we'll be ready
		// to send it requests.
		pconn, err := t.getConn(treq, cm)
		if err != nil {
			t.setReqCanceler(req, nil)
			req.closeBody()
			return nil, err
		}

		var resp *Response
		if pconn.alt != nil {
			// HTTP/2 path.
			t.decHostConnCount(cm.key()) // don't count cached http2 conns toward conns per host
			t.setReqCanceler(req, nil)   // not cancelable with CancelRequest
			resp, err = pconn.alt.RoundTrip(req)
		} else {
			resp, err = pconn.roundTrip(treq)
		}
		if err == nil {
			return resp, nil
		}
...

In the above for loop, for the first request without any connections available, a new transport connection is created, and pconn.alt.RoundTrip(req) or pconn.RoundTrip(req) is then used to submit and fulfill the Request, and this likely completes successfully. So far, so good.

As noted above, subsequent requests from our client continue to be sent from altRT.RoundTrip(req), at which point they are returned, without reaching the for loop.

Bad behaviour

After some time, our load balancer may respond with a GoAway response after a full request is transmitted, including data from the Request.Body, as seen with the following debug:

2019/06/04 02:22:26.750150 http2: Framer 0xc000470a80: wrote HEADERS flags=END_HEADERS stream=17 len=12
2019/06/04 02:22:26.750321 http2: Framer 0xc000470a80: wrote DATA stream=17 len=125 data="STRINGOFDATA"
2019/06/04 02:22:26.750386 http2: Framer 0xc000470a80: wrote DATA flags=END_STREAM stream=17 len=0 data=""
2019/06/04 02:22:26.753144 http2: Framer 0xc000470a80: read GOAWAY len=8 LastStreamID=15 ErrCode=NO_ERROR Debug=""
2019/06/04 02:22:26.753217 http2: Transport readFrame error on conn 0xc0001a3e00: (*errors.errorString) EOF
2019/06/04 02:22:26.753291 http2: Transport failed to get client conn for server:443: http2: no cached connection was available

This results in another transport being created in the for loop shown above, however the Request.Body is not being reset before the request is sent on the newly created transport connection.

The rewind code at https://github.com/golang/go/blob/release-branch.go1.12/src/net/http/transport.go#L497 doesn't get executed if pconn.alt.RoundTrip(req) or pconn.RoundTrip(req) are successful in sending a request after a connection is re-established. From my limited testing, I think the only time the rewind is reached is when TLSNextProto is manually set as it is in TestTransportBodyAltRewind, because this skips setting up the altProto map in transport.go/onceSetNextProtoDefaults(), as altRT is nil.

It looks like moving the rewind code from L497 to L455 would satisfy the case where an existing connection receives a GoAway frame and the Body had been sent to the server (as described above), however it would also execute in the case of the initial request where a transport does not exist and the request hasn't been sent (which makes that execution not necessary).

I believe this test case demonstrates the issue described above:

func TestTransportBodyGoAwayReceived(t *testing.T) {
        var err error
        var resp *Response
        req, _ := NewRequest("POST", "https://example.org/", bytes.NewBufferString("request"))
        tr := &Transport{DisableKeepAlives: true }
        tr.RegisterProtocol("https", httpsProto{})
        c := &Client{Transport: tr}
        resp, err = c.Do(req)
        if err != nil {
                t.Error(err)
        }
        if resp != nil {
                defer resp.Body.Close()
                ioutil.ReadAll(resp.Body)
        }
}

type httpsProto struct{}

func (httpsProto) RoundTrip(req *Request) (*Response, error) {
        defer req.Body.Close()
        ioutil.ReadAll(req.Body)
        return nil, ErrSkipAltProtocol
}

output

../bin/go test -v -run=TestTransportBodyGoAwayReceived net/http
=== RUN   TestTransportBodyGoAway
2019/06/04 21:36:24 Error enabling Transport HTTP/2 support: protocol https already registered
--- FAIL: TestTransportBodyGoAway (0.42s)
    transport_internal_test.go:274: Post https://example.org/: http: ContentLength=7 with Body length 0
FAIL
FAIL	net/http	0.495s

Moving the rewind code to the top of the for loop as described above results in success:

./bin/go test -v -run=TestTransportBodyGoAwayReceived net/http
=== RUN   TestTransportBodyGoAway
2019/06/04 21:35:54 Error enabling Transport HTTP/2 support: protocol https already registered
--- PASS: TestTransportBodyGoAway (0.56s)
PASS
ok  	net/http	0.642s
@gopherbot gopherbot added this to the Unreleased milestone Jun 5, 2019
@brianfife
Copy link
Author

As I am not in a position to sign the CLA and submit a fix, it would be good to get some maintainers to have a look at this issue.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 10, 2019
@dmitshur
Copy link
Contributor

/cc @bradfitz @FiloSottile

@brianfife
Copy link
Author

Sorry to bump, but is anyone able to have a look at this ? @bradfitz @FiloSottile

@dcarney
Copy link

dcarney commented Jul 19, 2019

Thanks for opening this issue @brianfife. Also, thanks for linking to #25009 and #28206, as they appear to be related (even though they were marked as closed as of Go 1.12). I'm on 1.12.7 and I'm seeing this issue with clients connecting to the API I work on.

You did a good job of showing the issue in the actual transport code, but I figured I'd add some of my client debug logs to the issue for additional context.

As you described, I always see this issue manifest after the server sends a GOAWAY frame after a request body has been sent. The internal retry logic in http2shouldRetryRequest in h2_bundle.go calls .GetBody() on the request in order to "recreate" the body to send on a new TCP connection, but the bytes never make it onto the wire:

...
2019/07/18 12:51:41 http2: Framer 0xc00027a000: read GOAWAY len=8 LastStreamID=1999 ErrCode=NO_ERROR Debug=""
2019/07/18 12:51:41 http2: Transport received GOAWAY len=8 LastStreamID=1999 ErrCode=NO_ERROR Debug=""
2019/07/18 12:51:41 http2: Transport failed to get client conn for api.foobar.com:443: http2: no cached connection was available
...
2019/07/18 12:51:41 http2: Transport creating client conn 0xc000508900 to 12.34.56.78:443
2019/07/18 12:51:41 http2: Transport encoding header "content-type" = "application/x-www-form-urlencoded"
2019/07/18 12:51:41 http2: Transport encoding header "content-length" = "51"
2019/07/18 12:51:41 http2: Transport encoding header "accept-encoding" = "gzip"
2019/07/18 12:51:41 http2: Framer 0xc000409500: wrote HEADERS flags=END_HEADERS stream=1 len=135
2019/07/18 12:51:41 http2: Framer 0xc000409500: wrote DATA flags=END_STREAM stream=1 len=0 data=""

In the above, we can clearly see that the request contained a 51 byte, form-encoded body but the H/2 framer never sent it, setting the END_STREAM flag after sending 0 bytes of data.

We typically see this happen when the client is using a moderate amount of concurrency, since the chances are increased that the periodic GOAWAY frames from the server will arrive after a request body has been sent.

Interestingly, the Go bug manifests when the transport is left to choose H/2 on it's own via the connection upgrade codepath (aka TLSNextProto).

So, either of these options will result in a client that does not repro this bug:

// Force H/2 explicitly
client := &http.Client{
    Transport: &http2.Transport{},
}

// OR Force HTTP/1.1 explicitly
client := http.Client{
    Transport: &http.Transport{
        TLSNextProto: make(map[string]func(string, *tls.Conn) http.RoundTripper),
    },
}

// OR, run the repro code with GODEBUG="http2client=0" to disable H/2

brandur-stripe pushed a commit to stripe/stripe-go that referenced this issue Jul 24, 2019
See #866, but once in a while when using HTTP/2, stripe-go sends an
empty request body which then results in a 400 back from the server.

We tracked the issue down to this one open on the Go language's HTTP/2
implementation:

golang/go#32441

For now, the best thing we can do is disable HTTP/2 by default for new
integrations. When the bug is fixed, we can turn it back on again.
@dcarney
Copy link

dcarney commented Aug 20, 2019

To echo @brianfife, I'm also sorry to bump this, but I'm wondering if there's any movement or thoughts on how to move forward with this particular issue? @bradfitz @FiloSottile

@ernado
Copy link
Contributor

ernado commented Sep 26, 2019

Correct, body is not re-winded after altRT.RoundTrip here:

if t.useRegisteredProtocol(req) {
altProto, _ := t.altProto.Load().(map[string]RoundTripper)
if altRT := altProto[scheme]; altRT != nil {
if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
return resp, err
}
}
}

I've missed that during #25009 fix because the trigger condition is less frequent and was not covered by my go-only test-case.
This condition triggers when http2Transport.RoundTripOpt is failing to get a connection during retry (e.g. on attempt to retry after GOAWAY when all other connections in pool are terminated), which results to http2noCachedConnError that is transformed into ErrSkipAltProtocol in http2noDialH2RoundTripper.RoundTrip.

If we change to something like that, everything should be OK:

	if t.useRegisteredProtocol(req) {
		altProto, _ := t.altProto.Load().(map[string]RoundTripper)
		if altRT := altProto[scheme]; altRT != nil {
			if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
				return resp, err
			} else if req.GetBody != nil {
				// Rewind the body after ErrSkipAltProtocol, because it can
				// be already used.
				newReq := *req
				var err error
				newReq.Body, err = req.GetBody()
				if err != nil {
					return nil, err
				}
				req = &newReq
			}
		}
	}

I've tested this change with my repro and got zero "400 Bad Request" errors, but I'm not sure that it is correct solution, probably we can refactor something to make this rewind more explicit.

/cc @bradfitz

@bobby-stripe
Copy link

the above looks reasonable to me, @ernado do you want to open a PR?

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/210123 mentions this issue: net/http: rewind request body on ErrSkipAltProtocol

darkdarkdragon added a commit to livepeer/go-livepeer that referenced this issue Feb 12, 2020
orchestrator and transcoder.
Mitigates issue in the golang's implementation of HTTP2
golang/go#32441
darkdarkdragon added a commit to livepeer/go-livepeer that referenced this issue Feb 12, 2020
orchestrator and transcoder.
Mitigates issue in the golang's implementation of HTTP2
golang/go#32441
@brianfife
Copy link
Author

@ernado @bcmills any updates on this?

@bcmills
Copy link
Contributor

bcmills commented Mar 2, 2020

@brianfife, I left a comment on @ernado's CL 210123 indicating what I think is the way forward: namely, we need to add an optional method that a RoundTripper can implement in order to decline a request without reading its Body.

@answer1991
Copy link

/cc

Could we fix this issue before next release? As an important feature(GOAWAY) in Kubernetes API Server rely on that.

@ernado
Copy link
Contributor

ernado commented May 20, 2020

No ETA from me for CL 210123, so please submit other CL if anybody wants, just check that commends from my CL are addressed.

Also current workaround is to use http2-only client.

@networkimprov
Copy link

See also https://golang.org/cl/234358 from #39086

@answer1991
Copy link

Thanks, let me see if there is anything I can do to fix this issue ASAP.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/234894 mentions this issue: net/http: handle body rewind in HTTP/2 connection loss better

@rsc rsc changed the title x/net/http: Re-connect with upgraded HTTP2 connection fails to send Request.body net/http: Re-connect with upgraded HTTP2 connection fails to send Request.body May 22, 2020
@rsc rsc modified the milestones: Unreleased, Go1.15 May 22, 2020
@rsc rsc added NeedsFix The path to resolution is known, but the work has not been done. release-blocker and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. help wanted labels May 22, 2020
@rsc
Copy link
Contributor

rsc commented May 22, 2020

This issue was incorrectly titled (x/net/http instead of net/http) and therefore incorrectly milestoned (Unreleased) and left out of Go 1.15 release planning. I've retiteld and remilestoned. Since this bug is affecting an important existing use case (Kubernetes), I've also marked it a release blocker. I also sent a fix.

The code change is fairly straightforward, but net/http itself is quite subtle, so it would be best to get the change into beta1 to get the most possible testing. Therefore I did not add the "okay-after-beta1" label.

@brianfife
Copy link
Author

brianfife commented May 22, 2020

Thanks for correcting the title, and submitting a patch - glad it's finally getting some traction!

@ayanamist
Copy link

This issue was incorrectly titled (x/net/http instead of net/http) and therefore incorrectly milestoned (Unreleased) and left out of Go 1.15 release planning. I've retiteld and remilestoned. Since this bug is affecting an important existing use case (Kubernetes), I've also marked it a release blocker. I also sent a fix.

The code change is fairly straightforward, but net/http itself is quite subtle, so it would be best to get the change into beta1 to get the most possible testing. Therefore I did not add the "okay-after-beta1" label.

Is it possible make this bugfix backport to old version like v1.14.x or v1.13.x ? Since kubernetes is still using go1.13 to build itself.

@answer1991
Copy link

Thanks @rsc for your PR, which is very similar with my idea to fix this issue.

@rsc
Copy link
Contributor

rsc commented May 22, 2020

Indeed, @answer1991, thanks for that CL as a starting point.

@networkimprov
Copy link

@gopherbot please backport, as Kubernetes is affected.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #39278 (for 1.13), #39279 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/242117 mentions this issue: [release-branch.go1.14] net/http: handle body rewind in HTTP/2 connection loss better

gopherbot pushed a commit that referenced this issue Aug 27, 2020
…tion loss better

In certain cases the HTTP/2 stack needs to resend a request.
It obtains a fresh body to send by calling req.GetBody.
This call was missing from the path where the HTTP/2
round tripper returns ErrSkipAltProtocol, meaning fall back
to HTTP/1.1. The result was that the HTTP/1.1 fallback
request was sent with no body at all.

This CL changes that code path to rewind the body before
falling back to HTTP/1.1. But rewinding the body is easier
said than done. Some requests have no GetBody function,
meaning the body can't be rewound. If we need to rewind and
can't, that's an error. But if we didn't read anything, we don't
need to rewind. So we have to track whether we read anything,
with a new ReadCloser wrapper. That in turn requires adding
to the couple places that unwrap Body values to look at the
underlying implementation.

This CL adds the new rewinding code in the main retry loop
as well.

The new rewindBody function also takes care of closing the
old body before abandoning it. That was missing in the old
rewind code.

Thanks to Aleksandr Razumov for CL 210123
and to Jun Chen for CL 234358, both of which informed
this CL.

Updates #32441.
Fixes #39279.

Change-Id: Id183758526c087c6b179ab73cf3b61ed23a2a46a
Reviewed-on: https://go-review.googlesource.com/c/go/+/234894
Run-TryBot: Russ Cox <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
Reviewed-by: Bryan C. Mills <[email protected]>
(cherry picked from commit e3491c4)
Reviewed-on: https://go-review.googlesource.com/c/go/+/242117
Run-TryBot: Dmitri Shuralyov <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Emmanuel Odeke <[email protected]>
@golang golang locked and limited conversation to collaborators Jul 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Projects
None yet
Development

No branches or pull requests