x/net/http2: connection-level flow control not returned if stream errors, causes server hang #40423
Comments
cc @fraenkel
Change https://golang.org/cl/245158 mentions this issue:
/cc @bradfitz @tombergan
Thanks for looking into this so quickly! Once the fix is merged, is there any way we could get this fix backported to 1.14 (or have it make its way into a 1.15 release)?
Hello, I suspect that I might have run into this issue when talking to GCS using the google-cloud-go package. It's a service that more or less proxies HTTP to GCS, and under quite high load it usually stops talking to GCS after ~15 min. When running with
Or is this something else, or am I doing something wrong?
Heya, just checking in post-1.15 release to see if we can get the CR above merged? I wouldn't be surprised if @wadells' issue above is the result of this bug, and I suspect this bug is affecting several other users (basically anyone that uses the gRPC serverHandlerTransport). cc @menghanl
Let me know if there's something I can help with, e.g. trying to create a small reproduction case. Does anyone know if this is a regression? Is it an issue with a specific combination of client and server implementation and version?
@wader it's super easy to test if the patch fixes your issue - you can just apply it to a fork of https://github.com/golang/net, and then use go modules' replace statement to depend upon your fork, as in the sketch below. If you can no longer repro the issue then this is the root cause!
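For example, a go.mod sketch (the module name and local path here are hypothetical; the fork could equally be referenced by its own module path and a pinned version):

```
module example.com/myservice

go 1.15

// Build against a local checkout of the patched golang.org/x/net fork
// instead of the upstream module.
replace golang.org/x/net => ../net
```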
@jared2501 Thanks, good idea. Tried with the patched net package (hope I patched it correctly, seems
Patch with HEAD at c89045814202410a2d67ec20ecf177ec77ceae7f
go.mod
Left it running for a while after the error and it seems to be able to service some requests again after a while. Not sure why it's exactly 10 minutes between the cancel and the new errors. Could be some timeout duration in the client using the service too, I guess?
Got ya, yeah sounds like it's not the issue then. FWIW, I've applied the linked patch to our production environments and have seen no further hangs. Before the patch, we were getting hangs once a day.
@networkimprov @cagedmantis @fraenkel - sorry for the ping here, but any way we could find a person to CR https://go-review.googlesource.com/c/net/+/245158/? The fix and test look simple enough to me, and I can confirm this fix works in our prod env.
@jared2501 OK, I get those errors after 30 min to 1 h. Are these kinds of errors "normal" to get if you run into throttling/quotas etc.?
@andybons @ianlancetaylor @rsc - just a quick ping on this. We've applied the above patch to our production environment and seen it fix the above issues. Would it be possible to get https://go-review.googlesource.com/c/net/+/245158/ merged and backported onto 1.15 and/or 1.14?
@gopherbot Please open backport issues. This problem can reportedly cause an HTTP/2 connection to hang. There doesn't seem to be any reasonable workaround.
Backport issue(s) opened: #41386 (for 1.14), #41387 (for 1.15). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
@ianlancetaylor - any chance we can find someone else to +1 the CR?
cc @odeke-em as possible reviewer
Thank you for the ping @networkimprov :) I've provided a review after Ian's, and it should be ready to roll out soon; then we'll need to pull it into net/http/h2_bundle.go
Woo, thank you! Can't wait to get this into the next 1.14/1.15 hotfixes!
Btw @networkimprov - does anything have to be done to get the cherry-picks made, or do those get picked up automatically?
If no changes are required for the 1.14 & 1.15 branches, I think the release folks can cherry-pick the CL. You might want to ping the CL author on the two backport issues to ask about it.
Change https://golang.org/cl/258359 mentions this issue:
Thank you @jared2501 and @networkimprov for raising it!
Change https://golang.org/cl/258478 mentions this issue:
Change https://golang.org/cl/258497 mentions this issue:
Updates x/net/http2 to git rev 5d4f7005572804eaf7f5ecdd2473a62557f733ba

    http2: send WINDOW_UPDATE on a body's write failure
    https://golang.org/cl/245158 (fixes #40423)

Also updates the vendored version of golang.org/x/net as per:

    $ go get golang.org/x/net@5d4f700557
    $ go mod tidy
    $ go mod vendor
    $ go generate -run bundle std

For #40423.

Change-Id: I3270d0fb6f28889266596f7365d36d30ef2bb368
Reviewed-on: https://go-review.googlesource.com/c/go/+/258359
Run-TryBot: Emmanuel Odeke <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Trust: Emmanuel Odeke <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Change https://golang.org/cl/258537 mentions this issue:
Change https://golang.org/cl/258538 mentions this issue:
Change https://golang.org/cl/258540 mentions this issue:
Awesome, thank you @odeke-em & @networkimprov for your quick responses and effort pushing this through!
…te failure

When the body.Write fails during processData, the connection flow control must be updated to account for the data received. The connection's WINDOW_UPDATE should reflect the amount of data that was not successfully written. The stream is about to be closed, so no update is required.

Updates golang/go#40423.
Fixes golang/go#41387.

Change-Id: I546597cedf3715e6617babcb3b62140bf1857a27
Reviewed-on: https://go-review.googlesource.com/c/net/+/245158
Reviewed-by: Emmanuel Odeke <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Emmanuel Odeke <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Trust: Emmanuel Odeke <[email protected]>
(cherry picked from commit 5d4f700)
Reviewed-on: https://go-review.googlesource.com/c/net/+/258478
Run-TryBot: Dmitri Shuralyov <[email protected]>
Trust: Dmitri Shuralyov <[email protected]>
…te failure

When the body.Write fails during processData, the connection flow control must be updated to account for the data received. The connection's WINDOW_UPDATE should reflect the amount of data that was not successfully written. The stream is about to be closed, so no update is required.

Updates golang/go#40423.
Fixes golang/go#41386.

Change-Id: I546597cedf3715e6617babcb3b62140bf1857a27
Reviewed-on: https://go-review.googlesource.com/c/net/+/245158
Reviewed-by: Emmanuel Odeke <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Emmanuel Odeke <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Trust: Emmanuel Odeke <[email protected]>
(cherry picked from commit 5d4f700)
Reviewed-on: https://go-review.googlesource.com/c/net/+/258497
Trust: Dmitri Shuralyov <[email protected]>
Run-TryBot: Dmitri Shuralyov <[email protected]>
Change https://golang.org/cl/261919 mentions this issue:
Change https://golang.org/cl/264058 mentions this issue:
…y's write failure

When the body.Write fails during processData, the connection flow control must be updated to account for the data received. The connection's WINDOW_UPDATE should reflect the amount of data that was not successfully written. The stream is about to be closed, so no update is required.

Updates golang/go#40423.
For golang/go#41387.

Change-Id: I546597cedf3715e6617babcb3b62140bf1857a27
Reviewed-on: https://go-review.googlesource.com/c/net/+/245158
Reviewed-by: Emmanuel Odeke <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Emmanuel Odeke <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Trust: Emmanuel Odeke <[email protected]>
(cherry picked from commit 5d4f700)
Reviewed-on: https://go-review.googlesource.com/c/net/+/264058
Run-TryBot: Dmitri Shuralyov <[email protected]>
Reviewed-by: Alexander Rakoczy <[email protected]>
Trust: Dmitri Shuralyov <[email protected]>
…undle.go

Features CL:

    net/http2: send WINDOW_UPDATE on a body's write failure (fixes #41386)
    https://golang.org/cl/258497

Created by:

    go get -d golang.org/x/[email protected]
    go mod tidy
    go mod vendor
    go generate -run=bundle std

Updates #40423
Fixes #41386

Change-Id: I3e75527d381dd4c4262db5f2ff755029d448c48b
Reviewed-on: https://go-review.googlesource.com/c/go/+/258538
Run-TryBot: Emmanuel Odeke <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Trust: Emmanuel Odeke <[email protected]>
TryBot-Result: Go Bot <[email protected]>
…undle.go

Features CL:

    net/http2: send WINDOW_UPDATE on a body's write failure
    https://golang.org/cl/258478 (updates #41387)

Created by:

    go mod edit -replace=golang.org/x/net=golang.org/x/[email protected]
    GOFLAGS='-mod=mod' go generate -run=bundle std
    go mod edit -dropreplace=golang.org/x/net
    go get -d golang.org/x/[email protected]
    go mod tidy
    go mod vendor

Updates #40423
Fixes #41387

Change-Id: I052037d6b6ed38b9d9782e19b8ce283875354c92
Reviewed-on: https://go-review.googlesource.com/c/go/+/258540
Run-TryBot: Emmanuel Odeke <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Trust: Emmanuel Odeke <[email protected]>
There are several kubernetes bugs [0,1,2] involving connection problems that seem related to the Go net/http2 library, where the stream state and connection state can get out of sync. This can manifest as a kubelet issue, where the node status gets stuck in a NotReady state, but can also happen elsewhere. In newer versions of the Go libraries some issues are fixed [3,4], but the fixes are not present in k8s 1.18.

This change disables http2 in kube-apiserver and webhook-apiserver. This should be sufficient to avoid the majority of the issues, as disabling on one side of the connection is enough, and apiserver is generally either the client or the server.

0: kubernetes/kubernetes#87615
1: kubernetes/kubernetes#80313
2: kubernetes/client-go#374
3: golang/go#40423
4: golang/go#40201

Change-Id: Id693a7201acffccbc4b3db8f4e4b96290fd50288
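For anyone wanting the same workaround in a plain Go service, a minimal sketch of disabling HTTP/2 on the server side (the handler and the certificate/key paths are placeholders): setting Server.TLSNextProto to a non-nil empty map keeps the server on HTTP/1.1, and GODEBUG=http2server=0,http2client=0 is the environment-variable equivalent.

```go
package main

import (
	"crypto/tls"
	"io"
	"net/http"
)

func main() {
	srv := &http.Server{
		Addr: ":8443",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, "hello\n")
		}),
		// A non-nil, empty TLSNextProto map disables the automatic HTTP/2
		// upgrade, so clients negotiate HTTP/1.1 over TLS instead.
		TLSNextProto: map[string]func(*http.Server, *tls.Conn, http.Handler){},
	}
	// Placeholder certificate and key paths.
	if err := srv.ListenAndServeTLS("server.crt", "server.key"); err != nil {
		panic(err)
	}
}
```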
When the body.Write fails during processData, the connection flow control must be updated to account for the data received. The connection's WINDOW_UPDATE should reflect the amount of data that was not successfully written. The stream is about to be closed, so no update is required.

Fixes golang/go#40423

Change-Id: I546597cedf3715e6617babcb3b62140bf1857a27
Reviewed-on: https://go-review.googlesource.com/c/net/+/245158
Reviewed-by: Emmanuel Odeke <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Emmanuel Odeke <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Trust: Emmanuel Odeke <[email protected]>
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What's the issue?
A hang was reported between a gRPC client (grpc-go v1.27.0) hitting a gRPC server in one of our production environments. The client and server are both running on the same host. I captured core dumps of the client and the server to analyze with delve. I noticed that google.golang.org/grpc/internal/transport.loopyWriter.cbuf.sendQuota was 0 in the client, which indicates that the client's connection-level send window had run out. In the server's core dump, I tracked down the corresponding http2.serverConn and noticed that its serverConn.inflow.n was set to 0 too. I then tracked down the two places in http2/server.go that call inflow.take and noticed what I believe is the issue in processData.
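The relevant part of processData looks roughly like this (an abridged sketch of x/net/http2's server.go around that revision; surrounding checks are omitted and details may differ slightly):

```go
if f.Length > 0 {
	// Check whether the client has flow control quota.
	if st.inflow.available() < int32(f.Length) {
		return streamError(id, ErrCodeFlowControl)
	}
	st.inflow.take(int32(f.Length))

	if len(data) > 0 {
		wrote, err := st.body.Write(data)
		if err != nil {
			// Bails out here: the connection-level window consumed for
			// f.Length is never returned to the peer.
			return streamError(id, ErrCodeStreamClosed)
		}
		if wrote != len(data) {
			panic("internal error: bad Writer")
		}
	}
}
```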
In this code, st.inflow.take is called, but if st.body.Write returns an error then the flow control is not refunded to the client, since the code bails out and returns a streamError (nor is the data added to st.body's pipeBuffer, since pipe.Write returns immediately if it has an error to return).

Side note: st.body.Write may return an error if st.body.Close is called. The server which had this issue is using grpc-go's serverHandlerTransport, which does, in fact, call req.Body.Close (see here). A gRPC bi-directional streaming endpoint is running between the client and server, and what I suspect is happening is that the client is sending the server data over the bi-di stream while an error happens in the gRPC server that causes the request to end, and therefore req.Body.Close to be called while data is in flight.

Here's what I think a possible fix to net/http2 could look like:
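A sketch of one possible shape for the fix (helper names follow x/net/http2's server.go, where sc.sendWindowUpdate(nil, n) returns n bytes to the connection-level window; the exact placement is an assumption): refund the unwritten portion of the DATA frame before resetting the stream.

```go
if len(data) > 0 {
	wrote, err := st.body.Write(data)
	if err != nil {
		// Give the connection-level flow control back for the bytes that
		// were not written; the stream itself is about to be reset, so no
		// stream-level update is needed.
		sc.sendWindowUpdate(nil, int(f.Length)-wrote)
		return streamError(id, ErrCodeStreamClosed)
	}
	if wrote != len(data) {
		panic("internal error: bad Writer")
	}
}
```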