Closing connection takes up to 15 minutes. #7314

Closed
fpetkovski opened this issue Jun 9, 2024 · 5 comments · Fixed by #7371

@fpetkovski

fpetkovski commented Jun 9, 2024

What version of gRPC are you using?

1.64.0

What version of Go are you using (go version)?

go1.22.3

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

After upgrading to gRPC 1.64, closing connections started taking a very long time, around 15 minutes. This happens when the server side abruptly goes away and the TCP connection breaks on only one end.

I captured a goroutine profile while the application was trying to close one such connection. What stands out are the following two stacks:

#	0xbd151a	google.golang.org/grpc/internal/transport.(*http2Client).Close+0x2ba	/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:1012
#	0xc21225	google.golang.org/grpc.(*addrConn).tearDown+0x325			/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:1590
#	0xc1e627	google.golang.org/grpc.(*ClientConn).Close+0x207			/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:1154

and

1 @ 0x43e48e 0x436f17 0x46bb65 0x4e53e7 0x4e87b6 0x4e87a3 0x589565 0x59b525 0xbe2876 0xbe275c 0xbc2570 0xbcbd5b 0x4716c1
#	0x46bb64	internal/poll.runtime_pollWait+0x84						/usr/local/go/src/runtime/netpoll.go:343
#	0x4e53e6	internal/poll.(*pollDesc).wait+0x26						/usr/local/go/src/internal/poll/fd_poll_runtime.go:84
#	0x4e87b5	internal/poll.(*pollDesc).waitWrite+0x2d5					/usr/local/go/src/internal/poll/fd_poll_runtime.go:93
#	0x4e87a2	internal/poll.(*FD).Write+0x2c2							/usr/local/go/src/internal/poll/fd_unix.go:388
#	0x589564	net.(*netFD).Write+0x24								/usr/local/go/src/net/fd_posix.go:96
#	0x59b524	net.(*conn).Write+0x44								/usr/local/go/src/net/net.go:191
#	0xbe2875	google.golang.org/grpc/internal/transport.(*bufWriter).flushKeepBuffer+0x55	/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http_util.go:362
#	0xbe275b	google.golang.org/grpc/internal/transport.(*bufWriter).Flush+0x1b		/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http_util.go:345
#	0xbc256f	google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x6f		/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:592
#	0xbcbd5a	google.golang.org/grpc/internal/transport.newHTTP2Client.func6+0xba		/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:467

I believe the (potential) bug was introduced on these lines, where the client tries to send a GOAWAY frame to the server before closing the connection. If the connection is half-closed, the call hangs for about 15 minutes, which is the effective timeout for net.(*conn).Write under default Linux TCP settings.
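To illustrate the failure mode being described (this is not grpc-go code), here is a minimal sketch of how a plain net.Conn Write behaves against a peer that has silently vanished; the address, the "firewall drops everything" assumption, and the buffer size are all hypothetical:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Hypothetical peer that accepts the connection and then goes silent
	// (e.g. the host is partitioned or its packets are dropped by a
	// firewall), so no FIN or RST ever reaches us.
	conn, err := net.Dial("tcp", "192.0.2.10:50051")
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	buf := make([]byte, 64*1024)
	start := time.Now()
	for {
		// Once the kernel send buffer fills up, Write blocks. With no
		// write deadline set, it only returns after TCP gives up
		// retransmitting, roughly 15 minutes with default Linux settings.
		if _, err := conn.Write(buf); err != nil {
			log.Printf("write failed after %v: %v", time.Since(start), err)
			return
		}
	}
}
```

If that is what is happening here, the GOAWAY flush inside Close is sitting on exactly such a blocked Write.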

What did you expect to see?

I would expect a more reasonable timeout for closing a connection, or perhaps a way for the client to control the timeout.

What did you see instead?

Closing connections takes a very long time.

@atollena
Collaborator

Could you tell us which version you used before, where you were seeing different behavior? Are you actually calling Close() on the gRPC client/channel, and it blocks for 15 minutes? I wouldn't expect the change you pointed out to alter the behavior when the connection is not gracefully closed, but I could be missing something.

15 minutes is roughly the time it takes, with typical Linux default settings, to detect a silently dropped connection, unless gRPC keepalives are enabled. So the behavior you are seeing is expected if there are outstanding RPCs (more details in, e.g., https://www.evanjones.ca/tcp-connection-timeouts.html). It has been like this forever, I think, so I suspect you're talking about something different?
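For reference, a minimal sketch of enabling client-side gRPC keepalives so that silently dropped connections are detected in seconds rather than minutes; the target address and the interval values are only illustrative:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	cc, err := grpc.Dial("example.internal:50051", // illustrative target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after this much idle time
			Timeout:             10 * time.Second, // fail the transport if no ack arrives
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer cc.Close()
}
```

Note that the server's keepalive.EnforcementPolicy has to allow pings at this rate, otherwise the server will close the connection with a "too_many_pings" GOAWAY.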

@atollena
Collaborator

Are you actually calling Close() on the gRPC client/channel and it's blocking for 15 minutes?

(OK I see in your goroutine capture that you are closing the channel)

@fpetkovski
Author

Thanks for your reply. We used to be on 1.62.1, and I don't think the Close method was blocking on dead connections. It started to do so because of the client-side GOAWAY. As can be seen in the first goroutine stack, Close hangs on line 1012 in http2_client.go, which is exactly where the client waits for the GOAWAY frame to be sent. This behavior was introduced in #7015 and was not present before.

@atollena
Collaborator

atollena commented Jun 13, 2024

OK, thank you for the detailed report. The problem can occur when a call to Write blocks, which I think happens when the OS socket buffer is full. This is more likely when the connection is silently dropped.

When we attempt to flush the control buffer in the call to ClientConn.Close, we used to immediately call conn.Close(), which would make any blocked read or write fail immediately (including writes of previously queued data that may belong to RPCs, e.g. in the case of a client stream). Now we want to give loopy time to flush the GOAWAY frame, and the obvious way to do that was to wait for all writes to finish, but I'm not sure we thought about the case where loopy is blocked writing to the socket.

One option that gives loopy some time to return, without waiting for the operating system to fail the socket, would be to set a short deadline on the transport connection via t.conn.SetWriteDeadline, perhaps in func (t *http2Client) Close(err error). The next question is how to decide on an appropriate value for the write deadline...
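A minimal sketch of that idea, using hypothetical names rather than the actual http2Client fields, and an arbitrary illustrative deadline:

```go
// Package transportsketch is hypothetical and only illustrates the proposal:
// bound the GOAWAY flush with a write deadline so that closing a transport
// cannot block on a dead socket.
package transportsketch

import (
	"net"
	"time"
)

// goAwayWriteTimeout is purely illustrative; picking an appropriate value
// is exactly the open question above.
const goAwayWriteTimeout = 5 * time.Second

func closeWithWriteDeadline(conn net.Conn, writerDone <-chan struct{}) error {
	// After this, any Write issued by the writer goroutine fails with a
	// timeout error once the deadline passes, instead of blocking until
	// the OS gives up on the connection.
	if err := conn.SetWriteDeadline(time.Now().Add(goAwayWriteTimeout)); err != nil {
		return conn.Close()
	}
	<-writerDone // the writer exits once its final Write returns (or times out)
	return conn.Close()
}
```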

@easwars
Contributor

easwars commented Jun 25, 2024

Quoting from https://www.rfc-editor.org/rfc/rfc9113.html#name-connection-error-handling:

    It is possible that the GOAWAY will not be reliably received by the receiving endpoint. In the event of a connection error, GOAWAY only provides a best-effort attempt to communicate with the peer about why the connection is being terminated.

    An endpoint can end a connection at any time. In particular, an endpoint MAY choose to treat a stream error as a connection error. Endpoints SHOULD send a GOAWAY frame when ending a connection, providing that circumstances permit it.

Maybe our sending of GOAWAY on close should be best-effort, but I'm not sure there is a way to do that without setting a write deadline on the underlying conn.
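One common pattern for making the flush best-effort, sketched below with hypothetical names: wait for the writer goroutine only up to a bound, then close the conn regardless, since closing the net.Conn unblocks any pending Write. Whether this or a write deadline is preferable is exactly the open question here.

```go
package transportsketch // hypothetical, for illustration only

import (
	"net"
	"time"
)

// bestEffortClose waits a bounded amount of time for the writer goroutine
// (which is flushing the GOAWAY frame) and then closes the connection
// regardless. Closing the net.Conn makes a blocked Write return immediately,
// so GOAWAY delivery becomes best-effort, in line with the SHOULD in RFC 9113.
func bestEffortClose(conn net.Conn, writerDone <-chan struct{}, wait time.Duration) error {
	timer := time.NewTimer(wait)
	defer timer.Stop()

	select {
	case <-writerDone:
		// The GOAWAY was flushed (or the writer gave up) in time.
	case <-timer.C:
		// The writer is stuck (e.g. dead peer, full send buffer); give up
		// on delivering the GOAWAY and tear the connection down anyway.
	}
	return conn.Close()
}
```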
