High memory usage when XDS is used #7657
Comments
@snewell could you attach the *.pb.gz file?
@clement2026 - I attached the heap profile from pprof. I'm not familiar with pb.gz, so if that's something else let me know and I'll try to get what you need. Thanks!
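(For reference, the *.pb.gz file is just pprof's default output format: a gzip-compressed protobuf profile. A sketch of exposing the standard net/http/pprof endpoints, assuming the demo doesn't already do this:)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// The heap profile served at /debug/pprof/heap is already a
	// gzip-compressed protobuf, i.e. the *.pb.gz format that
	// `go tool pprof` reads and writes.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```

Fetching http://localhost:6060/debug/pprof/heap (e.g. with curl) and saving it as heap.pb.gz gives a file that go tool pprof can open directly.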
I think you need a defer cc.Close() right here: https://github.com/snewell/memleak-demo/blob/master/cmd/client.go#L48. In https://github.com/snewell/memleak-demo/blob/master/run_test.sh#L3 you run the client in a while-true loop, and the client spawns 10 workers which each create a ClientConn and then make 20 streaming RPCs serially. The RPCs finish, so the RPC and related state gets cleaned up server side, but the top-level ClientConn is never closed. This leaves the created HTTP/2 connections around on the server, which aligns with the pprof output: Accepts and NewServerTransports account for a lot of the allocated memory.
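In other words, roughly this shape (an illustrative sketch, not the actual demo code; the target and worker names are made up):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// worker sketches the pattern described above: each worker owns one
// ClientConn, so it needs to close it when it's done; otherwise the
// server keeps the HTTP/2 connection (and its transport state) alive.
func worker(target string) error {
	cc, err := grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer cc.Close() // the missing cleanup suggested above

	// ... make the 20 streaming RPCs over cc here ...
	return nil
}

func main() {
	if err := worker("localhost:50051"); err != nil {
		log.Fatal(err)
	}
}
```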
@zasweq - I'll add the call to Close and see if it makes a difference. Would there still be an issue, though, since the server code isn't handling a misbehaving client well? The internal product where we first saw this uses persistent connections, if that helps, so there aren't new ones being created regularly (it was observed on a totally idle deployment just doing health checks). My test client is a bit more aggressive just because that made it manifest faster.
And not sure if this is relevant, but in the real product I've turned on keepalive settings to try and prune idle connections. I didn't include it here in an effort to make the reproducible example as minimal as possible, but happy to add that code as well if you think it'll make a difference.
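(For anyone reading along, server-side keepalive in grpc-go looks roughly like this; a sketch with placeholder durations, not the settings actually used in the product.)

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	// Prune idle connections and detect dead peers; the durations
	// below are placeholders.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle: 5 * time.Minute,  // close connections with no active RPCs for this long
		Time:              2 * time.Minute,  // ping the client after this much inactivity
		Timeout:           20 * time.Second, // drop the connection if the ping isn't acked in time
	}))
	log.Fatal(srv.Serve(lis))
}
```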
"Would there still be an issue though since server code isn't handling a misbehaving client well?" - yeah this is another one of my thoughts. It does worry me that you didn't see this on 1.60 and there's an xDS/non xDS distinction, hopefully it's because our listener/connection wrappers just create more state? |
Yeah, the OS should close the HTTP/2 connection, so this is probably a real leak.
"but in the real product I've turned on keepalive settings to try and prune idle connections. I didn't include it here in an effort to make the reproducible example as minimal as possible, but happy to add that code as well if you think it'll make a difference." - alright yeah sorry I don't know too much about operating systems but knowing the OS should be responsible for closing the Conns if the binary exits there's a leak server side. We suspect it's something in the Conn Wrapper that the xDS Server creates. I'll continue to try and debug this, thank you for bringing this up and creating a reproducible test case. |
I'm a dev too, so I know how much a reproducible test case helps :). I'll deploy the updated client later today (corporate firewall rules while I'm in the office) and update with results tomorrow (my initial results posted here took about 12 hours). If there's anything else you want me to test or tweak in the code, let me know and I'll help any way I can.
I think I see what's happening. I grab a ref to the server transport (with the reader and writer frames) here: https://github.com/grpc/grpc-go/pull/6915/files#diff-dd56a1b7688625b5b70cd616b08c301d12f7f01edbd9be95c506743fd58a6155R140. I never clear this reference if the connection wrapper lives around. So when gRPC calls Close on the connection, it continues to live on in this slice: https://github.com/grpc/grpc-go/pull/6915/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R144. The listener stays around, and it only gives up the wrapper ref on a state update or filter chain update. So the solution here is to keep track of connection closing in the listener wrapper, and remove its ref when the wrapped connection closes.
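To illustrate the ownership pattern being described (a hypothetical sketch, not the actual grpc-go code): the listener wrapper keeps a reference to every accepted connection wrapper, so the wrapped conn has to remove itself from the listener when it closes, or the transport state it points at can never be collected.

```go
package main

import (
	"net"
	"sync"
)

// listenerWrapper is a hypothetical stand-in for the xDS listener
// wrapper; it holds a reference to every accepted connection.
type listenerWrapper struct {
	mu    sync.Mutex
	conns map[*connWrapper]struct{}
}

func (l *listenerWrapper) removeConn(c *connWrapper) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.conns, c) // without this, closed conns (and the transports they reference) leak
}

// connWrapper is a hypothetical stand-in for the xDS server's wrapped
// connection, which also references server transport state.
type connWrapper struct {
	net.Conn
	parent *listenerWrapper
}

// Close drops the listener's reference before closing the underlying
// conn, which is the gist of the fix described above.
func (c *connWrapper) Close() error {
	c.parent.removeConn(c)
	return c.Conn.Close()
}

func main() {}
```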
Sent #7664, which I think should fix it; writing a unit test now.
I deployed with the tip of your branch (v1.64.0-dev.0.20240923205304-082ee300e801 in go.mod). Thanks for the fast turnaround!
Alright, awesome, let me know if it helps!
So far so good. It's been live for about an hour and memory is flat. Really appreciate how fast you jumped on this. Thanks a ton!
Awesome, glad to hear!
The patch will be part of 1.67.1, 1.66.3, and 1.65.1.
What version of gRPC are you using?
Verified with multiple versions from v1.61.0 through v1.66.2
What version of Go are you using (go version)?
Tested with multiple Go versions, including 1.21.11 and 1.22.6
What operating system (Linux, Windows, …) and version?
Docker containers based on two versions of Linux: Ubuntu 22 and Red Hat 9
What did you do?
Memory usage increases seemingly unchecked on a server when XDS is enabled and there are inbound requests. This does not happen without XDS, on a completely idle server, or with grpc-go < v1.61.0 (verified with v1.60.0 and v1.60.1). A full reproducible example with server, client, and protobufs (all stripped down to what I believe is the minimum) is available here: https://github.com/snewell/memleak-demo.
We're using an Istio sidecar for XDS. I don't have access to another XDS provider (e.g., Traffic Director). I verified the memory usage is in my gRPC server, not the sidecar (pprof output included below). Without XDS, memory usage is flat, even with Istio enabled.
Created a stripped-down server and protobuf that does nothing other than provide an endpoint. With a client connecting and sending requests every 5 seconds, memory increased. This was also observed in a more complex system, but the stripped-down version is included here. Relevant code:
Protos
Server
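(The actual server code lives in the linked repo; purely for orientation, here is a minimal hypothetical sketch of an xDS-enabled gRPC server serving the standard health service, assuming the usual xds.NewGRPCServer and xDS credentials APIs and a bootstrap file supplied by the environment, not the reporter's exact code.)

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	xdscreds "google.golang.org/grpc/credentials/xds"
	"google.golang.org/grpc/health"
	healthgrpc "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/xds"
)

func main() {
	// xDS-aware server credentials, falling back to plaintext when the
	// control plane sends no security configuration.
	creds, err := xdscreds.NewServerCredentials(xdscreds.ServerOptions{
		FallbackCreds: insecure.NewCredentials(),
	})
	if err != nil {
		log.Fatalf("creating xDS credentials: %v", err)
	}

	// The xDS server reads its bootstrap config from the environment
	// (typically a bootstrap file provided by the xDS sidecar/provider).
	srv, err := xds.NewGRPCServer(grpc.Creds(creds))
	if err != nil {
		log.Fatalf("creating xDS server: %v", err)
	}
	healthgrpc.RegisterHealthServer(srv, health.NewServer())

	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listening: %v", err)
	}
	log.Fatal(srv.Serve(lis))
}
```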
What did you expect to see?
Memory shouldn't increase this much.
What did you see instead?
Very high memory usage, enough that autoscalers and the OOM killer kick in on the real project.
Pprof output