High memory usage when XDS is used #7657

Closed · snewell opened this issue Sep 21, 2024 · 17 comments

snewell commented Sep 21, 2024

What version of gRPC are you using?

Verified with multiple versions from v1.61.0 through v1.66.2

What version of Go are you using (go version)?

Tested with multiple Go versions, including 1.21.11 and 1.22.6

What operating system (Linux, Windows, …) and version?

Docker containers based on two versions of Linux: Ubuntu 22 and Red Hat 9

What did you do?

Memory usage increases seemingly unchecked on a server when XDS is enabled and there are inbound requests. This does not happen without XDS, on a completely idle server, or with grpc-go < v1.61.0 (verified with v1.60.0 and v1.60.1). A full reproducible example with server, client, and protobufs (all stripped to what I believe is the minimum) is available here: https://github.com/snewell/memleak-demo.

We're using an Istio sidecar for XDS. I don't have access to another XDS provider (e.g., Traffic Director). I verified that the memory usage is in my gRPC server, not the sidecar (pprof output included below). Without XDS, memory usage is flat, even with Istio enabled.

I created a stripped-down server and protobuf that does nothing other than provide an endpoint. With a client connecting and sending requests every 5 seconds, memory increased. This was also observed in a more complex system, but the stripped-down version is included here. Relevant code:

Protos

syntax = "proto3";

option go_package = "github.com/snewell/memleak-demo/internal/pb";

package service;

service SS {
  rpc DoIt (Request) returns (stream Response) {}
}

message Request {
}

message Response {
}

Server

type ss struct {
	pb.UnimplementedSSServer
}

func (s ss) DoIt(*pb.Request, grpc.ServerStreamingServer[pb.Response]) error {
	return nil
}

func serverFn(cmd *cobra.Command, args []string) {
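	// Expose pprof handlers on :6060 (assumes a blank import of net/http/pprof in this package).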
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()

	listenAddr := fmt.Sprintf(":%v", listenPort)
	log.Printf("Listening on %v", listenAddr)
	lis, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	creds := insecure.NewCredentials()
	if creds, err = xdscreds.NewServerCredentials(xdscreds.ServerOptions{FallbackCreds: creds}); err != nil {
		log.Fatalf("failed to create server-side xDS credentials: %v", err)
	}

	xdsServer, err := xds.NewGRPCServer(grpc.Creds(creds))
	if err != nil {
		log.Fatalf("failed to create xds server: %v", err)
	}
	s := ss{}
	pb.RegisterSSServer(xdsServer, s)
	log.Printf("Starting server...")
	if err := xdsServer.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}

What did you expect to see?

Memory usage should stay roughly flat, as it does without XDS.

What did you see instead?

Very high memory usage, enough that autoscalers and the OOM killer kick in on the real project.

Pprof output

File: memleak-demo
Type: inuse_space
Time: Sep 21, 2024 at 6:17pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 6546.80MB, 97.37% of 6723.83MB total
Dropped 85 nodes (cum <= 33.62MB)
Showing top 20 nodes out of 34
      flat  flat%   sum%        cum   cum%
 3128.52MB 46.53% 46.53%  3128.52MB 46.53%  bufio.NewReaderSize (inline)
 3117.12MB 46.36% 92.89%  3117.12MB 46.36%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
  153.54MB  2.28% 95.17%   153.54MB  2.28%  golang.org/x/net/http2/hpack.(*headerFieldTable).addEntry (inline)
  109.03MB  1.62% 96.79%  6431.67MB 95.65%  google.golang.org/grpc/internal/transport.NewServerTransport
   17.59MB  0.26% 97.05%    48.59MB  0.72%  google.golang.org/grpc/xds/internal/server.(*listenerWrapper).Accept
      16MB  0.24% 97.29%       50MB  0.74%  google.golang.org/grpc/internal/transport.newLoopyWriter
       3MB 0.045% 97.34%  6300.15MB 93.70%  google.golang.org/grpc/internal/transport.newFramer
    1.50MB 0.022% 97.36%   100.53MB  1.50%  golang.org/x/net/http2.(*Framer).readMetaFrame
    0.50MB 0.0074% 97.37%   128.52MB  1.91%  google.golang.org/grpc/internal/transport.NewServerTransport.func2
         0     0% 97.37%    48.59MB  0.72%  github.com/snewell/memleak-demo/cmd.Execute
         0     0% 97.37%    48.59MB  0.72%  github.com/snewell/memleak-demo/cmd.serverFn
         0     0% 97.37%    48.59MB  0.72%  github.com/spf13/cobra.(*Command).Execute (inline)
         0     0% 97.37%    48.59MB  0.72%  github.com/spf13/cobra.(*Command).ExecuteC
         0     0% 97.37%    48.59MB  0.72%  github.com/spf13/cobra.(*Command).execute
         0     0% 97.37%   109.03MB  1.62%  golang.org/x/net/http2.(*Framer).ReadFrame
         0     0% 97.37%    99.03MB  1.47%  golang.org/x/net/http2/hpack.(*Decoder).Write
         0     0% 97.37%    99.03MB  1.47%  golang.org/x/net/http2/hpack.(*Decoder).parseFieldLiteral
         0     0% 97.37%    99.03MB  1.47%  golang.org/x/net/http2/hpack.(*Decoder).parseHeaderFieldRepr
         0     0% 97.37%    66.02MB  0.98%  golang.org/x/net/http2/hpack.(*Encoder).WriteField
         0     0% 97.37%   153.54MB  2.28%  golang.org/x/net/http2/hpack.(*dynamicTable).add
clement2026 (Contributor) commented:

@snewell could you attach the *.pb.gz file?

snewell (Author) commented Sep 22, 2024

memleak-heap.pprof.gz

@clement2026 - I attached the heap profile from pprof. I'm not familiar with pb.gz, so if that's something else let me know and I'll try to get what you need. Thanks!

dfawley (Member) commented Sep 23, 2024

My guess, based on the timeline and area, is that this is related to #6915 - @zasweq, can you take a look?

zasweq (Contributor) commented Sep 23, 2024

I think you need a defer cc.Close() right here: https://github.com/snewell/memleak-demo/blob/master/cmd/client.go#L48. In https://github.com/snewell/memleak-demo/blob/master/run_test.sh#L3 you run the client in a while true loop, and the client spawns 10 workers which each create a ClientConn and then make 20 streaming RPCs serially. The RPCs finish, so the RPC and related state gets cleaned up server side, but the top-level ClientConn is never closed. This leaves the created HTTP/2 connections around on the server, which aligns with the pprof output, where Accept and NewServerTransport account for a lot of the allocated memory.
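
For illustration, a minimal sketch of the client loop described above with the missing Close added (the target address, plaintext credentials, and loop counts are assumptions for the sketch; the real client is in the linked memleak-demo repo):

package main

import (
	"context"
	"log"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "github.com/snewell/memleak-demo/internal/pb"
)

func worker(target string) {
	cc, err := grpc.NewClient(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to create ClientConn: %v", err)
	}
	// The missing piece: without this, the underlying HTTP/2 connection (and the
	// server transport built for it) stays alive after all RPCs have finished.
	defer cc.Close()

	client := pb.NewSSClient(cc)
	for i := 0; i < 20; i++ {
		stream, err := client.DoIt(context.Background(), &pb.Request{})
		if err != nil {
			log.Printf("DoIt failed: %v", err)
			return
		}
		// Drain the stream so the RPC completes cleanly.
		for {
			if _, err := stream.Recv(); err != nil {
				break
			}
		}
	}
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			worker("localhost:50051")
		}()
	}
	wg.Wait()
}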

snewell (Author) commented Sep 23, 2024

@zasweq - I'll add the call to Close and see if it makes a difference. Would there still be an issue, though, since the server code isn't handling a misbehaving client well? The internal product where we first saw this uses persistent connections, if that helps, so there aren't new ones being created regularly (it was observed on a totally idle deployment just doing health checks). My test client is a bit more aggressive just because that made it manifest faster.

snewell (Author) commented Sep 23, 2024

I'm not sure if this is relevant, but in the real product I've turned on keepalive settings to try and prune idle connections. I didn't include it here in an effort to make the reproducible example as minimal as possible, but happy to add that code as well if you think it'll make a difference.
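
For reference, server-side keepalive of the kind mentioned here looks something like the sketch below (the function name and option values are illustrative assumptions, not the real product's configuration):

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
	"google.golang.org/grpc/xds"
)

// newXDSServer builds the xDS-enabled server with keepalive enforcement so that
// idle HTTP/2 connections are eventually torn down by the server itself.
func newXDSServer(creds grpc.ServerOption) (*xds.GRPCServer, error) {
	return xds.NewGRPCServer(
		creds,
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionIdle: 5 * time.Minute,  // close connections with no active RPCs after 5 minutes
			Time:              2 * time.Minute,  // ping a client that has been idle this long
			Timeout:           20 * time.Second, // drop the connection if the ping isn't acknowledged in time
		}),
	)
}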

zasweq (Contributor) commented Sep 23, 2024

"Would there still be an issue though since server code isn't handling a misbehaving client well?" - yeah this is another one of my thoughts. It does worry me that you didn't see this on 1.60 and there's an xDS/non xDS distinction, hopefully it's because our listener/connection wrappers just create more state?

zasweq (Contributor) commented Sep 23, 2024

Yeah, the OS should close the HTTP/2 connection, so this is probably a real leak.

zasweq (Contributor) commented Sep 23, 2024

"but in the real product I've turned on keepalive settings to try and prune idle connections. I didn't include it here in an effort to make the reproducible example as minimal as possible, but happy to add that code as well if you think it'll make a difference." - alright yeah sorry I don't know too much about operating systems but knowing the OS should be responsible for closing the Conns if the binary exits there's a leak server side. We suspect it's something in the Conn Wrapper that the xDS Server creates. I'll continue to try and debug this, thank you for bringing this up and creating a reproducible test case.

snewell (Author) commented Sep 23, 2024

I'm a dev too, so I know how much a reproducible test case helps :). I'll deploy the updated client later today (corporate firewall rules while I'm in the office) and update with results tomorrow (my initial results posted here took about 12 hours). If there's anything else you want me to test or tweak in the code, let me know and I'll help any way I can.

zasweq (Contributor) commented Sep 23, 2024

I think I see what's happening. I grab a ref to the server transport (with the reader and writer frames) here: https://github.com/grpc/grpc-go/pull/6915/files#diff-dd56a1b7688625b5b70cd616b08c301d12f7f01edbd9be95c506743fd58a6155R140. I never clear this reference while the connection wrapper sticks around. So when gRPC calls Close on the connection, it continues to live in this slice: https://github.com/grpc/grpc-go/pull/6915/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R144. The listener stays around, and it only gives up the wrapper ref on a state update or a filter chain update. So the solution here is to track connection closing in the listener wrapper and remove its ref when the wrapped connection closes.
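
A simplified sketch of that fix pattern (not the actual grpc-go internals; the type and field names here are made up for illustration): the listener wrapper tracks every conn it hands out from Accept, and each wrapped conn removes itself from that set when it is closed.

import (
	"net"
	"sync"
)

// listenerWrapper is a stand-in for the xDS server's listener wrapper: it keeps
// track of every conn it has accepted.
type listenerWrapper struct {
	net.Listener
	mu    sync.Mutex
	conns map[*connWrapper]bool
}

func newListenerWrapper(l net.Listener) *listenerWrapper {
	return &listenerWrapper{Listener: l, conns: make(map[*connWrapper]bool)}
}

func (l *listenerWrapper) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	cw := &connWrapper{Conn: c, parent: l}
	l.mu.Lock()
	l.conns[cw] = true
	l.mu.Unlock()
	return cw, nil
}

// connWrapper wraps an accepted conn and remembers which listener owns it.
type connWrapper struct {
	net.Conn
	parent *listenerWrapper
}

// Close is the fix: dropping the listener's reference when the conn closes lets
// the conn, and the server transport hanging off it, be garbage collected.
func (c *connWrapper) Close() error {
	c.parent.mu.Lock()
	delete(c.parent.conns, c)
	c.parent.mu.Unlock()
	return c.Conn.Close()
}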

zasweq (Contributor) commented Sep 23, 2024

Sent #7664; I think that should fix it. Writing a unit test now.

snewell (Author) commented Sep 23, 2024

I deployed with the tip of your branch (v1.64.0-dev.0.20240923205304-082ee300e801 in go.mod). Thanks for the fast turnaround!

zasweq (Contributor) commented Sep 23, 2024

Alright, awesome, let me know if it helps!

snewell (Author) commented Sep 23, 2024

So far so good. It's been live for about an hour and memory is flat.

Really appreciate how fast you jumped on this. Thanks a ton!

zasweq (Contributor) commented Sep 23, 2024

Awesome, glad to hear!

purnesh42H (Contributor) commented:

The patch will be part of v1.67.1, v1.66.3, and v1.65.1.
