
Unbounded memory growth / leak in pipelines #209

Closed
derekperkins opened this issue Jan 13, 2023 · 4 comments
Labels: bug, dmap, investigation, release:v0.5.x


@derekperkins (Contributor)

Here's the RAM usage over the course of 2.5 hours, showing unbounded memory growth. This happens on every pod until it gets OOMKilled. There are about 100 RPS to the gRPC API across 30 separate pods, and each request makes on average 1 pipeline Exec call with about 100 keys per call. I've run pprof repeatedly, and it gives different results than what my OS metrics are telling me. I'm only seeing this when using the pipeline; our other service just does single Get/Put calls and maintains steady memory usage at just a little over the DMap allocation.

There are 2 DMaps, with the following config.

	// create a new Olric configuration
	cfg := config.New("lan") // default configuration
	cfg.ServiceDiscovery = map[string]any{
		"plugin": k8sDisc,
	}
	cfg.ReplicationMode = config.AsyncReplicationMode
	cfg.LogLevel = "WARN"
	cfg.LogVerbosity = 1

	cfg.DMaps.Custom[name] = config.DMap{
		EvictionPolicy:  config.LRUEviction, // evict least-recently-used entries
		MaxIdleDuration: 24 * time.Hour,     // expire entries idle for a day
		MaxInuse:        50_000_000,         // in-use memory threshold, in bytes
	}
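
Roughly, each request uses the pipeline like this (a minimal sketch of the access pattern described above; the olric calls — Pipeline, Get, Exec, FutureGet.Result — are approximations of the v0.5 client API rather than verbatim signatures, and embeddedClient, dmapName, keys, ctx are placeholder names):

	// Hedged sketch only: one pipeline per request, ~100 keys queued, one Exec.
	// The olric method names/signatures below are assumptions about the v0.5
	// client API and may differ in detail from the library.
	pipe, err := embeddedClient.Pipeline(dmapName) // one pipeline per request
	if err != nil {
		return err
	}

	futures := make([]*olric.FutureGet, 0, len(keys))
	for _, key := range keys { // ~100 keys per call
		futures = append(futures, pipe.Get(ctx, key))
	}

	if err := pipe.Exec(ctx); err != nil { // single Exec per request
		return err
	}

	for _, fut := range futures {
		if _, err := fut.Result(); err != nil {
			continue // e.g. key not found, handled per key
		}
	}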

[screenshots: pod memory usage graphs over 2.5 hours]

pprof -inuse_objects

go tool pprof -inuse_objects http://localhost:9090/debug/pprof/heap

Type: inuse_objects
Time: Jan 13, 2023 at 12:03am (MST)

Showing nodes accounting for 2137797, 71.48% of 2990871 total
Dropped 129 nodes (cum <= 14954)
Showing top 10 nodes out of 128
      flat  flat%   sum%        cum   cum%
    460136 15.38% 15.38%     460136 15.38%  github.com/buraksezer/olric.mapToRoutingTable
    415070 13.88% 29.26%     415070 13.88%  github.com/go-redis/redis/v8/internal/proto.(*Reader).readStringReply
    218460  7.30% 36.57%     218460  7.30%  net.sockaddrToTCP
    213004  7.12% 43.69%     213004  7.12%  github.com/go-redis/redis/v8/internal/proto.NewWriter (inline)
    210928  7.05% 50.74%     210928  7.05%  bufio.NewReaderSize (inline)
    162753  5.44% 56.18%     162753  5.44%  time.NewTicker
    136300  4.56% 60.74%     136300  4.56%  github.com/go-redis/redis/v8/internal/pool.NewConnPool
    131075  4.38% 65.12%     131075  4.38%  syscall.anyToSockaddr
     98315  3.29% 68.41%      98315  3.29%  net.newFD
     91756  3.07% 71.48%     636946 21.30%  github.com/go-redis/redis/v8/internal/pool.NewConn

Attached profile (inuse_objects): pprof.main_linux64.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz

pprof -alloc_objects

go tool pprof -alloc_objects http://localhost:9090/debug/pprof/heap

Type: alloc_objects
Time: Jan 13, 2023 at 12:05am (MST)

(pprof) top
Showing nodes accounting for 2447758068, 86.92% of 2816051715 total
Dropped 1298 nodes (cum <= 14080258)
Showing top 10 nodes out of 117
      flat  flat%   sum%        cum   cum%
1412298640 50.15% 50.15% 2140089146 76.00%  github.com/go-redis/redis/v8.sliceParser
 412719257 14.66% 64.81% 2143579060 76.12%  github.com/go-redis/redis/v8/internal/proto.(*Reader).ReadReply
 318266246 11.30% 76.11%  318266246 11.30%  github.com/go-redis/redis/v8/internal/proto.(*Reader).readStringReply
 150051797  5.33% 81.44%  150051797  5.33%  github.com/buraksezer/olric.mapToRoutingTable
  49612998  1.76% 83.20%   49612998  1.76%  strconv.formatBits
  28112132     1% 84.20%   77594057  2.76%  strconv.AppendInt
  22158716  0.79% 84.98%   42573491  1.51%  github.com/go-redis/redis/v8.(*Client).Pipeline
  20414775  0.72% 85.71%   20414775  0.72%  github.com/go-redis/redis/v8.(*Pipeline).init (inline)
  17542143  0.62% 86.33%   33190294  1.18%  vitess.io/vitess/go/vt/vtgate/evalengine.ToNative
  16581364  0.59% 86.92%   48094774  1.71%  github.com/buraksezer/olric/internal/discovery.NewMemberFromMetadata

Attached profile (alloc_objects): pprof.main_linux64.alloc_objects.alloc_space.inuse_objects.inuse_space.011.pb.gz

pprof -inuse_space

go tool pprof -inuse_space http://localhost:9090/debug/pprof/heap

Type: inuse_space
Time: Jan 13, 2023 at 12:11am (MST)

Showing nodes accounting for 1.11GB, 91.93% of 1.21GB total
Dropped 166 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 115
      flat  flat%   sum%        cum   cum%
    0.38GB 31.36% 31.36%     0.38GB 31.36%  bufio.NewReaderSize (inline)
    0.26GB 21.61% 52.96%     0.26GB 21.61%  github.com/tidwall/redcon.AppendBulk
    0.23GB 19.30% 72.26%     0.23GB 19.30%  bufio.NewWriterSize
    0.08GB  6.92% 79.18%     0.19GB 15.57%  github.com/tidwall/redcon.NewReader
    0.05GB  4.39% 83.57%     0.05GB  4.39%  github.com/go-redis/redis/v8/internal/pool.NewConnPool
    0.04GB  3.44% 87.01%     0.04GB  3.44%  github.com/buraksezer/olric.mapToRoutingTable
    0.02GB  2.02% 89.03%     0.02GB  2.02%  runtime.malg
    0.01GB  1.01% 90.03%     0.01GB  1.01%  net.newFD
    0.01GB  0.97% 91.00%     0.01GB  0.97%  github.com/go-redis/redis/v8/internal/proto.NewWriter
    0.01GB  0.93% 91.93%     0.01GB  0.93%  github.com/buraksezer/olric/internal/kvstore/table.New

Attached profile (inuse_space): pprof.main_linux64.alloc_objects.alloc_space.inuse_objects.inuse_space.014.pb.gz

pprof -alloc_space

go tool pprof -alloc_space http://localhost:9090/debug/pprof/heap

Type: alloc_space
Time: Jan 13, 2023 at 12:09am (MST)

Showing nodes accounting for 125.66GB, 77.90% of 161.31GB total
Dropped 1268 nodes (cum <= 0.81GB)
Showing top 10 nodes out of 152
      flat  flat%   sum%        cum   cum%
   42.78GB 26.52% 26.52%    56.50GB 35.03%  github.com/go-redis/redis/v8.sliceParser
   36.81GB 22.82% 49.34%    36.81GB 22.82%  github.com/buraksezer/olric.mapToRoutingTable
    8.84GB  5.48% 54.82%    10.01GB  6.21%  github.com/tidwall/redcon.appendPrefix
    7.66GB  4.75% 59.57%    13.63GB  8.45%  github.com/tidwall/redcon.AppendBulkString
    7.31GB  4.53% 64.10%     7.31GB  4.53%  github.com/go-redis/redis/v8/internal/proto.(*Reader).readStringReply
    6.53GB  4.05% 68.14%    56.63GB 35.10%  github.com/go-redis/redis/v8/internal/proto.(*Reader).ReadReply
    6.35GB  3.94% 72.08%     6.40GB  3.97%  github.com/tidwall/redcon.AppendBulk
    4.75GB  2.94% 75.02%     4.75GB  2.94%  bufio.NewReaderSize (inline)
    2.36GB  1.46% 76.49%     2.36GB  1.46%  bufio.NewWriterSize
    2.28GB  1.41% 77.90%     4.62GB  2.86%  github.com/tidwall/redcon.NewReader

Attached profile (alloc_space): pprof.main_linux64.alloc_objects.alloc_space.inuse_objects.inuse_space.013.pb.gz

@derekperkins (Contributor, Author)

Here are some screenshots from our continuous profiler showing 24 hours of data (averaged over 250 samples), comparing yesterday with olric running to the same day last week without it.

In Use Heap Profiles

Before Olric: [screenshot]
After Olric: [screenshot]

Alloc Heap Profiles

Before Olric: [screenshot]
After Olric: [screenshot]

CPU Profiles

I expected an increase in CPU, given that lookups now involve serialization compared to the local cache with ristretto, but I wasn't really expecting 5% CPU to turn into more than 80% of the whole service's CPU dedicated to Olric. A decent chunk of that is increased GC from all the allocations.

Before Olric: [screenshot]
After Olric: [screenshot]

@derekperkins (Contributor, Author)

The good news is that this also supports the theory that it's primarily an issue with pipelines. In a separate service that has been running in production longer and only does single Gets/Puts with no pipelines, there is definitely an increase in heap usage and GC, but it doesn't completely dominate the entire service, of which olric is just a small part.

In Use Heap Profiles

Before Olric: [screenshot]
After Olric: [screenshot]

Alloc Heap Profiles

Before Olric: [screenshot]
After Olric: [screenshot]

CPU Profiles

Before Olric: [screenshot]
After Olric: [screenshot]

@buraksezer (Owner)

@derekperkins thank you for the detailed bug report. I applied a fix to overcome the problem. Previously, EmbeddedClient.Pipeline was creating a new cluster client for every new pipeline. Reusing it can partly mitigate the problem, but all pipelines should use the same underlying cluster client.

The other problem was the routing table fetch interval. The ClusterClient implementation retrieves the routing table periodically to route requests to the partition owner. The client can work without this "smart pick" logic, but that increases latency. The interval was too short: 15 seconds. I set a new default value (1 minute) and added the WithRoutingTableFetchInterval function for setting an interval appropriate for the environment.

Now you can use EmbeddedClient.Pipeline to create a new pipeline. The EmbeddedClient implementation creates only one ClusterClient and reuses it. You can also use ClusterClient.Pipeline to create a new pipeline; it's what the EmbeddedClient.Pipeline function already uses under the hood.
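
A rough sketch of the intended usage after these changes (WithRoutingTableFetchInterval is the option described above; the NewClusterClient constructor and exact signatures here are approximations, and the address is hypothetical):

	// Sketch: one shared ClusterClient for the whole process, with the routing
	// table fetch interval raised to one minute. NewClusterClient's exact
	// signature is assumed; WithRoutingTableFetchInterval is the new option.
	client, err := olric.NewClusterClient(
		[]string{"olric-0:3320"}, // hypothetical cluster address
		olric.WithRoutingTableFetchInterval(time.Minute),
	)
	if err != nil {
		return err
	}

	// Every pipeline reuses the same underlying client.
	pipe, err := client.Pipeline("mydmap")
	if err != nil {
		return err
	}
	_ = pipe // queue Get/Put calls, then Exec(ctx)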

So my theory is that you had many pipeline instances at runtime, each with its own ClusterClient instance. Even worse, all of these cluster clients sent thousands of requests to retrieve the same routing table, and parsing and setting a new routing table in the client consumed lots of memory.

There is a lot of room to improve the pipeline implementation but the latest version should fix the memory leak problem. Could you upgrade to Olric v0.5.2 and observe the system?

@derekperkins (Contributor, Author)

v0.5.2 has been running for about 10 hours now and looks solid. I can confirm that these fixes solved the memory leak.

[screenshot: pod memory usage after the upgrade]

In Use Heap Profile: [screenshot]
Alloc Heap Profile: [screenshot]
CPU Profile: [screenshot]

As you mentioned, there is definitely still room for improvement, but at least this is stable enough to keep running.

I'm seeing an increased number of i/o timeout errors, and I need to tweak the error handling because, as it is, it's taking up a lot of resources inside our own code as well.

Thanks for the quick fix!
