Memory grows fast, suspected a leak #21484

Closed
bitomaxsp opened this issue May 3, 2023 · 33 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed), receiver/carbon

Comments

@bitomaxsp

bitomaxsp commented May 3, 2023

After updating to tag v0.76.1 and deploying to production, we noticed memory growing up to the configured limits.
The growth rate is ~18 MB/min.

Steps to reproduce
I assume deploying the collector and applying a metric point rate of ~350-400 should be enough.

What did you expect to see?
I expected memory to grow at roughly the same rate as on the previously deployed version, v0.50.0.

What did you see instead?
Memory grew at ~18 MB/min up to the configured limits (see the graphs below).

What version did you use?
Version: v0.76.1

What config did you use?
Config:

  collector.yaml: |
    extensions:
      pprof:

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:55680
            read_buffer_size: 4096
            write_buffer_size: 4096
            keepalive:
              server_parameters:
                max_connection_age: 3600s

    processors:
      memory_limiter:
        check_interval: 10s
        limit_mib: 600
        spike_limit_mib: 150
      batch:
        timeout: 10s

    exporters:
      prometheus:
        endpoint: "0.0.0.0:9090"
        metric_expiration: 24h0m0s
        send_timestamps: false
        const_labels:
          collector: lambda-opentelemetry-collector
        resource_to_telemetry_conversion:
          enabled: true

    service:
      extensions: [pprof]

      telemetry:
        metrics:
          level: detailed
          address: '0.0.0.0:8888'
        logs:
          level: "info"

      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]

Environment
OS: Amazon Linux 2 (EKS node; see nodeInfo below)

  nodeInfo:
    architecture: amd64
    bootID: 8fec644d-23e1-41d6-98c7-e8a353405686
    containerRuntimeVersion: docker://20.10.13
    kernelVersion: 5.4.196-108.356.amzn2.x86_64
    kubeProxyVersion: v1.21.12-eks-5308cf7
    kubeletVersion: v1.21.12-eks-5308cf7
    machineID: ec2563e2103fb54ccd7895e788ca1992
    operatingSystem: linux
    osImage: Amazon Linux 2

Compiler (if manually compiled): go 1.20.3, compiled for the amd64 arch.

K8s memory request: 250 MB
K8s memory limit: 800 MB

Memory consumption on v0.76.1 (time range: 14 hours): [screenshot attached]

Memory consumption on v0.50.0 (time range: ~2 hours): [screenshot attached]

bitomaxsp added the bug label on May 3, 2023
@atoulme
Contributor

atoulme commented May 3, 2023

Can you please use the pprof extension and capture memory usage?

Since you use the prometheus exporter, I assume that you're using the contrib distribution or a distribution you created yourself. I will move this report to the contrib repository.
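
For reference, a minimal way to do that, assuming the pprof extension is left on its default endpoint (localhost:1777, as in the config above where no endpoint is set):

    # pull a live-heap profile from the running collector and open an interactive pprof session
    go tool pprof -sample_index=inuse_space http://localhost:1777/debug/pprof/heap

    # or save snapshots to disk (example file names) so two points in time can be diffed later
    curl -o heap-before.pprof http://localhost:1777/debug/pprof/heap
    curl -o heap-after.pprof http://localhost:1777/debug/pprof/heap
    go tool pprof -base heap-before.pprof heap-after.pprof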

atoulme transferred this issue from open-telemetry/opentelemetry-collector on May 3, 2023
@dmitryax
Member

dmitryax commented May 3, 2023

@bitomaxsp, thanks for reporting. Given that the jump from v0.50.0 to v0.76.1 is pretty big, it's hard to pinpoint the issue. Would you mind helping us identify which specific version contributed the most to the memory consumption? It'd be great if you could try a kind of binary search starting from 0.63.0 and narrow down the version difference.

@github-actions
Contributor

github-actions bot commented May 3, 2023

Pinging code owners for exporter/prometheus: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@tj---
Contributor

tj--- commented Jun 10, 2023

I am facing a related issue, so I thought of using this thread.
In my OTel Collector Contrib setup, I have Influx as a receiver and Prometheus as an exporter (along with logging). For the tests, I did not spin up a Prometheus server, so no actual scraping is happening. In this scenario, the memory is held forever and keeps growing until the process crashes. The behavior is the same when I use the default gRPC receiver.

When I manually curl /metrics, the memory drops instantly, as if everything is released at that point. I expected memory not to be held beyond the metric_expiration configured for the Prometheus exporter (5s in my case).

Adding the heap dump & otel-config for reference.

[screenshot attached]

otel-influx_prom_pprof_heap.heap.zip

otel-conf.yaml.txt

Is this an expected behavior?

dmitryax added the help wanted label on Jun 10, 2023
@dmitryax
Member

This clearly seems like a bug in the Prometheus exporter. Any help would be appreciated. @tj---, can you help figure out in which version this bug was introduced?

@Aneurysm9, do you have a chance to take a look at it as a code owner?

@tj---
Contributor

tj--- commented Jun 11, 2023

Sure, I'll do that. Will get back in a day or two.

@tj---
Contributor

tj--- commented Jun 11, 2023

I went as far back as 0.43.0 (the oldest available arm image) and the behavior appears to be the same. I'll test the older amd64 images tomorrow.

@tj---
Contributor

tj--- commented Jun 13, 2023

@dmitryax it looks like a design choice. A colleague found that the expiry probably happens only during collection.
Is it this: func (a *lastValueAccumulator) Collect()?
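
For anyone following along, a simplified sketch of that behavior (assumed names, not the actual prometheusexporter code): stale entries are only dropped inside Collect(), which runs when /metrics is scraped, so with no scraper attached nothing is ever released regardless of metric_expiration.

    package main

    import (
        "sync"
        "time"
    )

    // lastValueAccumulator is a stand-in for the exporter's accumulator; the field
    // and method names here are assumptions for illustration only.
    type lastValueAccumulator struct {
        registeredMetrics sync.Map // metric identity -> *accumulatedValue
        metricExpiration  time.Duration
    }

    type accumulatedValue struct {
        value   float64
        updated time.Time
    }

    // Accumulate stores the latest value; nothing is ever evicted here.
    func (a *lastValueAccumulator) Accumulate(key string, v float64) {
        a.registeredMetrics.Store(key, &accumulatedValue{value: v, updated: time.Now()})
    }

    // Collect returns the current values and, as a side effect, deletes expired
    // entries. Since Collect only runs on a scrape, expiry is tied to scraping.
    func (a *lastValueAccumulator) Collect() []float64 {
        var out []float64
        a.registeredMetrics.Range(func(key, v any) bool {
            av := v.(*accumulatedValue)
            if time.Since(av.updated) > a.metricExpiration {
                a.registeredMetrics.Delete(key) // the only place stale data is freed
                return true
            }
            out = append(out, av.value)
            return true
        })
        return out
    }

    func main() {
        acc := &lastValueAccumulator{metricExpiration: 5 * time.Second}
        acc.Accumulate("http_requests_total", 42)
        time.Sleep(6 * time.Second)
        // Even though metric_expiration (5s) has passed, the entry stays in memory
        // until this Collect call, i.e. until something scrapes /metrics.
        _ = acc.Collect()
    }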

@bitomaxsp
Author

In our case we were scraping /metrics all the time, and the bug is still reproducible.
But I didn't have enough time to dig into it. :(
It is on my radar though, no ETA yet.

@bitomaxsp
Author

I tried 0.79. The issue is there, with the same growth rate.

@bitomaxsp
Author

I managed to get some pictures from 0.79.0.
But I have never profiled Go apps before, so if you want me to run it with specific commands, please tell me and I can do that.
[pprof graph: profile001]
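
In case it helps, these are typical invocations (again assuming the pprof extension's default localhost:1777 endpoint); the two sample indexes answer different questions:

    # memory currently held (the view that matters for a leak)
    go tool pprof -sample_index=inuse_space http://localhost:1777/debug/pprof/heap
    # total allocated since start (shows churn, not necessarily a leak)
    go tool pprof -sample_index=alloc_space http://localhost:1777/debug/pprof/heap

    # then, inside the interactive prompt:
    (pprof) top10
    (pprof) top10 -cum
    (pprof) web    # renders the call graph (requires graphviz)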

@bitomaxsp
Author

[pprof graph: profile002]

I am posting graphs as I run the collector. Since it is only reproducible under high load, I need to run it in the prod environment in a controlled fashion.

@bitomaxsp
Author

[pprof graph: profile003]

@bitomaxsp
Author

(pprof) top10
Showing nodes accounting for 919441, 94.07% of 977405 total
Dropped 101 nodes (cum <= 4887)
Showing top 10 nodes out of 90
      flat  flat%   sum%        cum   cum%
    420527 43.02% 43.02%     420527 43.02%  go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*AnyValue).Unmarshal
    196610 20.12% 63.14%     617137 63.14%  go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*KeyValue).Unmarshal
     85289  8.73% 71.87%     146502 14.99%  go.opentelemetry.io/collector/pdata/pmetric.MetricSlice.CopyTo
     60076  6.15% 78.01%     546140 55.88%  go.opentelemetry.io/collector/pdata/internal/data/protogen/metrics/v1.(*Metric).Unmarshal
     52577  5.38% 83.39%      52577  5.38%  go.opentelemetry.io/collector/pdata/pcommon.Map.PutEmpty
     32768  3.35% 86.74%      32768  3.35%  golang.org/x/net/http2/hpack.AppendHuffmanString
     21851  2.24% 88.98%      21851  2.24%  go.opentelemetry.io/collector/pdata/pcommon.copyFloat64Slice (inline)
     21850  2.24% 91.22%      21850  2.24%  go.opentelemetry.io/collector/pdata/pcommon.copyUInt64Slice (inline)
     16970  1.74% 92.95%      16970  1.74%  go.opentelemetry.io/collector/pdata/pmetric.NumberDataPointSlice.CopyTo
     10923  1.12% 94.07%      10923  1.12%  context.WithValue

@bitomaxsp
Author

[screenshot attached]

@bitomaxsp
Author

(pprof) top10 -cum
Showing nodes accounting for 0, 0% of 977405 total
Dropped 101 nodes (cum <= 4887)
Showing top 10 nodes out of 90
      flat  flat%   sum%        cum   cum%
         0     0%     0%     677213 69.29%  github.com/golang/protobuf/proto.Unmarshal
         0     0%     0%     677213 69.29%  github.com/golang/protobuf/proto.UnmarshalMerge
         0     0%     0%     677213 69.29%  go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/metrics/v1.(*ExportMetricsServiceRequest).Unmarshal
         0     0%     0%     677213 69.29%  go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/metrics/v1._MetricsService_Export_Handler
         0     0%     0%     677213 69.29%  go.opentelemetry.io/collector/pdata/internal/data/protogen/metrics/v1.(*ResourceMetrics).Unmarshal
         0     0%     0%     677213 69.29%  google.golang.org/grpc.(*Server).handleStream
         0     0%     0%     677213 69.29%  google.golang.org/grpc.(*Server).processUnaryRPC
         0     0%     0%     677213 69.29%  google.golang.org/grpc.(*Server).processUnaryRPC.func2
         0     0%     0%     677213 69.29%  google.golang.org/grpc.(*Server).serveStreams.func1.1
         0     0%     0%     677213 69.29%  google.golang.org/grpc/encoding/proto.codec.Unmarshal

@bitomaxsp
Author

(pprof) top10 -cum
Showing nodes accounting for 4.01MB, 6.31% of 63.50MB total
Showing top 10 nodes out of 191
      flat  flat%   sum%        cum   cum%
         0     0%     0%    36.52MB 57.51%  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/resourcetotelemetry.(*wrapperMetricsExporter).ConsumeMetrics
         0     0%     0%    36.52MB 57.51%  go.opentelemetry.io/collector/processor/batchprocessor.(*batchMetrics).export
         0     0%     0%    36.52MB 57.51%  go.opentelemetry.io/collector/processor/batchprocessor.(*shard).sendItems
         0     0%     0%    36.52MB 57.51%  go.opentelemetry.io/collector/processor/batchprocessor.(*shard).start
         0     0%     0%    32.52MB 51.21%  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/resourcetotelemetry.convertToMetricsAttributes
         0     0%     0%    19.01MB 29.94%  go.opentelemetry.io/collector/pdata/pmetric.Metrics.CopyTo
    0.51MB   0.8%   0.8%    19.01MB 29.94%  go.opentelemetry.io/collector/pdata/pmetric.ResourceMetricsSlice.CopyTo
         0     0%   0.8%    18.51MB 29.14%  go.opentelemetry.io/collector/pdata/pmetric.ResourceMetrics.CopyTo
    0.50MB  0.79%  1.59%    18.51MB 29.14%  go.opentelemetry.io/collector/pdata/pmetric.ScopeMetricsSlice.CopyTo
       3MB  4.73%  6.31%    18.01MB 28.36%  go.opentelemetry.io/collector/pdata/pmetric.MetricSlice.CopyTo

@bitomaxsp
Author

[pprof graph: profile007]

@bitomaxsp
Author

@atoulme @dmitryax Do you folks know what the way forward with this is?

@atoulme
Contributor

atoulme commented Jun 21, 2023

Do you run into an OOM eventually? This memory usage in absolute terms is very small, ~30 MiB. It would be great to have a snapshot from 8 hours in.

@bitomaxsp
Author

I didn't run it that long, but I can try.

@tj---
Contributor

tj--- commented Jun 26, 2023

@bitomaxsp @atoulme The OTel collectors have been running for many days in our systems and I observe a slow leak (Influx is the input and the logger is the output). I am attaching heap dumps taken 4 days apart.
otel_conf.yml.txt

[screenshot attached]

june22_b2b834b35fe5.heap.zip
june26_b2b834b35fe5.heap.zip

@atoulme
Contributor

atoulme commented Jun 27, 2023

[screenshot attached] It looks like the diff shows that memory has been growing in the tracer's `newRecordingSpan` function.

@atoulme
Contributor

atoulme commented Jun 27, 2023

FWIW, this was done with `go tool pprof -http=:8080 -base ~/Downloads/june22_b2b834b35fe5.heap ~/Downloads/june26_b2b834b35fe5.heap`; hopefully that is the right command.

@atoulme
Contributor

atoulme commented Jun 27, 2023

This is because this line may be called multiple times:

ctx := t.reporter.OnDataReceived(context.Background())

This leak is specific to the carbonreceiver's handling of obsreport. It should not create a new obsreport for each line it reads.
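
To illustrate the pattern (a hypothetical sketch, not the actual carbonreceiver code): starting a tracer span via the obsreport call for every line read means newRecordingSpan allocations scale with the line rate, whereas reporting once per batch keeps that cost roughly constant.

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
    )

    func parse(line string) { _ = line }

    // Anti-pattern sketched above: one obsreport/span per line. With a real SDK
    // tracer installed, every iteration allocates a new recording span.
    func receivePerLine(ctx context.Context, lines []string) {
        tracer := otel.Tracer("carbonreceiver-sketch")
        for _, line := range lines {
            lineCtx, span := tracer.Start(ctx, "receive_line")
            _ = lineCtx
            parse(line)
            span.End()
        }
    }

    // Sketch of the stated direction: one report for the whole batch of lines.
    func receivePerBatch(ctx context.Context, lines []string) {
        tracer := otel.Tracer("carbonreceiver-sketch")
        batchCtx, span := tracer.Start(ctx, "receive_batch")
        defer span.End()
        _ = batchCtx
        for _, line := range lines {
            parse(line)
        }
    }

    func main() {
        lines := []string{"foo.bar 1 1690000000"}
        receivePerLine(context.Background(), lines)
        receivePerBatch(context.Background(), lines)
    }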

@atoulme
Contributor

atoulme commented Jun 27, 2023

This might be a completely different issue than the issue first reported, fwiw.

@dmitryax
Member

@atoulme, did you use the config reported in the issue? The carbon receiver is not used there.

@dmitryax
Member

@bitomaxsp the profiles don't show anything suspicious. The one with inuse_space was taken at 63.5 MB; can you please take another one when the memory goes higher?

Also, did you have a chance to figure out which version introduced the issue between v0.50.0 and v0.76.1?

@tj---
Contributor

tj--- commented Jun 27, 2023

@atoulme, did you use the config reported in the issue? The carbon receiver is not used there.

My bad, there was a carbon receiver configured that I had removed from the config I uploaded here. (It wasn't receiving any traffic, so I thought it wasn't relevant.)
Thanks, @atoulme.

@bitomaxsp
Author

I finally kept it running for 5 days and I think the behavior is interesting. I can't explain it without reading the code, but I can most likely say there is no leak. I'll leave it running like this for 2-3 weeks and report back once I am back from vacation.

[screenshot attached]

@bitomaxsp
Author

I confirm that 0.79 and 0.81 are leak-free.
But there is increased memory consumption compared to 0.51, which initially looked like a leak.

I have been running it for ~3 weeks in production and it's good.

@bitomaxsp
Author

[screenshot attached]

The snapshot shows scale-up and scale-down of the pods.
k8s memory request: 400
k8s memory limit: 600
OTel memory_limiter setting: 450

@bitomaxsp
Author

bitomaxsp commented Aug 14, 2023

I consider the issue solved unless there are concerns. Feel free to reopen it if needed.
