Memory grows fast, suspected a leak #21484

Closed
bitomaxsp opened this issue May 3, 2023 · 33 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed), receiver/carbon

Comments

@bitomaxsp

bitomaxsp commented May 3, 2023

After updating to tag v0.76.1 and deploying to production, we noticed memory growing up to the configured limits.
The growth rate is ~18 MB/min.

Steps to reproduce
I assume deploying the collector and applying a metric point rate of ~350-400 should be enough.

What did you expect to see?
I expected memory to grow at roughly the same rate as on the previously deployed version, v0.50.0.

What did you see instead?
Memory grew at ~18 MB/min up to the configured limits (see the graphs below).

What version did you use?
Version: v0.76.1

What config did you use?
Config:

  collector.yaml: |
    extensions:
      pprof:

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:55680
            read_buffer_size: 4096
            write_buffer_size: 4096
            keepalive:
              server_parameters:
                max_connection_age: 3600s

    processors:
      memory_limiter:
        check_interval: 10s
        limit_mib: 600
        spike_limit_mib: 150
      batch:
        timeout: 10s

    exporters:
      prometheus:
        endpoint: "0.0.0.0:9090"
        metric_expiration: 24h0m0s
        send_timestamps: false
        const_labels:
          collector: lambda-opentelemetry-collector
        resource_to_telemetry_conversion:
          enabled: true

    service:
      extensions: [pprof]

      telemetry:
        metrics:
          level: detailed
          address: '0.0.0.0:8888'
        logs:
          level: "info"

      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]

Environment
OS: Amazon Linux 2 (EKS node; see nodeInfo below)

  nodeInfo:
    architecture: amd64
    bootID: 8fec644d-23e1-41d6-98c7-e8a353405686
    containerRuntimeVersion: docker://20.10.13
    kernelVersion: 5.4.196-108.356.amzn2.x86_64
    kubeProxyVersion: v1.21.12-eks-5308cf7
    kubeletVersion: v1.21.12-eks-5308cf7
    machineID: ec2563e2103fb54ccd7895e788ca1992
    operatingSystem: linux
    osImage: Amazon Linux 2

Compiler (if manually compiled): go 1.20.3, compiled for the amd64 arch.

K8s memory request: 250 MB
K8s memory limit: 800 MB

Memory consumption on v0.76.1 (time range: 14 hours): [screenshot attached]

Memory consumption on v0.50.0 (time range: ~2 hours): [screenshot attached]

bitomaxsp added the bug label on May 3, 2023
@atoulme
Contributor

atoulme commented May 3, 2023

Can you please use the pprof extension and capture memory usage?

Since you use the prometheus exporter, I assume that you're using the contrib distribution or a distribution you created yourself. I will move this report to the contrib repository.
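
For reference, a minimal way to do that, assuming the pprof extension is left on its default endpoint (localhost:1777, as in the config above where no endpoint is set):

    # pull a live-heap profile from the running collector and open an interactive pprof session
    go tool pprof -sample_index=inuse_space http://localhost:1777/debug/pprof/heap

    # or save snapshots to disk (example file names) so two points in time can be diffed later
    curl -o heap-before.pprof http://localhost:1777/debug/pprof/heap
    curl -o heap-after.pprof http://localhost:1777/debug/pprof/heap
    go tool pprof -base heap-before.pprof heap-after.pprof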

atoulme transferred this issue from open-telemetry/opentelemetry-collector on May 3, 2023
@dmitryax
Member

dmitryax commented May 3, 2023

@bitomaxsp, thanks for reporting. Given that the jump from v0.50.0 to v0.76.1 is pretty big, it's hard to pinpoint the issue. Would you mind helping us identify which specific version contributed the most to the memory consumption? It'd be great if you could try a kind of binary search starting from 0.63.0 and narrow down the version difference.

@github-actions
Contributor

github-actions bot commented May 3, 2023

Pinging code owners for exporter/prometheus: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@tj---
Contributor

tj--- commented Jun 10, 2023

I am facing a related issue, so I thought of using this thread.
In my OTel Collector Contrib setup, I have Influx as a receiver and Prometheus as an exporter (along with logging). For the tests, I did not spin up a Prometheus server, so no actual scraping is happening. In this scenario, the memory is held forever and keeps growing until the process crashes. The behavior is the same when I use the default gRPC receiver.

When I manually curl /metrics, the memory drops instantly, as if everything is released at that point. I expected memory not to be held beyond the metric_expiration configured for the Prometheus exporter (5s in my case).

Adding the heap dump & otel-config for reference.

[screenshot attached]

otel-influx_prom_pprof_heap.heap.zip

otel-conf.yaml.txt

Is this an expected behavior?

dmitryax added the help wanted label on Jun 10, 2023
@dmitryax
Member

This clearly seems like a bug in the Prometheus exporter. Any help would be appreciated. @tj---, can you help figure out in which version this bug was introduced?

@Aneurysm9, do you have a chance to take a look at it as a code owner?

@tj---
Contributor

tj--- commented Jun 11, 2023

Sure, I'll do that. Will get back in a day or two.

@tj---
Contributor

tj--- commented Jun 11, 2023

I went as far back as 0.43.0 (the oldest available arm image) and the behavior appears to be the same. I'll test the older amd64 images tomorrow.

@tj---
Contributor

tj--- commented Jun 13, 2023

@dmitryax it looks like a design choice. A colleague found that the expiry probably happens only during collection.
Is it this: func (a *lastValueAccumulator) Collect()?
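
For anyone following along, a simplified sketch of that behavior (assumed names, not the actual prometheusexporter code): stale entries are only dropped inside Collect(), which runs when /metrics is scraped, so with no scraper attached nothing is ever released regardless of metric_expiration.

    package main

    import (
        "sync"
        "time"
    )

    // lastValueAccumulator is a stand-in for the exporter's accumulator; the field
    // and method names here are assumptions for illustration only.
    type lastValueAccumulator struct {
        registeredMetrics sync.Map // metric identity -> *accumulatedValue
        metricExpiration  time.Duration
    }

    type accumulatedValue struct {
        value   float64
        updated time.Time
    }

    // Accumulate stores the latest value; nothing is ever evicted here.
    func (a *lastValueAccumulator) Accumulate(key string, v float64) {
        a.registeredMetrics.Store(key, &accumulatedValue{value: v, updated: time.Now()})
    }

    // Collect returns the current values and, as a side effect, deletes expired
    // entries. Since Collect only runs on a scrape, expiry is tied to scraping.
    func (a *lastValueAccumulator) Collect() []float64 {
        var out []float64
        a.registeredMetrics.Range(func(key, v any) bool {
            av := v.(*accumulatedValue)
            if time.Since(av.updated) > a.metricExpiration {
                a.registeredMetrics.Delete(key) // the only place stale data is freed
                return true
            }
            out = append(out, av.value)
            return true
        })
        return out
    }

    func main() {
        acc := &lastValueAccumulator{metricExpiration: 5 * time.Second}
        acc.Accumulate("http_requests_total", 42)
        time.Sleep(6 * time.Second)
        // Even though metric_expiration (5s) has passed, the entry stays in memory
        // until this Collect call, i.e. until something scrapes /metrics.
        _ = acc.Collect()
    }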

@bitomaxsp
Author

In our case we were scraping /metrics all the time, and the bug is still reproducible.
But I didn't have enough time to dig into it. :(
It is on my radar though, no ETA yet.

@bitomaxsp
Author

I tried 0.79. The issue is there, with the same growth rate.

@bitomaxsp
Author

I managed to get some pictures from 0.79.0.
But I have never profiled Go apps before, so if you want me to run it with specific commands, please tell me and I can do that.
[pprof graph: profile001]
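
In case it helps, these are typical invocations (again assuming the pprof extension's default localhost:1777 endpoint); the two sample indexes answer different questions:

    # memory currently held (the view that matters for a leak)
    go tool pprof -sample_index=inuse_space http://localhost:1777/debug/pprof/heap
    # total allocated since start (shows churn, not necessarily a leak)
    go tool pprof -sample_index=alloc_space http://localhost:1777/debug/pprof/heap

    # then, inside the interactive prompt:
    (pprof) top10
    (pprof) top10 -cum
    (pprof) web    # renders the call graph (requires graphviz)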

@bitomaxsp
Author

[pprof graph: profile002]

I am posting graphs as I run the collector. Since it is only reproducible under high load, I need to run it in the prod environment in a controlled fashion.

@bitomaxsp
Author

[pprof graph: profile003]

@bitomaxsp
Author

(pprof) top10
Showing nodes accounting for 919441, 94.07% of 977405 total
Dropped 101 nodes (cum <= 4887)
Showing top 10 nodes out of 90
      flat  flat%   sum%        cum   cum%
    420527 43.02% 43.02%     420527 43.02%  go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*AnyValue).Unmarshal
    196610 20.12% 63.14%     617137 63.14%  go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*KeyValue).Unmarshal
     85289  8.73% 71.87%     146502 14.99%  go.opentelemetry.io/collector/pdata/pmetric.MetricSlice.CopyTo
     60076  6.15% 78.01%     546140 55.88%  go.opentelemetry.io/collector/pdata/internal/data/protogen/metrics/v1.(*Metric).Unmarshal
     52577  5.38% 83.39%      52577  5.38%  go.opentelemetry.io/collector/pdata/pcommon.Map.PutEmpty
     32768  3.35% 86.74%      32768  3.35%  golang.org/x/net/http2/hpack.AppendHuffmanString
     21851  2.24% 88.98%      21851  2.24%  go.opentelemetry.io/collector/pdata/pcommon.copyFloat64Slice (inline)
     21850  2.24% 91.22%      21850  2.24%  go.opentelemetry.io/collector/pdata/pcommon.copyUInt64Slice (inline)
     16970  1.74% 92.95%      16970  1.74%  go.opentelemetry.io/collector/pdata/pmetric.NumberDataPointSlice.CopyTo
     10923  1.12% 94.07%      10923  1.12%  context.WithValue

@bitomaxsp
Author

[screenshot attached]

@bitomaxsp
Author

(pprof) top10 -cum
Showing nodes accounting for 0, 0% of 977405 total
Dropped 101 nodes (cum <= 4887)
Showing top 10 nodes out of 90
      flat  flat%   sum%        cum   cum%
         0     0%     0%     677213 69.29%  github.com/golang/protobuf/proto.Unmarshal
         0     0%     0%     677213 69.29%  github.com/golang/protobuf/proto.UnmarshalMerge
         0     0%     0%     677213 69.29%  go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/metrics/v1.(*ExportMetricsServiceRequest).Unmarshal
         0     0%     0%     677213 69.29%  go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/metrics/v1._MetricsService_Export_Handler
         0     0%     0%     677213 69.29%  go.opentelemetry.io/collector/pdata/internal/data/protogen/metrics/v1.(*ResourceMetrics).Unmarshal
         0     0%     0%     677213 69.29%  google.golang.org/grpc.(*Server).handleStream
         0     0%     0%     677213 69.29%  google.golang.org/grpc.(*Server).processUnaryRPC
         0     0%     0%     677213 69.29%  google.golang.org/grpc.(*Server).processUnaryRPC.func2
         0     0%     0%     677213 69.29%  google.golang.org/grpc.(*Server).serveStreams.func1.1
         0     0%     0%     677213 69.29%  google.golang.org/grpc/encoding/proto.codec.Unmarshal

@bitomaxsp
Author

(pprof) top10 -cum
Showing nodes accounting for 4.01MB, 6.31% of 63.50MB total
Showing top 10 nodes out of 191
      flat  flat%   sum%        cum   cum%
         0     0%     0%    36.52MB 57.51%  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/resourcetotelemetry.(*wrapperMetricsExporter).ConsumeMetrics
         0     0%     0%    36.52MB 57.51%  go.opentelemetry.io/collector/processor/batchprocessor.(*batchMetrics).export
         0     0%     0%    36.52MB 57.51%  go.opentelemetry.io/collector/processor/batchprocessor.(*shard).sendItems
         0     0%     0%    36.52MB 57.51%  go.opentelemetry.io/collector/processor/batchprocessor.(*shard).start
         0     0%     0%    32.52MB 51.21%  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/resourcetotelemetry.convertToMetricsAttributes
         0     0%     0%    19.01MB 29.94%  go.opentelemetry.io/collector/pdata/pmetric.Metrics.CopyTo
    0.51MB   0.8%   0.8%    19.01MB 29.94%  go.opentelemetry.io/collector/pdata/pmetric.ResourceMetricsSlice.CopyTo
         0     0%   0.8%    18.51MB 29.14%  go.opentelemetry.io/collector/pdata/pmetric.ResourceMetrics.CopyTo
    0.50MB  0.79%  1.59%    18.51MB 29.14%  go.opentelemetry.io/collector/pdata/pmetric.ScopeMetricsSlice.CopyTo
       3MB  4.73%  6.31%    18.01MB 28.36%  go.opentelemetry.io/collector/pdata/pmetric.MetricSlice.CopyTo

@bitomaxsp
Author

[pprof graph: profile007]

@bitomaxsp
Author

@atoulme @dmitryax Do you folks know what the way forward with this is?

@atoulme
Contributor

atoulme commented Jun 21, 2023

Do you run into an OOM eventually? This memory usage in absolute terms is very small, ~30 MiB. It would be great to have a snapshot from 8 hours in.

@bitomaxsp
Author

I didn't run it that long, but I can try.

@tj---
Contributor

tj--- commented Jun 26, 2023

@bitomaxsp @atoulme The OTel collectors have been running for many days in our systems and I observe a slow leak (Influx is the input and the logger is the output). I am attaching heap dumps taken 4 days apart.
otel_conf.yml.txt

[screenshot attached]

june22_b2b834b35fe5.heap.zip
june26_b2b834b35fe5.heap.zip

@atoulme
Contributor

atoulme commented Jun 27, 2023

[screenshot attached] It looks like the diff shows that memory has been growing in the tracer's `newRecordingSpan` function.

@atoulme
Contributor

atoulme commented Jun 27, 2023

FWIW, this was done with `go tool pprof -http=:8080 -base ~/Downloads/june22_b2b834b35fe5.heap ~/Downloads/june26_b2b834b35fe5.heap`; hopefully that is the right command.

@atoulme
Contributor

atoulme commented Jun 27, 2023

This is because this line may be called multiple times:

ctx := t.reporter.OnDataReceived(context.Background())

This leak is specific to the carbonreceiver's handling of obsreport. It should not create a new obsreport for each line it reads.
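
To illustrate the pattern (a hypothetical sketch, not the actual carbonreceiver code): starting a tracer span via the obsreport call for every line read means newRecordingSpan allocations scale with the line rate, whereas reporting once per batch keeps that cost roughly constant.

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
    )

    func parse(line string) { _ = line }

    // Anti-pattern sketched above: one obsreport/span per line. With a real SDK
    // tracer installed, every iteration allocates a new recording span.
    func receivePerLine(ctx context.Context, lines []string) {
        tracer := otel.Tracer("carbonreceiver-sketch")
        for _, line := range lines {
            lineCtx, span := tracer.Start(ctx, "receive_line")
            _ = lineCtx
            parse(line)
            span.End()
        }
    }

    // Sketch of the stated direction: one report for the whole batch of lines.
    func receivePerBatch(ctx context.Context, lines []string) {
        tracer := otel.Tracer("carbonreceiver-sketch")
        batchCtx, span := tracer.Start(ctx, "receive_batch")
        defer span.End()
        _ = batchCtx
        for _, line := range lines {
            parse(line)
        }
    }

    func main() {
        lines := []string{"foo.bar 1 1690000000"}
        receivePerLine(context.Background(), lines)
        receivePerBatch(context.Background(), lines)
    }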

@atoulme
Contributor

atoulme commented Jun 27, 2023

This might be a completely different issue than the issue first reported, fwiw.

@dmitryax
Member

@atoulme, did you use the config reported in the issue? The carbon receiver is not used there.

@dmitryax
Member

@bitomaxsp the profiles don't show anything suspicious. The one with inuse_space was taken at 63.5 MB; can you please take another one when the memory goes higher?

Also, did you have a chance to figure out which version introduced the issue between v0.50.0 and v0.76.1?

@tj---
Contributor

tj--- commented Jun 27, 2023

@atoulme, did you use the config reported in the issue? The carbon receiver is not used there.

My bad, there was a carbon receiver configured that I had removed from the config I uploaded here. (It wasn't receiving any traffic, so I thought it wasn't relevant.)
Thanks, @atoulme.

@bitomaxsp
Author

I finally kept it running for 5 days and I think the behavior is interesting. I can't explain it without reading the code, but I can most likely say there is no leak. I'll leave it running like this for 2-3 weeks and report back once I am back from vacation.

[screenshot attached]

@bitomaxsp
Author

I confirm that 0.79 and 0.81 are leak-free.
But there is increased memory consumption compared to 0.51, which initially looked like a leak.

I have been running it for ~3 weeks in production and it's good.

@bitomaxsp
Author

[screenshot attached]

The snapshot shows scale-up and scale-down of the pods.
k8s memory request: 400
k8s memory limit: 600
OTel memory_limiter setting: 450

@bitomaxsp
Author

bitomaxsp commented Aug 14, 2023

I consider the issue solved unless there are concerns. Feel free to reopen it if needed.
