
OTel Collector potential memory leak #802

Closed
mxiamxia opened this issue Apr 7, 2020 · 4 comments
Labels
help wanted Good issue for contributors to OpenTelemetry Service to pick up

Comments

@mxiamxia
Member

mxiamxia commented Apr 7, 2020

I have the Collector set up with the following pipeline config and I am sending about 500 span requests per second to the Jaeger receiver in the Collector. The Collector heap size starts growing after 10-20 minutes of running and the Collector instance eventually crashes. From the profiling, I can see the span data buffered in jaeger.jSpansToOCProtoSpans(jspans []*jaeger.Span). I also tried the Zipkin receiver and hit the same problem: the data buffers in zipkinreceiver.zipkinTagsToTraceAttributes.

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [opencensus, jaeger, zipkin]
      exporters: [jaeger, zipkin]
      processors: [batch, queued_retry]
    metrics:
      receivers: [opencensus]
      exporters: [prometheus]

[Screenshot attached: Screen Shot 2020-04-07 at 11:40:37 AM]

@mxiamxia changed the title from "OTel Collector" to "OTel Collector potential memory leak" on Apr 7, 2020
@tigrannajaryan
Member

Can you also attach Collector logs?

Do you have both the Jaeger and Zipkin backends up and running and accepting data from the Collector?

@mxiamxia
Member Author

mxiamxia commented Apr 7, 2020

Hi Tigran, I have both the Jaeger and Zipkin backends running in separate containers (see the attached picture). I am seeing the following errors in the logs. I guess the root cause could be that when the backend (the Zipkin server) fails to handle the POST request, the Collector fails to process the data, keeps buffering it on the Collector side, and eventually hits OOM?

otel-agent_1              | 2020-04-06T22:38:48.432713200Z   {"exporter": "logging"}
otel-agent_1              | 2020-04-06T22:38:48.432735000Z {"level":"warn","ts":1586212728.2547228,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.19.0.4:14250: connect: connection refused\"","spanFormat":"zipkin"}
otel-agent_1              | 2020-04-06T22:38:48.432767200Z {"level":"warn","ts":1586212728.2583811,"caller":"queuedprocessor/queued_processor.go:199","msg":"Failed to process batch, re-enqueued","processor":"queued_retry","batch-size":5}
otel-agent_1              | 2020-04-06T22:34:48.559982900Z {"level":"warn","ts":1586212488.5281117,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"Post \"http://zipkin-all-in-one:9411/api/v2/spans\": dial tcp 172.19.0.2:9411: connect: connection refused","spanFormat":"zipkin"}
otel-agent_1              | 2020-04-06T22:34:48.560011900Z {"level":"warn","ts":1586212488.5281756,"caller":"queuedprocessor/queued_processor.go:199","msg":"Failed to process batch, re-enqueued","processor":"queued_retry","batch-size":16}
otel-agent_1              | 2020-04-06T22:34:48.560034700Z {"level":"warn","ts":1586212488.5282025,"caller":"queuedprocessor/queued_processor.go:205","msg":"Backing off before next attempt","processor":"queued_retry","backoff_delay":5}
otel-agent_1              | 2020-04-06T22:34:48.560155800Z {"level":"warn","ts":1586212488.535339,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"Post \"http://zipkin-all-in-one:9411/api/v2/spans\": dial tcp 172.19.0.2:9411: connect: connection refused","spanFormat":"zipkin"}
otel-agent_1              | 2020-04-06T22:34:48.560244700Z {"level":"warn","ts":1586212488.5357203,"caller":"queuedprocessor/queued_processor.go:199","msg":"Failed to process batch, re-enqueued","processor":"queued_retry","batch-size":3}
otel-agent_1              | 2020-04-06T22:34:48.560279000Z {"level":"warn","ts":1586212488.5359087,"caller":"queuedprocessor/queued_processor.go:205","msg":"Backing off before next attempt","processor":"queued_retry","backoff_delay":5}
otel-agent_1              | 2020-04-06T22:34:48.560323400Z {"level":"warn","ts":1586212488.5417295,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"Post \"http://zipkin-all-in-one:9411/api/v2/spans\": dial tcp 172.19.0.2:9411: connect: connection refused","spanFormat":"zipkin"}

[Screenshot attached: Screen Shot 2020-04-07 at 2:40:55 PM]

@tigrannajaryan added the help wanted (Good issue for contributors to OpenTelemetry Service to pick up) label on Apr 22, 2020
@james-bebbington
Member

james-bebbington commented May 6, 2020

I did a bit of investigation on this, and as far as I can tell it is working as intended (no memory leak), although there are definitely things that could be improved.

If I configured the Zipkin and Jaeger backends correctly, I was able to push through thousands of traces per second with minimal memory usage. But if I didn't configure a Zipkin or Jaeger backend (or misconfigured the exporter), the unsent traces would back up and consume a lot of memory (though not unbounded).

The minimal setup needed to replicate this is to start the following containers (as per examples/demo):

demo_jaeger-emitter_1
demo_otel-agent_1

^ Configure the emitter to push a large number of traces, with a pipeline like:

traces:
  receivers: [jaeger]
  exporters: [jaeger] # or zipkin
  processors: [queued_retry]
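That is essentially the trace pipeline from the original report trimmed down to a single receiver and exporter, with the batch processor removed; in the full config it sits under service.pipelines:

service:
  pipelines:
    traces:
      receivers: [jaeger]
      exporters: [jaeger] # or zipkin
      processors: [queued_retry]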

The queued_retry processor defaults to a limit of 5000 items in the queue (+10 in process), but an "item" in this case is a whole batch of traces received from Jaeger. The emitter used in the demo uses a 5s flush interval and the Jaeger client defaults to a 1MB max packet size. So if you're pushing a decent number of reasonably large traces (say 500/s at ~500 bytes per trace, which exceeds 1MB per 5s flush), you'll be sending through a lot of large packets, and you're looking at roughly 5000 * 1MB = 5GB of queued memory in total (in my tests actual memory usage was roughly 50% higher, presumably because of the overhead of the internal representation, etc.). This accumulates fairly slowly, maxing out after around 30 minutes (depending on how frequently you're pushing spans), but presumably consumes enough memory to trigger OOM errors some time before that.
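Spelling that arithmetic out (the ~50% overhead factor is just what I observed in my tests, not a measured constant):

500 traces/s * ~500 bytes/trace * 5 s flush  ≈ 1.25 MB per flush (capped near the 1 MB max packet size)
5000 queued batches * ~1 MB per batch        ≈ 5 GB of queued payload
5 GB * ~1.5 overhead factor                  ≈ 7.5 GB actual usage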


Note we use Jaeger's bounded queue internally. They recently added a Resize() function to the bounded queue, which lets them adjust the queue size based on consumed memory: jaegertracing/jaeger#943

That's something we might want to consider for the future. For now, it might be a good idea to reduce the default queue size somewhat (maybe 2000 is more reasonable, although I'm not sure how large the batches we should reasonably expect from various receivers will be).
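If anyone wants to work around this in the meantime, a config along these lines should cap the queue at a smaller size; I'm writing queue_size from memory, so check the queuedprocessor README for the exact key name and default:

processors:
  queued_retry:
    queue_size: 2000  # assumed key name; the default is 5000 at the time of writing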


Not completely relevant, but it's also worth noting this issue about how items are dropped when the bounded queue reaches its limit: jaegertracing/jaeger#1947

@tigrannajaryan
Member

Closing based on comment from @james-bebbington
Feel free to reopen if there is evidence that the leak exists.

MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this issue Apr 27, 2023
swiatekm pushed a commit to swiatekm/opentelemetry-collector that referenced this issue Oct 9, 2024