
OTel Collector potential memory leak #802

Closed
mxiamxia opened this issue Apr 7, 2020 · 4 comments
Labels
help wanted Good issue for contributors to OpenTelemetry Service to pick up

Comments

@mxiamxia
Member

mxiamxia commented Apr 7, 2020

I have the Collector set up with the following pipeline config and I am sending about 500 span requests per second to the Jaeger receiver in the Collector. The Collector heap size starts growing after 10-20 minutes of running and the Collector instance eventually crashes. From the profiling, I can see the span data buffered in jaeger.jSpansToOCProtoSpans(jspans []*jaeger.Span). I also tried the Zipkin receiver and hit the same problem: the data buffers in zipkinreceiver.zipkinTagsToTraceAttributes.

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [opencensus, jaeger, zipkin]
      exporters: [jaeger, zipkin]
      processors: [batch, queued_retry]
    metrics:
      receivers: [opencensus]
      exporters: [prometheus]

[Screenshot attached: Screen Shot 2020-04-07 at 11:40:37 AM]

@mxiamxia changed the title from "OTel Collector" to "OTel Collector potential memory leak" on Apr 7, 2020
@tigrannajaryan
Member

Can you also attach Collector logs?

Do you have both the Jaeger and Zipkin backends up and running and accepting data from the Collector?

@mxiamxia
Member Author

mxiamxia commented Apr 7, 2020

Hi Tigran, I have both the Jaeger and Zipkin backends running in separate containers (see the attached picture). I am seeing the following errors in the logs. I guess the root cause could be that when the backend (the Zipkin server) fails to handle the POST request, the Collector fails to process the data, keeps buffering it on the Collector side, and eventually hits OOM?

otel-agent_1              | 2020-04-06T22:38:48.432713200Z   {"exporter": "logging"}
otel-agent_1              | 2020-04-06T22:38:48.432735000Z {"level":"warn","ts":1586212728.2547228,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.19.0.4:14250: connect: connection refused\"","spanFormat":"zipkin"}
otel-agent_1              | 2020-04-06T22:38:48.432767200Z {"level":"warn","ts":1586212728.2583811,"caller":"queuedprocessor/queued_processor.go:199","msg":"Failed to process batch, re-enqueued","processor":"queued_retry","batch-size":5}
otel-agent_1              | 2020-04-06T22:34:48.559982900Z {"level":"warn","ts":1586212488.5281117,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"Post \"http://zipkin-all-in-one:9411/api/v2/spans\": dial tcp 172.19.0.2:9411: connect: connection refused","spanFormat":"zipkin"}
otel-agent_1              | 2020-04-06T22:34:48.560011900Z {"level":"warn","ts":1586212488.5281756,"caller":"queuedprocessor/queued_processor.go:199","msg":"Failed to process batch, re-enqueued","processor":"queued_retry","batch-size":16}
otel-agent_1              | 2020-04-06T22:34:48.560034700Z {"level":"warn","ts":1586212488.5282025,"caller":"queuedprocessor/queued_processor.go:205","msg":"Backing off before next attempt","processor":"queued_retry","backoff_delay":5}
otel-agent_1              | 2020-04-06T22:34:48.560155800Z {"level":"warn","ts":1586212488.535339,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"Post \"http://zipkin-all-in-one:9411/api/v2/spans\": dial tcp 172.19.0.2:9411: connect: connection refused","spanFormat":"zipkin"}
otel-agent_1              | 2020-04-06T22:34:48.560244700Z {"level":"warn","ts":1586212488.5357203,"caller":"queuedprocessor/queued_processor.go:199","msg":"Failed to process batch, re-enqueued","processor":"queued_retry","batch-size":3}
otel-agent_1              | 2020-04-06T22:34:48.560279000Z {"level":"warn","ts":1586212488.5359087,"caller":"queuedprocessor/queued_processor.go:205","msg":"Backing off before next attempt","processor":"queued_retry","backoff_delay":5}
otel-agent_1              | 2020-04-06T22:34:48.560323400Z {"level":"warn","ts":1586212488.5417295,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"Post \"http://zipkin-all-in-one:9411/api/v2/spans\": dial tcp 172.19.0.2:9411: connect: connection refused","spanFormat":"zipkin"}

[Screenshot attached: Screen Shot 2020-04-07 at 2:40:55 PM]

@tigrannajaryan added the help wanted (Good issue for contributors to OpenTelemetry Service to pick up) label on Apr 22, 2020
@james-bebbington
Member

james-bebbington commented May 6, 2020

I did a bit of investigation on this, and as far as I can tell it is working as intended (no memory leak), although there are definitely things that could be improved.

If I configured the Zipkin and Jaeger backends correctly, I was able to push through thousands of traces per second with minimal memory usage. But if I didn't configure a Zipkin or Jaeger backend (or misconfigured the exporter), the unsent traces would back up and consume a lot of memory (though not unbounded).

The minimal setup needed to replicate this is to start the following containers (as per examples/demo):

demo_jaeger-emitter_1
demo_otel-agent_1

^ Configure the emitter to push a large number of traces, with a pipeline like:

traces:
  receivers: [jaeger]
  exporters: [jaeger] # or zipkin
  processors: [queued_retry]
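That is essentially the trace pipeline from the original report trimmed down to a single receiver and exporter, with the batch processor removed; in the full config it sits under service.pipelines:

service:
  pipelines:
    traces:
      receivers: [jaeger]
      exporters: [jaeger] # or zipkin
      processors: [queued_retry]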

The queued_retry processor defaults to a limit of 5000 items in the queue (+10 in process), but an "item" in this case is a whole batch of traces received from Jaeger. The emitter used in the demo uses a 5s flush interval and the Jaeger client defaults to a 1MB max packet size. So if you're pushing a decent number of reasonably large traces (say 500/s at ~500 bytes per trace, which exceeds 1MB per 5s flush), you'll be sending through a lot of large packets, and you're looking at roughly 5000 * 1MB = 5GB of queued memory in total (in my tests actual memory usage was roughly 50% higher, presumably because of the overhead of the internal representation, etc.). This accumulates fairly slowly, maxing out after around 30 minutes (depending on how frequently you're pushing spans), but presumably consumes enough memory to trigger OOM errors some time before that.
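Spelling that arithmetic out (the ~50% overhead factor is just what I observed in my tests, not a measured constant):

500 traces/s * ~500 bytes/trace * 5 s flush  ≈ 1.25 MB per flush (capped near the 1 MB max packet size)
5000 queued batches * ~1 MB per batch        ≈ 5 GB of queued payload
5 GB * ~1.5 overhead factor                  ≈ 7.5 GB actual usage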


Note we use Jaeger's bounded queue internally. They recently added a Resize() function to the bounded queue, which lets them adjust the queue size based on consumed memory: jaegertracing/jaeger#943

That's something we might want to consider for the future. For now, it might be a good idea to reduce the default queue size somewhat (maybe 2000 is more reasonable, although I'm not sure how large the batches we should reasonably expect from various receivers will be).
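If anyone wants to work around this in the meantime, a config along these lines should cap the queue at a smaller size; I'm writing queue_size from memory, so check the queuedprocessor README for the exact key name and default:

processors:
  queued_retry:
    queue_size: 2000  # assumed key name; the default is 5000 at the time of writing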


Not completely relevant, but it's also worth noting this issue about how items are dropped when the bounded queue reaches its limit: jaegertracing/jaeger#1947

@tigrannajaryan
Member

Closing based on comment from @james-bebbington
Feel free to reopen if there is evidence that the leak exists.

MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this issue Apr 27, 2023
swiatekm pushed a commit to swiatekm/opentelemetry-collector that referenced this issue Oct 9, 2024