Performance Regression Upgrading from v0.71.0 -> v0.73.0 #19798
Hard to tell what is causing this. Your setup is too complex to tell which component triggered the rise in usage. Can you try upgrading to v0.72.0?
I agree with this. Can you try 0.72 first, so we can narrow down the scope of the problem?
Unfortunately, none of my test apps reproduced it, and when it hit production it was a non-trivial incident. If it helps, my best guess is that it's in the logs pipelines. The reason is that we have only just introduced engineers to metrics and traces, so adoption is still low, and the impact was much wider than the adoption of traces and metrics. Also, the logs are very chatty, since many services were using logs for metrics. It's on my roadmap to build some sort of test that will trigger this, but with current priorities I have no way to try to reproduce it right now. We have pinned to 0.71 for now.
The other detail is that the performance problem was probably related to an increase in memory usage. We have the memory_limiter configured, so I assume most of the CPU was the memory limiter trying to bring memory back down.
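For reference, a minimal memory_limiter setup looks roughly like this; the numbers are placeholders, not our actual production limits:

```yaml
processors:
  memory_limiter:
    # How often memory usage is checked against the limits below.
    check_interval: 1s
    # Hard limit on the collector's memory usage.
    limit_mib: 1500
    # Soft limit is limit_mib - spike_limit_mib; above it the processor
    # starts refusing incoming data (GC is forced above the hard limit).
    spike_limit_mib: 300
```

The memory_limiter should also be the first processor in each pipeline so it can push back on receivers before the rest of the pipeline does work.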
We experienced a similar issue after upgrading. One commonality between the above config and ours is the fluentforward receiver and the Loki exporter.
We are also seeing this in 0.74. As above, we are also using the fluentforward receiver and the Loki exporter. This started happening when we added these two components and started shipping more logs. We see CPU and memory slowly going up.
Please use the pprof extension to take snapshots of memory usage over time and see where the leak might be.
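For anyone else hitting this, enabling the pprof extension is a small config change (the endpoint below uses the extension's default port):

```yaml
extensions:
  pprof:
    # Serves the standard net/http/pprof handlers; 1777 is the default port.
    endpoint: 0.0.0.0:1777

service:
  extensions: [pprof]
```

Heap snapshots can then be taken over time with `go tool pprof http://<collector-host>:1777/debug/pprof/heap` and compared to see which allocations keep growing.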
Pinging code owners for receiver/fluentforward: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.
Pinging code owners for exporter/loki: @gramidt @gouthamve @jpkrohling @kovrus @mar4uk. See Adding Labels via Comments if you do not have permissions to add labels yourself.
@jpkrohling I'm pretty sure it's actually the fluentforward receiver, as this was changed between these releases. I'm trying to get some details from pprof.
One idea in case pprof doesn't help: you can try building your own collector distribution, using the latest Loki exporter and the previous Fluentforward receiver. Something like:
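A sketch of such an OpenTelemetry Collector Builder (ocb) manifest; the module versions below are placeholders for "latest" and "previous", not tested pins:

```yaml
dist:
  name: otelcol-custom
  description: Custom distribution mixing component versions
  output_path: ./otelcol-custom

receivers:
  # Pin the fluentforward receiver to the last known-good release.
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver v0.71.0

exporters:
  # Keep the Loki exporter on a current release.
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/lokiexporter v0.74.0
```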
@atoulme It looks like it's the fluentforward receiver. I've attached the image. Perhaps this is related to #18860? That seems to be the only relevant recent change. Here is a screenshot of the tasks' resources. You can see that the CPU and the memory are increasing slowly over time. There are some dips, but those are reboots. This is a staging environment and it receives fairly minimal traffic.
Looks like it's indeed caused by #18860. We have one testbed test case covering the fluentforward receiver, and it shows a significant increase from:
to
@TylerHelmuth, do you have a chance to take a look into that?
I'm looking into it, but I don't see anything immediately wrong with #18860. The pattern the fluentforward receiver uses for obsreport is the same as other receivers in contrib.
Ok, I found the problem. The #18860 implementation passes the function's ctx into the obsreport calls. The problem is that the ctx never changes: the receiver is set up with a single context at startup, and that is the context that gets passed into processLogs and then into obsreport. With the obsreport calls:
Without:
I'm not very familiar with receivers or Fluent Bit, so I'm not sure whether the single, long-lived context is inappropriate or whether there is a bug in obsreport that is mutating the context incorrectly. A potential solution is to pass in a new context for each operation. @dmitryax, how would you like to proceed?
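To make that concrete, here is a rough sketch of the problematic shape; it is not the actual receiver code, and the function signature, channel, and "fluent" format string are illustrative assumptions, but the obsreport helpers (StartLogsOp/EndLogsOp) are the ones being discussed:

```go
package fluentforwardreceiver // placement is illustrative, not the real file

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/obsreport"
	"go.opentelemetry.io/collector/pdata/plog"
)

// processLogs sketches the consume loop: ctx is created once per connection
// at startup and lives for the whole connection, while logsCh delivers the
// individual batches read from that connection.
func processLogs(ctx context.Context, obsrecv *obsreport.Receiver, next consumer.Logs, logsCh <-chan plog.Logs) {
	for logs := range logsCh {
		// Problematic pattern: reassigning ctx means every StartLogsOp call
		// wraps the previous context, so spans and context values accumulate
		// for the lifetime of the connection instead of per batch.
		ctx = obsrecv.StartLogsOp(ctx)
		err := next.ConsumeLogs(ctx, logs)
		obsrecv.EndLogsOp(ctx, "fluent", logs.LogRecordCount(), err)
	}
}
```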
@TylerHelmuth nice investigation. Thank you! I believe we should not mutate the original long-lived context.
It should be done like this for other receivers; we probably didn't run into this before because other receivers use a context per request, not one that is reused in a consume loop.
I believe this is the culprit then, but I haven't dug into it: https://github.com/open-telemetry/opentelemetry-collector/blob/62410af79ab454e43cf7aec2b0bd8fa6f5f8b3ba/obsreport/obsreport_receiver.go#L219
I think obsreport is good. We just shouldn't mutate the existing context. Let's pass a fresh, per-operation context to the obsreport calls instead.
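In terms of the sketch above, that means keeping the result of StartLogsOp in its own short-lived variable rather than overwriting the connection context (again illustrative, not the exact fix):

```go
for logs := range logsCh {
	// Derive a per-operation context from the long-lived connection context
	// instead of overwriting it; each batch gets its own short-lived span.
	obsCtx := obsrecv.StartLogsOp(ctx)
	err := next.ConsumeLogs(obsCtx, logs)
	obsrecv.EndLogsOp(obsCtx, "fluent", logs.LogRecordCount(), err)
}
```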
Ah, I misinterpreted your comment. I've actually already tested that scenario and it works as expected. I'll submit a PR.
Component(s)
No response
What happened?
Description
Last week we rolled out an upgraded version of OpenTelemetry Contrib to our production environment and started having memory and CPU problems, resulting in dropped metrics and high-CPU alarms.
We upgraded straight from v0.71.0 to v0.73.0, so we have no idea which version introduced the regression.
Steps to Reproduce
Expected Result
Actual Result
You can see from the ECS metrics below that the point where CPU and memory drop is where we rolled back to v0.71.0.
The blank period before the lines is the sidecar dropping data due to being overwhelmed.
Collector version
v0.73.0
Environment information
Environment
OS: ECS Fargate Linux
OpenTelemetry Collector configuration
Log output
No response
Additional context
I use a confd-based entrypoint in my container to render the config, hence the metadata templating pieces.
We use the APK published by the contrib project for Alpine.