High memory usage for otel collector during scraping of Prometheus metrics #21912
Comments
Hello! I'm the creator of the kube-otel-stack chart and will probably be upstreaming what I'm about to recommend... @vermaprateek695 I would recommend adding a filter strategy to the target allocator configuration, which will drop targets that aren't being scraped and should help reduce memory usage. We've also found that increasing the interval at which the collector pulls configuration from the target allocator decreases target allocator memory usage. In addition to all of that, what version of the collector and target allocator are you on?
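For illustration, a rough sketch of what those two settings can look like; the field names follow the OpenTelemetry Operator CRD and the collector's prometheus receiver, but exact keys and values depend on your operator and chart versions, so treat the service name and numbers as placeholders to verify against the docs:

    # In the OpenTelemetryCollector custom resource (operator-managed setup):
    spec:
      targetAllocator:
        enabled: true
        filterStrategy: relabel-config   # drop targets that no scrape job's relabel rules keep

    # In the collector's prometheus receiver config:
    receivers:
      prometheus:
        target_allocator:
          endpoint: http://my-targetallocator-service:80   # placeholder service name
          interval: 60s            # poll the target allocator less often to reduce its load
          collector_id: ${POD_NAME}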
Hi @jaronoff97
If possible, could you provide a reference or example of how to implement the filter strategy in the target allocator?
Apologies, the docs are still being worked on here. Sorry, I didn't mean the scrape interval for your scrape configs, but rather the interval at which the collector gets configuration from the target allocator. Either way, I double-checked the values in the chart and they should be okay. How much memory do your collectors and target allocator(s) use? Are you able to send the count of the scrape_samples_scraped metric?
Hi @jaronoff97, thanks for the reply and for suggesting a solution. We will check the documentation on how to add the filter strategy. We will also provide the memory consumed by both the target allocator and the otel collector, but from my observation over the past couple of weeks, the target allocator consumes minimal memory; the most memory-hungry component is the otel collector pod, at more than 85% and up to 96%. Regarding the count of scrape_samples_scraped, any hint on how to get those counts?

We now also have another problem: the exporter buffer is full. We no longer see data being dropped due to memory pressure; instead, the export fails because the exporter's buffer is full. Any idea on this as well? It looks like a significant performance issue. Here are the logs for the buffer issue. Once the buffer issue is resolved and the collector starts exporting metrics again, memory consumption will climb back up and we will be able to provide concrete numbers.

2023-05-17T17:10:06.090Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1684343396088, "target_labels": "{name="up", endpoint="http-metrics-dnsmasq", instance="100.96.2.80:10054", job="kube-dns", namespace="kube-system", pod="coredns-7d8c995bfd-h68gg", service="prometheus-stack-kube-dns"}"}
Hello, I'm not sure I've seen this one before... I'm wondering if tuning the exporter's configuration may help here (docs here). Something like decreasing the queue size?
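For reference, a minimal sketch of the sending-queue settings being discussed; the otlp endpoint and the numbers below are placeholders, and the exact defaults depend on the collector version:

    exporters:
      otlp:
        endpoint: otlp.example.com:4317   # placeholder endpoint
        sending_queue:
          enabled: true
          queue_size: 1000       # lower this to cap how many batches can pile up in memory
          num_consumers: 10
        retry_on_failure:
          enabled: true
          max_elapsed_time: 300s # give up on a batch after this long instead of retrying forever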
Hi @jaronoff97, decreasing the queue size did not work for us. The other option, using a persistent queue to store the queue on the container filesystem instead of an in-memory buffer, throws an error saying it cannot find the directory. As per the official documentation, the default storage directory for file_storage is /var/lib/otelcol/file_storage (https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/storage/filestorage), but after deploying it throws a "file/directory doesn't exist" error. It is also not possible to exec into the container to create the directory or volume. Is there any way to achieve this?
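As an illustration of what a working setup would need: the file_storage extension requires the directory to already exist and be writable, which in Kubernetes usually means mounting a volume at that path through the collector spec. A hedged sketch, with placeholder names not taken from the chart:

    extensions:
      file_storage:
        directory: /var/lib/otelcol/file_storage   # must exist in the container, e.g. via a mounted volume

    exporters:
      otlp:
        endpoint: otlp.example.com:4317            # placeholder endpoint
        sending_queue:
          storage: file_storage                    # persist the queue via the extension above

    service:
      extensions: [file_storage]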
Hm, I may ask @swiatekm-sumo, who I think knows more about that codepath than I do.
Reading through this issue, it doesn't look like your problem lies with the collector at all, but rather with the remote throwing errors when you try to export. It's difficult to tell what exactly the root cause is without further debugging. I'd start by checking the collector's own exporter metrics, such as otelcol_exporter_queue_size and otelcol_exporter_sent_metric_points, to see whether data is actually leaving the collector.

I think you should ignore the storage extension and the persistent queue for now. Your problem is most likely that your queue keeps growing, and changing the queue type won't help with that.
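For context, those otelcol_* metrics come from the collector's own internal telemetry endpoint; a minimal sketch of exposing it (0.0.0.0:8888 is the collector's usual default address):

    service:
      telemetry:
        metrics:
          address: 0.0.0.0:8888   # the collector serves its own metrics at http://<pod>:8888/metrics

    # Scraping that endpoint (for example with an extra Prometheus scrape job, or by
    # curling it from inside the cluster) shows otelcol_exporter_queue_size,
    # otelcol_exporter_sent_metric_points and the other otelcol_* metrics over time.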
Hi @swiatekm-sumo, we checked both metrics (otelcol_exporter_queue_size and otelcol_exporter_sent_metric_points) and the value is zero for both. That may be because the buffer fills up as soon as we deploy the otel collector. Do you have any clue or suggestions that could help here?
It looks like the remote (Lightstep, I'm guessing?) is just rejecting your data. Are you sure your exporter configuration is correct? You can also look at some other exporter metrics to confirm your data is being dropped. Overall, though, you're going to need help from someone familiar with Lightstep and that chart. I'm not sure why otel is using a lot of memory with this setup. You could look at how much data you're scraping; I believe the metric for that is scrape_samples_scraped.
Hi @swiatekm-sumo, we are not using Lightstep as the remote system; we are using Kibana. The buffer issue only started last week; before that it was just the memory issue that caused the data drops. We are using the same exporter configuration that worked a few weeks back.
Hi @swiatekm-sumo, correct me if this is not the right file for reducing the scrape endpoints. Please find the attached scrape_configs file.
According to your metrics, you only have an otlp exporter, and it's not sending any data whatsoever. Can you post the full otel configuration that you're currently using? The one you linked in the original issue only has this:

    exporters:
      otlp:

which the collector should reject as invalid.
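For comparison, a minimal valid otlp exporter block needs at least an endpoint; the value below is a placeholder, not the poster's actual backend:

    exporters:
      otlp:
        endpoint: apm-server.example.com:4317   # placeholder; must point at an OTLP-capable backend
        tls:
          insecure: false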
Attaching the complete otel config.yaml file (otelcollector.txt); let me know if you need any additional details. Also, despite having a minimal set of scrape endpoints, this many metric points are failing. Please find the attached screenshot for reference:
Can you try replacing that exporter with the logging exporter temporarily and see how that affects the memory usage and these metrics? As a side note, these metrics are all Sums, so they're easier to visualise as rates.
Hi @swiatekm-sumo, isn't the logging exporter for exporting logs, or can it also export metrics? And if the logging exporter supports the metrics pipeline, can we still use the OTLP endpoint from our current configuration to export metrics to Kibana? Regards,
The logging exporter supports all signal types; it's used for debugging. With the default configuration, it'll just log the number of data points it gets. I'd like you to replace your otlp exporter with logging and see how that affects your setup.
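A hedged sketch of that temporary swap; the receiver name is assumed to match the existing pipeline, and the verbosity option varies slightly between collector versions:

    exporters:
      logging:
        verbosity: basic   # logs a one-line summary per batch; "detailed" dumps every data point

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [logging]   # temporarily in place of the otlp exporter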
Hi @swiatekm-sumo, as per your suggestion we tried the logging exporter and it worked successfully; we don't see any memory/buffer issue so far. But now we want to export the metrics to the Kibana dashboard, so how do we export these metrics to a visualisation tool? We couldn't find any configuration for that.
It looks like your Elasticsearch APM Server isn't accepting the metrics you're sending it over OTLP. Unfortunately I can't help you with that; you probably know more about this stack than I do. You'll have to debug that yourself.
Hi @swiatekm-sumo, our Kibana server/dashboard has accepted the metrics over the OTLP endpoint since we started using it 4-5 weeks ago. We could see all the required and expected metrics in the dashboard, but eventually we started getting the memory-full issue, and then it switched to this buffer-full issue.

Is it possible that switching to the logging exporter for almost 12 hours cleared the exporter buffer, and that is why it started working again when we redeployed with the otlp exporter? Also, yesterday we deployed otelcol on a cluster where no Prometheus was present and it ran smoothly; does that imply that scraping the Prometheus endpoints consumes too much memory and buffer?
I would encourage you to monitor the metrics we've looked at to see how much data you're actually sending and whether the queue size on your exporter is stable. If you'd like to troubleshoot the memory usage further, we're going to need more telemetry from your system, including the actual memory usage over time, datapoints sent, number of scrape targets, etc.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping.
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
We are using the Lightstep otel collector stack integrated with the otel target allocator to scrape Prometheus metrics for the application and the cluster. It works fine and gives the expected output for all the required application metrics, but it consumes a lot of memory and stops scraping data within 5-10 minutes as it exceeds its soft memory limit.
Any suggestions/ideas are welcome.
Please find below the chart repo and the values.yaml file with the current configuration used for our otel setup:
https://github.com/lightstep/otel-collector-charts/tree/main/charts/kube-otel-stack
otel-values.txt
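For context, the "soft limit" mentioned above comes from the collector's memory_limiter processor; a rough sketch of a typical configuration (the values are illustrative, not the chart's defaults):

    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 80          # hard limit as a share of the container's memory
        spike_limit_percentage: 25    # soft limit = hard limit minus this spike allowance

    service:
      pipelines:
        metrics:
          processors: [memory_limiter]   # typically placed first in the pipeline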