High memory usage for otelcollector during scraping of Prometheus metrics #21912

Closed
vermaprateek695 opened this issue May 12, 2023 · 25 comments
Labels: bug (Something isn't working), closed as inactive, Stale

@vermaprateek695 commented May 12, 2023

We are using the Lightstep otel-collector stack integrated with the OpenTelemetry target allocator to scrape Prometheus metrics for our applications and cluster. It works fine and produces all the expected application metrics, but the collector consumes a lot of memory and stops scraping data within 5-10 minutes once it exceeds its soft memory limit.
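(For context, the "soft limit" here most likely refers to the collector's memory_limiter processor. A minimal sketch of how its limits are typically configured is shown below; the values are purely illustrative and not taken from this setup.)

```yaml
processors:
  memory_limiter:
    check_interval: 1s          # how often memory usage is measured
    limit_percentage: 80        # hard limit as a share of available memory
    spike_limit_percentage: 25  # headroom below the hard limit; the difference acts as the soft limit
```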

Any suggestions or ideas are welcome.

Please find below the chart repo and the values.yaml file with the current configuration used for our OTel setup:

https://github.com/lightstep/otel-collector-charts/tree/main/charts/kube-otel-stack
otel-values.txt

[screenshot attached]

vermaprateek695 added the bug (Something isn't working) label on May 12, 2023
bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector on May 14, 2023
@jaronoff97 (Contributor) commented:

Hello! I'm the creator of the kube-otel-stack chart and will probably be upstreaming what I'm about to recommend...

@vermaprateek695 I would recommend adding a filter strategy to the target allocator configuration, which will drop targets that aren't being scraped and should help reduce memory usage. We've also found that increasing the interval at which the collector pulls configuration from the target allocator decreases target allocator memory usage.
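For illustration, a sketch of what enabling the filter strategy could look like on the OpenTelemetryCollector resource (field names follow the opentelemetry-operator CRD; availability depends on the operator version, and how the kube-otel-stack values file exposes this may differ):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: metrics            # hypothetical name
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    filterStrategy: relabel-config   # drop targets that relabel rules would discard anyway
    prometheusCR:
      enabled: true
```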

In addition to all of that, what version of the collector and target allocator are you on?

@vermaprateek695 (Author) commented May 17, 2023

Hi @jaronoff97

  1. We did not find any concrete example of how to add the filter strategy to the target allocator configuration in
     [https://opentelemetry.io/docs/collector/configuration/].
     If possible, can you provide a reference or example of how to implement the filter strategy in the target allocator?

  2. We have also increased the metrics scrape interval from 30s to 60s, but it did not improve the situation, and now
     we get a "buffer is full" error.

     We updated the scrape_configs file with the change below:
     scrape_interval: {{ .Values.kubelet.serviceMonitor.interval | default "60s" }}

[screenshot attached]

Please find the logs below:

2023-05-17T08:03:06.089Z	warn	internal/transaction.go:121	Failed to scrape Prometheus endpoint	{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1684310576088, "target_labels": "{__name__=\"up\", endpoint=\"http-metrics-dnsmasq\", instance=\"100.96.2.80:10054\", job=\"kube-dns\", service=\"prometheus-stack-kube-dns\"}"}
2023-05-17T08:03:07.347Z	error	exporterhelper/queued_retry.go:401	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "otlp", "error": "Permanent error: rpc error: code = ResourceExhausted desc = Buffer is full, request timed out.", "dropped_items": 73}
  3. The collector and target allocator version is 0.73.0.

@jaronoff97 (Contributor) commented:

Apologies, the docs are still being worked on here. Sorry, I didn't mean the scrape interval in your scrape configs, but rather the interval at which the collector fetches configuration from the target allocator. Either way, I double-checked the values in the chart and they should be okay.
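For reference, the interval in question lives in the prometheus receiver's target_allocator section, roughly like this (the service name is a placeholder, the value is illustrative, and the existing config block stays as it is):

```yaml
receivers:
  prometheus:
    # existing `config:` block with scrape_configs stays as-is
    target_allocator:
      endpoint: http://metrics-targetallocator:80   # placeholder service name
      interval: 60s            # how often the collector polls the target allocator for its targets
      collector_id: ${POD_NAME}
```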

How much memory do your collectors and target allocator(s) use? Are you able to send the count of the scrape_samples_scraped? It would be good to know how many metrics you are attempting to scrape.

@krimeshshah commented May 17, 2023

Hi @jaronoff97

Thanks for the reply and for suggesting a solution. We will check the documentation on how to add the filter strategy. We will also provide the memory consumed by both the target allocator and the OTel collector, but from my observation over the past couple of weeks the target allocator consumes minimal memory; the most memory-hungry component is the OTel collector pod, at more than 85% and up to 96%. Regarding the count of scrape_samples_scraped, any hint on how to get those counts?

We now also have another problem: the buffer is full. We no longer see data being dropped because memory is full; instead, exports fail because the exporter's buffer is full. Any idea on this as well? It looks like a significant performance issue.

Here are the logs for the buffer issue. Once the buffer issue is resolved and metrics start exporting again, memory consumption will climb back up and we will be able to provide concrete memory numbers.
```
lp-prometheus-node-exporter-xxlrl", service="monitoring-dev-lp-prometheus-node-exporter"}"}
2023-05-17T17:09:56.629Z error exporterhelper/queued_retry.go:401 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "otlp", "error": "Permanent error: rpc error: code = ResourceExhausted desc = Buffer is full, request timed out.", "dropped_items": 516}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
go.opentelemetry.io/collector/[email protected]/exporterhelper/queued_retry.go:401
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
go.opentelemetry.io/collector/[email protected]/exporterhelper/queued_retry.go:401
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
go.opentelemetry.io/collector/[email protected]/exporterhelper/metrics.go:136
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
go.opentelemetry.io/collector/[email protected]/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1

2023-05-17T17:10:06.090Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1684343396088, "target_labels": "{__name__="up", endpoint="http-metrics-dnsmasq", instance="100.96.2.80:10054", job="kube-dns", namespace="kube-system", pod="coredns-7d8c995bfd-h68gg", service="prometheus-stack-kube-dns"}"}
2023-05-17T17:10:07.359Z error exporterhelper/queued_retry.go:401 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "otlp", "error": "Permanent error: rpc error: code = ResourceExhausted desc = Buffer is full, request timed out.", "dropped_items": 71}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
go.opentelemetry.io/collector/[email protected]/exporterhelper/queued_retry.go:401
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
go.opentelemetry.io/collector/[email protected]/exporterhelper/metrics.go:136
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
go.opentelemetry.io/collector/[email protected]/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/bounded_memory_queue.go:60
```

@jaronoff97 (Contributor) commented:

Hello, I'm not sure I've seen this one before... I'm wondering if tuning the exporter's configuration may help here (see the exporter helper docs). Something like decreasing the queue size?
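For illustration, the queue and retry knobs sit on the exporter itself; a sketch with illustrative values (the endpoint is a placeholder, not the reporter's real configuration):

```yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317   # placeholder
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 1000        # a smaller queue bounds how much unexported data is held in memory
    retry_on_failure:
      enabled: true
      max_elapsed_time: 60s   # give up sooner so the queue drains instead of growing
```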

@krimeshshah commented:

Hi @jaronoff97,

Decreasing the queue size did not work for us. The other option, using a persistent queue that stores the queue on the container filesystem instead of in an in-memory buffer, throws an error saying it cannot find the directory. As per the official documentation, the default storage directory for file_storage is /var/lib/otelcol/file_storage (https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/storage/filestorage), but after deploying it we get a "file/directory doesn't exist" error. It is also not possible to exec into the container to create the directory or a volume. Is there any way to achieve this?
```
I0522 22:11:54.022752 1555 round_trippers.go:463] Date: Mon, 22 May 2023 16:41:53 GMT
Error: invalid configuration: extensions::file_storage: directory must exist: stat /var/lib/otelcol/file_storage: no such file or directory
2023/05/22 16:41:28 collector server run finished with error: invalid configuration: extensions::file_storage: directory must exist: stat /var/lib/otelcol/file_storage: no such file or directory
```
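One way this is commonly worked around (a sketch, not verified against this particular chart) is to mount a volume at the expected path through the OpenTelemetryCollector resource so that the directory exists inside the container:

```yaml
spec:
  volumes:
    - name: file-storage
      emptyDir: {}            # note: an emptyDir does not survive Pod rescheduling
  volumeMounts:
    - name: file-storage
      mountPath: /var/lib/otelcol/file_storage
```

The file_storage extension's `directory` would then point at that mount, and the extension would be listed under `service.extensions`.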

@jaronoff97 (Contributor) commented:

Hmm, I may ask @swiatekm-sumo, who I think knows more about that code path than I do.

@swiatekm (Contributor) commented:

Reading through this issue, it doesn't look like your problem lies with the collector at all, but rather with the remote throwing errors when you try to export: `Permanent error: rpc error: code = ResourceExhausted desc = Buffer is full, request timed out.` You're probably seeing memory usage increase because the otlp exporter has the in-memory queue enabled by default, and the queue grows as you produce data faster than you can export it.

It's difficult to tell what exactly the root cause is here without further debugging. I'd proceed by doing the following:

  • confirm the queue size is increasing by looking at the otelcol_exporter_queue_size metric of the collector
  • check how much data you're actually exporting per collector Pod - see the otelcol_exporter_sent_metric_points metric
  • try reducing the amount of incoming data - probably easiest by adding a ServiceMonitor selector. You can also try adding more collector replicas

I think you should ignore the storage extension and the persistent queue for now. Your problem is most likely that your queue keeps growing; changing the queue type won't help with that.
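To make the ServiceMonitor selector suggestion above concrete, a hedged sketch of restricting the target allocator to labelled ServiceMonitors (the exact field shape varies by operator version: older versions take a flat label map as shown here, newer ones a full label selector with matchLabels):

```yaml
spec:
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true
      serviceMonitorSelector:
        release: monitoring-dev   # hypothetical label; only ServiceMonitors carrying it are picked up
```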

@vermaprateek695 (Author) commented May 24, 2023

Hi @swiatekm-sumo

We checked both metrics (otelcol_exporter_queue_size, otelcol_exporter_sent_metric_points) and the value is zero for both. That may be because the buffer fills up as soon as we deploy the OTel collector.

Do you have any clue or suggestions that could help here?

Please find screenshots below for reference:
[screenshots attached]

@swiatekm (Contributor) commented:

It looks like the remote (lightstep I'm guessing?) is just rejecting your data. Are you sure your exporter configuration is correct? You can also try to look at some other exporter metrics to confirm your data is being dropped.

Overall though, you're going to need help from someone familiar with Lightstep and that chart. I'm not sure why the collector is using a lot of memory with this setup. You could look at how much data you're scraping - I believe the metric for that is otelcol_receiver_accepted_metric_points.

@vermaprateek695 (Author) commented May 24, 2023

Hi @swiatekm-sumo

We are not using Lightstep as the remote system; we are using Kibana. The buffer issue only started appearing last week; before that it was just the memory issue that caused data to be dropped. We are using the same exporter configuration that worked a few weeks ago.

Regarding otelcol_receiver_accepted_metric_points, we checked and can see that it scrapes an immense amount of data; please see the attached screenshot. We will check whether we can reduce some endpoints, but as you can see in the screenshot, the major source of scraped data is the monitoring namespace, which contains the ServiceMonitor endpoints and other Prometheus endpoints, and it is required.

[screenshot attached]

@vermaprateek695 (Author) commented:

Hi @swiatekm-sumo
In addition, this is the scrape_config.yaml file where we have already commented out some scrape jobs to reduce the number of scrape endpoints, and it still gives the buffer issue.

Correct me if this is not the right file for reducing the scrape endpoints.

Please find the scrape_configs file attached:

scrape_configs.txt

@swiatekm (Contributor) commented:

According to your metrics, you only have an otlp exporter, and it's not sending any data whatsoever. Can you post the full otel configuration that you're currently using? The one you linked in the original issue only has this:

exporters:
  otlp:

which the collector should reject due to it being invalid.
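For comparison, a minimal valid otlp exporter needs at least an endpoint; a sketch with placeholder values (not the reporter's real configuration):

```yaml
exporters:
  otlp:
    endpoint: apm-server.example.com:8200   # placeholder host:port of the OTLP/gRPC receiver
    headers:
      Authorization: "Bearer ${APM_SECRET_TOKEN}"   # placeholder auth header
```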

@vermaprateek695 (Author) commented:

Hi @swiatekm-sumo
We are using the otlp exporter with configuration details such as the endpoint and auth, but we are unable to share the complete details due to security concerns.

Please find a screenshot below for reference:

[screenshot attached]

@vermaprateek695 (Author) commented May 25, 2023

Hi @swiatekm-sumo

Attaching the complete otel config.yaml file (otelcollector.txt); do let me know if you need any additional details.

Also, despite having a minimal set of scrape endpoints, this many metric points are failing. Please find a screenshot attached for reference:
[screenshot attached]

@swiatekm (Contributor) commented:

Can you try replacing that exporter with the logging exporter temporarily and see how that affects the memory usage and these metrics?

As a side note, these metrics are all Sums, so they're easier to visualise with a rate operator.

@vermaprateek695 (Author) commented:

Hi @swiatekm-sumo,

But isn't the logging exporter for exporting logs, or can it also export metrics? And if the logging exporter supports the metrics pipeline, can we use the OTLP endpoint from our current configuration to export metrics to Kibana?

Regards,
Prateek

@swiatekm (Contributor) commented:

The logging exporter supports all signal types; it's used for debugging. With the default configuration, it'll just log the number of data points it receives. I'd like you to replace your otlp exporter with logging and see how that affects your setup.
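As a sketch (receiver and processor names are assumed to mirror the existing pipeline), the swap could look like:

```yaml
exporters:
  logging:
    verbosity: normal   # "detailed" prints every data point; "normal" only logs counts

service:
  pipelines:
    metrics:
      receivers: [prometheus]               # assumed to match the existing pipeline
      processors: [memory_limiter, batch]   # assumed to match the existing pipeline
      exporters: [logging]                  # swap back to [otlp] once the test is done
```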

@vermaprateek695 (Author) commented:

Hi @swiatekm-sumo

As per your suggestion we tried the logging exporter and it worked successfully; we don't see any memory/buffer issues so far.

But now we want to export the metrics to the Kibana dashboard, so how do we export these metrics to a visualisation tool? We couldn't find any configuration for that.

@swiatekm (Contributor) commented:

It looks like your Elasticsearch APM Server isn't accepting the metrics you're sending it over OTLP. Unfortunately I can't help you with that - you probably know more about this stack than I do. You'll have to debug that yourself.

@vermaprateek695 (Author) commented:

Hi @swiatekm-sumo

Our Kibana server/dashboard has accepted metrics over the OTLP endpoint since we started using it 4-5 weeks ago, and we could see all the required and expected metrics in the dashboard, but eventually we started getting the memory-full issue, which then switched to this buffer-full issue.
Yesterday, after running otelcol with the logging exporter as you suggested, we uninstalled it and redeployed it with the otlp exporter and the OTLP endpoint. Surprisingly, we have now started seeing metrics in the dashboard over OTLP.

Is it possible that running the logging exporter for almost 12 hours cleared the exporter buffer, so that when we redeployed with the otlp exporter it started working? Also, yesterday we deployed otelcol on a cluster where no Prometheus stack existed and it ran smoothly. Does that imply that scraping the Prometheus endpoints consumes too much memory and buffer space?

@swiatekm (Contributor) commented:

I would encourage you to monitor the metrics we've looked at to see how much data you're actually sending and whether the queue size on your exporter is stable. If you'd like to troubleshoot the memory usage further, we're going to need more telemetry from your system, including the actual memory usage over time, datapoints sent, number of scrape targets, etc.
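If it helps, the collector's own telemetry can be exposed and scraped along these lines (a sketch; the job name is hypothetical and the Lightstep chart may already configure an equivalent):

```yaml
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888   # serves the otelcol_* metrics in Prometheus format

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector-self   # hypothetical job name
          scrape_interval: 60s
          static_configs:
            - targets: ["0.0.0.0:8888"]
```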

@github-actions (bot) commented:

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

The github-actions bot added the Stale label on Jul 26, 2023
@github-actions (bot) commented:

This issue has been closed as inactive because it has been stale for 120 days with no activity.

The github-actions bot closed this as not planned on Sep 25, 2023.