Queue piles up and metrics are delayed if targets are unreachable #34229
Comments
This doesn't look like an issue with the receiver. Does increasing the number of consumers help? |
No. It does not help. I think it's a performance issue with the exporter. After further experimentation, increasing the batch size to 50000 stopped me from having problems with the queue. |
Looks like the number of consumers is hard-coded to 1, which would explain why bumping that up doesn't help... opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter/factory.go Lines 51 to 70 in 0f63b5a
We should probably emit a warning for people who try to configure this. |
So increasing the batch size is probably the only resolution we can offer, which you found. Action items:
I'm struggling a bit to understand how to log a warning. Which object has a logger object during the exporter creation? I can't find any 🤔 |
You can use opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter/factory.go Line 38 in c6cda87
That is https://pkg.go.dev/go.opentelemetry.io/collector/[email protected]/internal#Settings It includes a Logger: https://pkg.go.dev/go.opentelemetry.io/collector/component#TelemetrySettings So you can do |
…te_write_queue.num_consumers being a no-op (#34993) **Description:** This PR documents and adds a warning log if remote_write_queue.num_consumers is set in the prometheusremotewriteexporter's config. Current behavior already doesn't use the configuration for anything, more information can be found in open-telemetry/opentelemetry-collector#2949 **Link to tracking Issue:** Related to #34229 (not a fix) Should we skip changelog here? Signed-off-by: Arthur Silva Sens <[email protected]>
We started getting this warning after upgrading to v0.111, and found it and the README slightly misleading. The value of |
Thanks @ubcharron. @ArthurSens we should revert #34993 |
Whoops, my bad. Here is the PR: #35845 |
Discussed this during triage today. Given the large batch size you've already configured, this sounds like an issue that needs to be resolved with the backend. |
Hey @gbxavier, sorry for the delay here. I've contacted a few colleagues at Grafana, who told me that the recommended batch size for sending metrics to Grafana Cloud is 8192. Could you try adjusting and see if the queue size decreases? Another option to send metrics to Grafana Cloud is |
Component(s)
exporter/prometheusremotewrite, receiver/prometheus
What happened?
Description
I have a dedicated instance to scrape metrics from pods exposing Prometheus Endpoints. The instance has more than enough resources to process a lot of metrics (resource utilization never reaches 60%). In this setup, most namespaces are under Network Policies preventing this scraper from reaching the pods found in the Kubernetes autodiscovery.
At first, the metrics reach the target (Grafana Cloud) as expected, but memory consumption immediately starts to grow, and the queue size grows slowly until it reaches capacity and enqueuing starts failing.
The number of metrics received and sent remains constant, but over time the delay between a metric being "seen" by the collector and being sent to the backend slowly grows, to the point that the last observed data point is hours late (but still received by the backend). This behavior is observed for all receivers configured in the instance, including the prometheus/self instance, which has no problem scraping its metrics. This behavior only happens when the workload_prometheus is enabled, and no other instance suffers from this problem or from any performance/limits issues.
Steps to Reproduce
Expected Result
The receiver scrapes the metrics from the endpoints it can reach and those metrics are correctly sent through the Prometheus Remote Write Exporter reasonably fast.
Actual Result
Memory consumption increases over time, and the delay between a metric being "seen" by the collector and being sent to the backend slowly grows, to the point that the last observed data point is hours late.
Collector version
0.105.0
Environment information
Environment
OS: AKSUbuntu-2204gen2containerd-202407.03.0
Kubernetes: Azure AKS v1.29.2
OpenTelemetry Collector configuration
Log output
Additional context
The resource utilization is low, but memory grows over time up to 50% of the configured limit, specified below.
Screenshot with metrics from this scraper instance.