Prometheus receiver stops scraping all targets when Kubernetes SD change or become unreachable #1909
Comments
We are seeing this issue as well and it's affecting all the workloads that want to export Prometheus metrics on Kubernetes. Having quickly reviewed the discovery and scraping packages from Prometheus, the usage of these packages seems to be as expected, but I quickly noticed some possible issues in the code. For example, we possibly write to an error channel that is already closed. See https://github.com/open-telemetry/opentelemetry-collector/blob/master/receiver/prometheusreceiver/metrics_receiver.go#L70-L103. I wonder if this section needs a thorough restructuring/review. I'm not very familiar with Prometheus' discovery manager and would appreciate some help. |
Is this issue specific to Kubernetes Endpoint objects, as in the config example? |
@nilebox This issue is affecting only Pods and Endpoints. (I don't believe the rollout restart changes the configurations of any other Kubernetes targets). |
@nilebox I don't have a fix for this at the moment and I'm not actively looking into it, FYI. Feel free to grab it if you have context on it. |
The Prometheus receiver is considered high-value by many participants in the OpenTelemetry metrics community. I suspect someone will pick this up soon, and I am happy to coordinate and discuss technical details. One thing also missing from this receiver is the Prometheus |
@alolita ^^^ |
@jmacd We have been working on this, but I'm still trying to figure out what exactly is causing the bug. Would love some help/guidance if you have the time. |
@jmacd we (@JasonXZLiu @alolita) will take a look at this. |
👍 for the issue, we are trying to use this but we are getting bit by the same error. |
Increasing the memory should help alleviate this problem. It may also help to limit the number/types of metrics that are being scraped in the relabel_configs and metric_relabel_configs. |
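As an illustration of the metric_relabel_configs approach mentioned above, here is a minimal sketch that drops a group of metrics by name before they are ingested (the metric name pattern is purely an example, not taken from this issue); rules like this go under the relevant job in the receiver's scrape_configs:
metric_relabel_configs:
  - source_labels: [__name__]      # match on the metric name
    regex: 'container_network_.*'  # example pattern; replace with the metrics you want to drop
    action: drop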
We are seeing the same problem too. |
It doesn't really seem to be related to the resources allocated to the Collector, at least in my case, because the Operator doesn't actually set any resource limits at all. |
@0902horn we have a control loop that greps the logs for the correct message then kills the pod. A horrible hack whilst we're waiting on a fix. |
So we've done some digging, and it looks like we found the main problem:
We need to access the target metadata on the Collector side in order to access the metric labels. We're currently looking into some solutions like adding a non-blocking API on Prometheus' side to access this metadata. |
I've been able to reproduce this issue consistently on EKS, but I wanted to give it a try on minikube to speed up my debugging cycle and I can't reproduce it there. The scraping errors I see consistently come from the kube-system namespace. Not sure if that's consistent with others' observations. |
We've consistently observed it on EKS as well. However, not from |
Anecdotally, I found it harder to trigger on GKE (but still possible) than EKS. |
With #2089, I can observe it in kube-system and default namespaces but only for Kubernetes components, not for jobs I deployed:
default/kubernetes, the kube-system namespace, and the collector itself are not scrapeable. The API server endpoint needs to be enabled for private access: https://docs.aws.amazon.com/eks/latest/userguide/cluster-endpoint.html#cluster-endpoint-access-console. Other resources need authorization. |
Talked about this at the Collector SIG meeting today; proposing to raise the priority to P1 since this issue is tightly coupled with metrics GA. @bogdandrutu @tigrannajaryan @alolita can join the Friday triage meeting to discuss if helpful. |
I work with @kohrapha on code analysis and testing, and we found the potential root cause. Problem:
We've added logging and proved that this is the scenario causing the deadlock. Solutions:
kohrapha@ is working on verifying fix 1 above and will send out the PR soon. |
The fix PR has been merged, and it works in our EKS test environment. @oktocat can probably verify whether the issue has been resolved. |
@hdj630 give me a few days to test and verify 👍 |
This should be fixed |
When is the next release scheduled? Would be nice to have a new one which includes this fix. |
FWIW, we're not observing the deadlocks with otelcol built from master including #2121 |
I think I am still experiencing the same issue. It's failing constantly on Amazon EKS.
2021-02-19T17:37:35.839Z WARN internal/metricsbuilder.go:104 Failed to scrape Prometheus endpoint {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1613756255838, "target_labels": "map[instance:ip-192-168-24-101.us-east-2.compute.internal job:kubernetes-cadvisor]"} |
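For context, a kubernetes-cadvisor job like the one named in that log line is typically configured along these lines (a sketch based on the standard Prometheus example configuration, not necessarily the exact config used above); it scrapes each node's cAdvisor endpoint through the API server proxy, which is why the nodes/proxy and nodes/metrics RBAC rules discussed below matter:
- job_name: 'kubernetes-cadvisor'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node                               # one target per node
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+) # carry node labels over as metric labels
    - target_label: __address__
      replacement: kubernetes.default.svc:443  # route the scrape through the API server
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor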
@bogdandrutu Can we reopen this? |
I found the cause for my failure case while scraping metrics from I enabled the
Then I had to add permission for these resources in a ClusterRole:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"] |
I am still facing this issue. Even after adding
Can we reopen the issue? |
Testing the same with otel/opentelemetry-collector-contrib:latest and the following config, even after adding nodes/metrics in the ClusterRole:
prometheus:
  config:
    scrape_configs:
      - job_name: 'otel-collector'
        scrape_interval: 10s
        static_configs:
          - targets: ['0.0.0.0:8888']
      - job_name: 'node'
        scrape_interval: 10s
        static_configs:
          - targets: ['0.0.0.0:9100']
Error:
|
For those using EKS with Terraform and |
Describe the bug
otel-collector running with the Prometheus receiver configured to scrape Prometheus-compatible endpoints discovered via kubernetes_sd_configs stops scraping when some service discovery endpoints change or become unreachable (which naturally happens during every deployment and subsequent rolling restart). The receiver seems to hit a deadlock somewhere while updating the SD target groups.
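For reference, a minimal sketch of a Prometheus receiver configuration of the kind described here (the job name, interval, and relabel rule are illustrative; the reporter's full config is in the gist linked below):
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'   # illustrative job name
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod                 # pod targets churn on every rolling restart
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true               # only scrape pods annotated prometheus.io/scrape=true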
Steps to reproduce
otel-collector config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
To trigger the issue, it's enough to initiate a rolling restart of one of the target deployments. When this happens, the collector debug logs show the following:
After this all Prometheus receiver scraping stops (or at least the Prometheus exporter endpoint is not updating).
What did you expect to see?
Prometheus receiver gracefully handling some targets becoming unavailable, as well as the changes in service discovery targets.
What did you see instead?
Prometheus receiver scraping stops functioning completely.
What version did you use?
From /debug/servicez:
What config did you use?
Config: (e.g. the yaml config file)
https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
Environment
Go version: go1.14.7
OS: linux
Architecture: amd64
Kubernetes: 1.17 on EKS
Additional context
The issue exists at least in 0.2.7, 0.8.0, 0.10.0 and the latest master.