
Prometheus receiver stops scraping all targets when Kubernetes SD targets change or become unreachable #1909

Closed
oktocat opened this issue Oct 6, 2020 · 32 comments
Labels
bug (Something isn't working), priority:p3 (Lowest)

Comments

@oktocat

oktocat commented Oct 6, 2020

Describe the bug
otel-collector running with the Prometheus receiver configured to scrape Prometheus-compatible endpoints discovered via kubernetes_sd_configs stops scraping when some service discovery endpoints change or become unreachable (which naturally happens during every deployment and subsequent rolling restart).
The receiver appears to hit a deadlock somewhere while updating the SD target groups.

Steps to reproduce
otel-collector config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml

To trigger the issue, it's enough to initiate a rolling restart of one of the target deployments. When this happens, the collector debug logs show the following:

{"level":"info","ts":1601986494.9710436,"caller":"service/service.go:252","msg":"Everything is ready. Begin running and processing data."}


{"level":"debug","ts":1601995775.1718767,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.37.173:1234/","err":"Get \"http://10.1.37.173:1234/\": dial tcp 10.1.37.173:1234: connect: connection refused"}
{"level":"warn","ts":1601995775.1720421,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995775171,"target_labels":"map[component:oap instance:10.1.37.173:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995776.6160927,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.7.143:1234/","err":"Get \"http://10.1.7.143:1234/\": dial tcp 10.1.7.143:1234: connect: connection refused"}
{"level":"warn","ts":1601995776.6162364,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995776615,"target_labels":"map[component:oap instance:10.1.7.143:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995798.0816824,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.49.45:1234/","err":"Get \"http://10.1.49.45:1234/\": context deadline exceeded"}
{"level":"debug","ts":1601995824.7997108,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
{"level":"debug","ts":1601995829.799763,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}


(ad infinitum)

After this, all Prometheus receiver scraping stops (or at least the Prometheus exporter endpoint stops updating).

What did you expect to see?
The Prometheus receiver gracefully handling some targets becoming unavailable, as well as changes in the service discovery targets.

What did you see instead?
Prometheus receiver scraping stops functioning completely.

What version did you use?
from /debug/servicez:

GitHash  c8aac9e3
BuildType release
Goversion  go1.14.7
OS  linux
Architecture amd64

What config did you use?
Config: (e.g. the yaml config file)
https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
Environment

Goversion go1.14.7
OS linux
Architecture amd64
Kubernetes 1.17 on EKS

Additional context
The issue exists at least in 0.2.7, 0.8.0, 0.10.0 and the latest master.

@oktocat oktocat added the bug Something isn't working label Oct 6, 2020
@rakyll
Contributor

rakyll commented Oct 20, 2020

We are seeing this issue as well, and it's affecting all the workloads that want to export Prometheus metrics on Kubernetes.

Having quickly reviewed the discovery and scraping packages from Prometheus, the usage of these packages seems to be as expected, but I quickly noticed some possible issues in the code. For example, we possibly write to an error channel that is already closed. See https://github.com/open-telemetry/opentelemetry-collector/blob/master/receiver/prometheusreceiver/metrics_receiver.go#L70-L103. I wonder if this section needs a thorough restructuring/review. I'm not very familiar with Prometheus' discovery manager and would appreciate some help.
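
To illustrate the hazard described above, here is a minimal, self-contained Go sketch (illustrative only, not the receiver's actual code) of why a late send on an already-closed error channel is dangerous: the send panics the whole process unless recovered.

package main

import "fmt"

func main() {
	errCh := make(chan error, 1)

	// A shutdown path closes the channel...
	close(errCh)

	// ...while some other code path may still try to report an error.
	// Sending on a closed channel panics; without the recover below it
	// would take down the whole process.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from:", r) // "send on closed channel"
		}
	}()
	errCh <- fmt.Errorf("scrape failed")
}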

cc @bogdandrutu @dinooliva

@nilebox
Member

nilebox commented Oct 20, 2020

@rakyll @oktocat feel free to submit a PR with a fix and I can help with reviewing it.

@nilebox
Member

nilebox commented Oct 21, 2020

Is this issue specific to Kubernetes Endpoint objects, as in the config example (role: endpoint), or does it also affect Pod / Service / other Kubernetes targets?

@JasonXZLiu
Member

@nilebox This issue is affecting only Pods and Endpoints. (I don't believe the rollout restart changes the configurations of any other Kubernetes targets).

@rakyll
Contributor

rakyll commented Oct 27, 2020

@nilebox FYI, I don't have a fix for this issue at the moment and I'm not actively looking into it. Feel free to grab it if you have context on it.

@jmacd
Contributor

jmacd commented Oct 30, 2020

The Prometheus receiver is considered high-value by many participants in the OpenTelemetry metrics community. I suspect someone will pick this up soon, and I am happy to coordinate and discuss technical details. One thing also missing from this receiver is the Prometheus up semantic convention, see the associated spec issue: open-telemetry/opentelemetry-specification#1102

@jmacd
Contributor

jmacd commented Oct 30, 2020

@alolita ^^^

@JasonXZLiu
Member

@jmacd We have been working on this, but I'm still trying to figure out what exactly is causing the bug. Would love some help/guidance if you have the time.

@alolita
Member

alolita commented Oct 31, 2020

@jmacd we (@JasonXZLiu @alolita) will take a look at this.

@ekarlso

ekarlso commented Nov 2, 2020

👍 for the issue; we are trying to use this but are getting bitten by the same error.

@JasonXZLiu
Member

For the "Discovery receiver's channel was full so will retry the next cycle" issue, it seems to be related to the memory allocated to the OTel Collector in the Kubernetes deployment. Prometheus has similar errors, which can be seen here. Essentially, the server is being overloaded.

Increasing the memory should help alleviate this problem. It may also help to limit the number/types of metrics being scraped via the relabel_configs and metric_relabel_configs.
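
For context, this debug message appears to come from a non-blocking send in Prometheus' discovery manager (discovery/manager.go in the logs above): when the receiving side does not drain the channel in time, the update is dropped and retried on the next cycle. Below is a minimal Go sketch of that pattern (simplified, not the actual manager code).

package main

import (
	"fmt"
	"time"
)

func main() {
	// Channel carrying target-group updates from discovery to the scraper.
	syncCh := make(chan []string, 1)

	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 5; i++ {
		<-ticker.C
		update := []string{fmt.Sprintf("10.0.0.%d:1234", i)}
		select {
		case syncCh <- update:
			fmt.Println("sent update", update)
		default:
			// Nobody is draining syncCh (the consumer is overloaded or stuck),
			// so this cycle's update is skipped and retried later.
			fmt.Println("channel was full, will retry the next cycle")
		}
	}
}

If the consumer never drains the channel at all, for example because it is stuck, the message repeats indefinitely, which matches the logs in the original report.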

@0902horn

0902horn commented Nov 3, 2020

We are seeing the same problem too.
Could we check the health of the prometheus receiver in the health check extension? At the very least, otel-collector could recover from this issue automatically when the liveness probe fails, if it is deployed in k8s.
Thanks.

@ekarlso

ekarlso commented Nov 3, 2020

It doesn't really seem to be related to the resources allocated to the Collector, at least in my case, because the Operator doesn't actually set any resource limits at all.

@liamawhite
Contributor

liamawhite commented Nov 3, 2020

@0902horn we have a control loop that greps the logs for the relevant message and then kills the pod. A horrible hack whilst we're waiting on a fix.

@JasonXZLiu
Member

JasonXZLiu commented Nov 5, 2020

So we've done some digging, and it looks like we found the main problem:

When targets reload in Prometheus, the scrape manager attempts to sync the scrape pools and acquires a scrape.Manager mutex while the sync is performed. The sync creates a new Storage.Appender (a Transaction in the OTel PrometheusReceiver) and runs the new scrape pools for the new targets, which eventually call scrapeAndReport to add the scraped metrics to the transaction. The transaction's Add function (on the OTel side) needs to initialize the transaction and fetch the metadata for its target by calling TargetsAll, which needs to acquire the same scrape.Manager mutex. Since that mutex is already held by the sync, the result is a deadlock.

We need the target metadata on the Collector side in order to build the metric labels. We're currently looking into solutions such as adding a non-blocking API on Prometheus' side to access this metadata.
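
To make the cycle concrete, here is a minimal, self-contained Go sketch of the same shape of deadlock (illustrative only, with simplified names; not the actual scrape.Manager / scrapePool code): Sync holds a mutex while waiting for a scrape loop to finish, and that loop's report path calls TargetsAll, which needs the same mutex.

package main

import (
	"fmt"
	"sync"
	"time"
)

// manager stands in for the scrape-manager/scrape-pool state that the
// description above says is guarded by a single mutex.
type manager struct {
	mtx  sync.Mutex
	done chan struct{}
}

// TargetsAll is what the receiver's transaction calls to look up target metadata.
func (m *manager) TargetsAll() []string {
	m.mtx.Lock() // blocks forever: Sync below already holds mtx
	defer m.mtx.Unlock()
	return []string{"10.1.37.173:1234"}
}

// loop mimics a scrape loop's final scrapeAndReport before exiting:
// appending to the transaction requires target metadata.
func (m *manager) loop() {
	_ = m.TargetsAll()
	close(m.done)
}

// Sync mimics a target reload: it takes the lock, then waits for the loop
// to stop, which can never happen because the loop is stuck in TargetsAll.
func (m *manager) Sync() {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	go m.loop()
	select {
	case <-m.done:
		fmt.Println("loop exited cleanly")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock: Sync holds mtx, loop waits for mtx")
	}
}

func main() {
	m := &manager{done: make(chan struct{})}
	m.Sync()
}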

@rakyll
Contributor

rakyll commented Nov 5, 2020

I've been able to reproduce this issue consistently on EKS, but when I tried it on minikube to speed up my debugging cycle, I couldn't reproduce it anymore. The scraping errors I see consistently come from the kube-system namespace. Not sure if that's consistent with others' observations.

@oktocat
Author

oktocat commented Nov 5, 2020

I've been able to reproduce this issue consistently on EKS, but when I tried it on minikube to speed up my debugging cycle, I couldn't reproduce it anymore. The scraping errors I see consistently come from the kube-system namespace. Not sure if that's consistent with others' observations.

We've consistently observed it on EKS as well; however, not from the kube-system namespace but from application namespaces.

@liamawhite
Contributor

Anecdotally, I found it harder to trigger on GKE (but still possible) than on EKS.

@rakyll
Contributor

rakyll commented Nov 9, 2020

With #2089, I can observe it in kube-system and default namespaces but only for Kubernetes components, not for jobs I deployed:

{"level":"warn","ts":1604883816.6850007,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883816684,"target_labels":"map[Namespace:kube-system container_name:kube-proxy controller_revision_hash:78db775dbb instance:192.168.27.201:80 job:kubernetes-pods k8s_app:kube-proxy pod_controller_kind:DaemonSet pod_controller_name:kube-proxy pod_name:kube-proxy-jcbmh pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883817.3347874,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883817333,"target_labels":"map[Namespace:kube-system container_name:aws-vpc-cni-init controller_revision_hash:858b677c56 instance:192.168.46.20:80 job:kubernetes-pods k8s_app:aws-node pod_controller_kind:DaemonSet pod_controller_name:aws-node pod_name:aws-node-cgvkj pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883817.5157957,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883815514,"target_labels":"map[Namespace:kube-system Service:kube-dns container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.40.203:53 job:kubernetes-service-endpoints k8s_app:kube-dns kubernetes_node:ip-192-168-46-20.ec2.internal pod_name:coredns-75b44cb5b4-xf7c5 pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883818.0711527,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883816069,"target_labels":"map[Namespace:kube-system Service:kube-dns container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.52.57:53 job:kubernetes-service-endpoints k8s_app:kube-dns kubernetes_node:ip-192-168-46-20.ec2.internal pod_name:coredns-75b44cb5b4-62hcq pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883819.067344,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883817065,"target_labels":"map[Namespace:kube-system container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.52.57:53 job:kubernetes-pods k8s_app:kube-dns pod_controller_kind:ReplicaSet pod_controller_name:coredns-75b44cb5b4 pod_name:coredns-75b44cb5b4-62hcq pod_phase:Running pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883821.9364944,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883821936,"target_labels":"map[Namespace:otelcol Service:otel-collector app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55680 job:kubernetes-service-endpoints kubernetes_node:ip-192-168-27-201.ec2.internal pod_name:otel-collector-869d4bc96-wpwg5 pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883822.3434792,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883822343,"target_labels":"map[Namespace:otelcol Service:otel-collector app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55679 job:kubernetes-service-endpoints kubernetes_node:ip-192-168-27-201.ec2.internal pod_name:otel-collector-869d4bc96-wpwg5 pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883823.4060252,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883823405,"target_labels":"map[Namespace:kube-system container_name:kube-proxy controller_revision_hash:78db775dbb instance:192.168.46.20:80 job:kubernetes-pods k8s_app:kube-proxy pod_controller_kind:DaemonSet pod_controller_name:kube-proxy pod_name:kube-proxy-fmvsp pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883823.6276975,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883823627,"target_labels":"map[Namespace:otelcol app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55680 job:kubernetes-pods pod_controller_kind:ReplicaSet pod_controller_name:otel-collector-869d4bc96 pod_name:otel-collector-869d4bc96-wpwg5 pod_phase:Running pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883824.0634017,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883824062,"target_labels":"map[Namespace:default Service:kubernetes instance:192.168.66.101:443 job:kubernetes-service-endpoints]"}
{"level":"warn","ts":1604883824.390361,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883824390,"target_labels":"map[Namespace:otelcol app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55679 job:kubernetes-pods pod_controller_kind:ReplicaSet pod_controller_name:otel-collector-869d4bc96 pod_name:otel-collector-869d4bc96-wpwg5 pod_phase:Running pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883824.647588,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883824646,"target_labels":"map[Namespace:default Service:kubernetes instance:192.168.106.135:443 job:kubernetes-service-endpoints]"}
{"level":"warn","ts":1604883825.5766149,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883825576,"target_labels":"map[Namespace:kube-system container_name:aws-vpc-cni-init controller_revision_hash:858b677c56 instance:192.168.27.201:80 job:kubernetes-pods k8s_app:aws-node pod_controller_kind:DaemonSet pod_controller_name:aws-node pod_name:aws-node-fmpbw pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883825.9116442,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883823910,"target_labels":"map[Namespace:kube-system container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.40.203:53 job:kubernetes-pods k8s_app:kube-dns pod_controller_kind:ReplicaSet pod_controller_name:coredns-75b44cb5b4 pod_name:coredns-75b44cb5b4-xf7c5 pod_phase:Running pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883826.6849763,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883826684,"target_labels":"map[Namespace:kube-system container_name:kube-proxy controller_revision_hash:78db775dbb instance:192.168.27.201:80 job:kubernetes-pods k8s_app:kube-proxy pod_controller_kind:DaemonSet pod_controller_name:kube-proxy pod_name:kube-proxy-jcbmh pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883827.3348367,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883827333,"target_labels":"map[Namespace:kube-system container_name:aws-vpc-cni-init controller_revision_hash:858b677c56 instance:192.168.46.20:80 job:kubernetes-pods k8s_app:aws-node pod_controller_kind:DaemonSet pod_controller_name:aws-node pod_name:aws-node-cgvkj pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883827.5158937,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883825514,"target_labels":"map[Namespace:kube-system Service:kube-dns container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.40.203:53 job:kubernetes-service-endpoints k8s_app:kube-dns kubernetes_node:ip-192-168-46-20.ec2.internal pod_name:coredns-75b44cb5b4-xf7c5 pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883828.071196,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883826069,"target_labels":"map[Namespace:kube-system Service:kube-dns container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.52.57:53 job:kubernetes-service-endpoints k8s_app:kube-dns kubernetes_node:ip-192-168-46-20.ec2.internal pod_name:coredns-75b44cb5b4-62hcq pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883829.0673833,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883827065,"target_labels":"map[Namespace:kube-system container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.52.57:53 job:kubernetes-pods k8s_app:kube-dns pod_controller_kind:ReplicaSet pod_controller_name:coredns-75b44cb5b4 pod_name:coredns-75b44cb5b4-62hcq pod_phase:Running pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883831.936537,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883831936,"target_labels":"map[Namespace:otelcol Service:otel-collector app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55680 job:kubernetes-service-endpoints kubernetes_node:ip-192-168-27-201.ec2.internal pod_name:otel-collector-869d4bc96-wpwg5 pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883832.3434823,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883832343,"target_labels":"map[Namespace:otelcol Service:otel-collector app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55679 job:kubernetes-service-endpoints kubernetes_node:ip-192-168-27-201.ec2.internal pod_name:otel-collector-869d4bc96-wpwg5 pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883833.4060094,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883833405,"target_labels":"map[Namespace:kube-system container_name:kube-proxy controller_revision_hash:78db775dbb instance:192.168.46.20:80 job:kubernetes-pods k8s_app:kube-proxy pod_controller_kind:DaemonSet pod_controller_name:kube-proxy pod_name:kube-proxy-fmvsp pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883833.627715,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883833627,"target_labels":"map[Namespace:otelcol app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55680 job:kubernetes-pods pod_controller_kind:ReplicaSet pod_controller_name:otel-collector-869d4bc96 pod_name:otel-collector-869d4bc96-wpwg5 pod_phase:Running pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883834.0633473,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883834062,"target_labels":"map[Namespace:default Service:kubernetes instance:192.168.66.101:443 job:kubernetes-service-endpoints]"}
{"level":"warn","ts":1604883834.3903532,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883834390,"target_labels":"map[Namespace:otelcol app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55679 job:kubernetes-pods pod_controller_kind:ReplicaSet pod_controller_name:otel-collector-869d4bc96 pod_name:otel-collector-869d4bc96-wpwg5 pod_phase:Running pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883834.6473951,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883834646,"target_labels":"map[Namespace:default Service:kubernetes instance:192.168.106.135:443 job:kubernetes-service-endpoints]"}
{"level":"warn","ts":1604883835.5767124,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883835576,"target_labels":"map[Namespace:kube-system container_name:aws-vpc-cni-init controller_revision_hash:858b677c56 instance:192.168.27.201:80 job:kubernetes-pods k8s_app:aws-node pod_controller_kind:DaemonSet pod_controller_name:aws-node pod_name:aws-node-fmpbw pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883835.9117277,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883833910,"target_labels":"map[Namespace:kube-system container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.40.203:53 job:kubernetes-pods k8s_app:kube-dns pod_controller_kind:ReplicaSet pod_controller_name:coredns-75b44cb5b4 pod_name:coredns-75b44cb5b4-xf7c5 pod_phase:Running pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883836.684964,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883836684,"target_labels":"map[Namespace:kube-system container_name:kube-proxy controller_revision_hash:78db775dbb instance:192.168.27.201:80 job:kubernetes-pods k8s_app:kube-proxy pod_controller_kind:DaemonSet pod_controller_name:kube-proxy pod_name:kube-proxy-jcbmh pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883837.334819,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883837333,"target_labels":"map[Namespace:kube-system container_name:aws-vpc-cni-init controller_revision_hash:858b677c56 instance:192.168.46.20:80 job:kubernetes-pods k8s_app:aws-node pod_controller_kind:DaemonSet pod_controller_name:aws-node pod_name:aws-node-cgvkj pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883837.5156825,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883835514,"target_labels":"map[Namespace:kube-system Service:kube-dns container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.40.203:53 job:kubernetes-service-endpoints k8s_app:kube-dns kubernetes_node:ip-192-168-46-20.ec2.internal pod_name:coredns-75b44cb5b4-xf7c5 pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883838.071217,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883836069,"target_labels":"map[Namespace:kube-system Service:kube-dns container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.52.57:53 job:kubernetes-service-endpoints k8s_app:kube-dns kubernetes_node:ip-192-168-46-20.ec2.internal pod_name:coredns-75b44cb5b4-62hcq pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883839.06738,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883837065,"target_labels":"map[Namespace:kube-system container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.52.57:53 job:kubernetes-pods k8s_app:kube-dns pod_controller_kind:ReplicaSet pod_controller_name:coredns-75b44cb5b4 pod_name:coredns-75b44cb5b4-62hcq pod_phase:Running pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883841.9365342,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883841936,"target_labels":"map[Namespace:otelcol Service:otel-collector app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55680 job:kubernetes-service-endpoints kubernetes_node:ip-192-168-27-201.ec2.internal pod_name:otel-collector-869d4bc96-wpwg5 pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883842.343505,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883842343,"target_labels":"map[Namespace:otelcol Service:otel-collector app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55679 job:kubernetes-service-endpoints kubernetes_node:ip-192-168-27-201.ec2.internal pod_name:otel-collector-869d4bc96-wpwg5 pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883843.406043,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883843405,"target_labels":"map[Namespace:kube-system container_name:kube-proxy controller_revision_hash:78db775dbb instance:192.168.46.20:80 job:kubernetes-pods k8s_app:kube-proxy pod_controller_kind:DaemonSet pod_controller_name:kube-proxy pod_name:kube-proxy-fmvsp pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883843.627718,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883843627,"target_labels":"map[Namespace:otelcol app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55680 job:kubernetes-pods pod_controller_kind:ReplicaSet pod_controller_name:otel-collector-869d4bc96 pod_name:otel-collector-869d4bc96-wpwg5 pod_phase:Running pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883844.0630987,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883844062,"target_labels":"map[Namespace:default Service:kubernetes instance:192.168.66.101:443 job:kubernetes-service-endpoints]"}
{"level":"warn","ts":1604883844.390317,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883844389,"target_labels":"map[Namespace:otelcol app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55679 job:kubernetes-pods pod_controller_kind:ReplicaSet pod_controller_name:otel-collector-869d4bc96 pod_name:otel-collector-869d4bc96-wpwg5 pod_phase:Running pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883844.6474733,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883844646,"target_labels":"map[Namespace:default Service:kubernetes instance:192.168.106.135:443 job:kubernetes-service-endpoints]"}
{"level":"warn","ts":1604883845.5766456,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883845576,"target_labels":"map[Namespace:kube-system container_name:aws-vpc-cni-init controller_revision_hash:858b677c56 instance:192.168.27.201:80 job:kubernetes-pods k8s_app:aws-node pod_controller_kind:DaemonSet pod_controller_name:aws-node pod_name:aws-node-fmpbw pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883845.9117131,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883843910,"target_labels":"map[Namespace:kube-system container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.40.203:53 job:kubernetes-pods k8s_app:kube-dns pod_controller_kind:ReplicaSet pod_controller_name:coredns-75b44cb5b4 pod_name:coredns-75b44cb5b4-xf7c5 pod_phase:Running pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883846.6849985,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883846684,"target_labels":"map[Namespace:kube-system container_name:kube-proxy controller_revision_hash:78db775dbb instance:192.168.27.201:80 job:kubernetes-pods k8s_app:kube-proxy pod_controller_kind:DaemonSet pod_controller_name:kube-proxy pod_name:kube-proxy-jcbmh pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883847.3347833,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883847333,"target_labels":"map[Namespace:kube-system container_name:aws-vpc-cni-init controller_revision_hash:858b677c56 instance:192.168.46.20:80 job:kubernetes-pods k8s_app:aws-node pod_controller_kind:DaemonSet pod_controller_name:aws-node pod_name:aws-node-cgvkj pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883847.5157857,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883845514,"target_labels":"map[Namespace:kube-system Service:kube-dns container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.40.203:53 job:kubernetes-service-endpoints k8s_app:kube-dns kubernetes_node:ip-192-168-46-20.ec2.internal pod_name:coredns-75b44cb5b4-xf7c5 pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883848.071159,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883846069,"target_labels":"map[Namespace:kube-system Service:kube-dns container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.52.57:53 job:kubernetes-service-endpoints k8s_app:kube-dns kubernetes_node:ip-192-168-46-20.ec2.internal pod_name:coredns-75b44cb5b4-62hcq pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883849.0673904,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883847065,"target_labels":"map[Namespace:kube-system container_name:coredns eks_amazonaws_com_component:coredns instance:192.168.52.57:53 job:kubernetes-pods k8s_app:kube-dns pod_controller_kind:ReplicaSet pod_controller_name:coredns-75b44cb5b4 pod_name:coredns-75b44cb5b4-62hcq pod_phase:Running pod_template_hash:75b44cb5b4]"}
{"level":"warn","ts":1604883851.936492,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883851936,"target_labels":"map[Namespace:otelcol Service:otel-collector app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55680 job:kubernetes-service-endpoints kubernetes_node:ip-192-168-27-201.ec2.internal pod_name:otel-collector-869d4bc96-wpwg5 pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883852.343658,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883852343,"target_labels":"map[Namespace:otelcol Service:otel-collector app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55679 job:kubernetes-service-endpoints kubernetes_node:ip-192-168-27-201.ec2.internal pod_name:otel-collector-869d4bc96-wpwg5 pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883853.4059339,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883853405,"target_labels":"map[Namespace:kube-system container_name:kube-proxy controller_revision_hash:78db775dbb instance:192.168.46.20:80 job:kubernetes-pods k8s_app:kube-proxy pod_controller_kind:DaemonSet pod_controller_name:kube-proxy pod_name:kube-proxy-fmvsp pod_phase:Running pod_template_generation:1]"}
{"level":"warn","ts":1604883853.6276472,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883853627,"target_labels":"map[Namespace:otelcol app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55680 job:kubernetes-pods pod_controller_kind:ReplicaSet pod_controller_name:otel-collector-869d4bc96 pod_name:otel-collector-869d4bc96-wpwg5 pod_phase:Running pod_template_hash:869d4bc96]"}
{"level":"warn","ts":1604883854.0631766,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883854062,"target_labels":"map[Namespace:default Service:kubernetes instance:192.168.66.101:443 job:kubernetes-service-endpoints]"}
{"level":"warn","ts":1604883854.390329,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1604883854389,"target_labels":"map[Namespace:otelcol app:opentelemetry component:otel-collector container_name:otel-collector instance:192.168.29.219:55679 job:kubernetes-pods pod_controller_kind:ReplicaSet pod_controller_name:otel-collector-869d4bc96 pod_name:otel-collector-869d4bc96-wpwg5 pod_phase:Running pod_template_hash:869d4bc96]"}

default/kubernetes, the kube-system namespace, and the collector itself are not scrapeable.

The API server endpoint needs to be enabled for private access: https://docs.aws.amazon.com/eks/latest/userguide/cluster-endpoint.html#cluster-endpoint-access-console. The other resources need authorization.

@andrewhsu
Member

andrewhsu commented Nov 11, 2020

We talked about this at the Collector SIG meeting today and are proposing to raise the priority to P1, since this issue is tightly coupled with metrics GA @bogdandrutu @tigrannajaryan

@alolita can join the Friday triage meeting to discuss, if helpful

@hdj630

hdj630 commented Nov 11, 2020

I work with @kohrapha on code analysis and testing, and we found the likely root cause:

Problem:
When Prometheus targets get removed (instead of getting started), there is a race condition that causes a deadlock on the mutex "ScrapePool::mtx".

How does the deadlock happen?
When a bunch of Prometheus targets get removed:

  1. ScrapePool::Sync gets called, which locks ScrapePool::mtx and in turn calls ScrapePool::sync.
  2. Inside ScrapePool::sync, the mutex is still held while it waits for the ScrapeLoops of all removed targets to exit.
  3. However, there is a chance that some ScrapeLoops cannot exit because they also try to lock ScrapePool::mtx, which causes the deadlock.
    a. Why does a ScrapeLoop try to lock ScrapePool::mtx? Because there is a synchronous call inside transaction::Add that eventually reaches ScrapeManager::TargetsAll (mentioned by @JasonXZLiu above), which in turn calls ScrapePool::ActiveTargets, which also tries to lock ScrapePool::mtx.

We've added logging and confirmed that this is the scenario causing the deadlock.

Solutions:

  1. [Quick] There is an enhancement in Prometheus upstream, committed 16 days ago, which uses a fine-grained lock inside ScrapePool and seems able to solve this deadlock; a simplified sketch of the idea is shown below.
  2. [Alternative] To prevent this kind of deadlock from happening again, it would be better to call ScrapeManager::TargetsAll asynchronously in the PrometheusReceiver.

@kohrapha is working on verifying fix 1 above and will send out the PR soon.
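
A simplified sketch of the fine-grained-lock idea from solution 1 (illustrative only, against a stripped-down pool; not the actual upstream Prometheus patch): keep the pool's sync mutex, but give the target map its own mutex so that readers such as ActiveTargets no longer wait on a long-running sync.

package main

import "sync"

// scrapePool is a stripped-down stand-in: mtx serializes Sync/reload as
// before, while the separate targetMtx guards only the target map.
type scrapePool struct {
	mtx       sync.Mutex
	targetMtx sync.Mutex
	active    map[string]struct{}
}

// ActiveTargets only needs targetMtx, so it cannot deadlock against a Sync
// that is holding mtx while waiting for old scrape loops to exit.
func (sp *scrapePool) ActiveTargets() []string {
	sp.targetMtx.Lock()
	defer sp.targetMtx.Unlock()
	out := make([]string, 0, len(sp.active))
	for t := range sp.active {
		out = append(out, t)
	}
	return out
}

func (sp *scrapePool) Sync(targets []string) {
	sp.mtx.Lock()
	defer sp.mtx.Unlock()

	// Hold targetMtx only for the brief map update...
	sp.targetMtx.Lock()
	sp.active = make(map[string]struct{}, len(targets))
	for _, t := range targets {
		sp.active[t] = struct{}{}
	}
	sp.targetMtx.Unlock()

	// ...so waiting here for removed scrape loops to exit does not block
	// their calls to ActiveTargets.
}

func main() {
	sp := &scrapePool{active: map[string]struct{}{}}
	sp.Sync([]string{"10.1.7.143:1234"})
	_ = sp.ActiveTargets()
}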

@hdj630

hdj630 commented Nov 12, 2020

The PR with the fix has been merged, and it works in our EKS test environment. @oktocat can probably verify whether the issue has been resolved.

@oktocat
Author

oktocat commented Nov 13, 2020

@hdj630 give me a few days to test and verify 👍

bogdandrutu pushed a commit that referenced this issue Nov 13, 2020
Fix the scraper/discover manager coordination on the Prometheus receiver (#2089)

* Fix the scraper/discover manager coordination on the Prometheus receiver

The receiver contains various unnecessary sections. Rewriting the
receiver's Start for better maintainability.

Related to #1909.

* Use the background context

* Remove dead code
@bogdandrutu
Member

This should be fixed

@johanbrandhorst

When is the next release scheduled? It would be nice to have a new one that includes this fix.

@oktocat
Author

oktocat commented Nov 18, 2020

FWIW, we're not observing the deadlocks with otelcol built from master including #2121

@hossain-rayhan
Contributor

I think I am still experiencing the same issue. It's failing constantly on Amazon EKS.

2021-02-19T17:37:35.839Z WARN internal/metricsbuilder.go:104 Failed to scrape Prometheus endpoint {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1613756255838, "target_labels": "map[instance:ip-192-168-24-101.us-east-2.compute.internal job:kubernetes-cadvisor]"}
go.opentelemetry.io/collector/receiver/prometheusreceiver/internal.(*metricBuilder).AddDataPoint
go.opentelemetry.io/[email protected]/receiver/prometheusreceiver/internal/metricsbuilder.go:104
go.opentelemetry.io/collector/receiver/prometheusreceiver/internal.(*transaction).Add
go.opentelemetry.io/[email protected]/receiver/prometheusreceiver/internal/transaction.go:115
github.com/prometheus/prometheus/scrape.(*timeLimitAppender).Add
github.com/prometheus/[email protected]/scrape/target.go:328
github.com/prometheus/prometheus/scrape.(*limitAppender).Add
github.com/prometheus/[email protected]/scrape/target.go:299
github.com/prometheus/prometheus/scrape.(*scrapeLoop).addReportSample
github.com/prometheus/[email protected]/scrape/scrape.go:1522
github.com/prometheus/prometheus/scrape.(*scrapeLoop).report
github.com/prometheus/[email protected]/scrape/scrape.go:1454
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func2
github.com/prometheus/[email protected]/scrape/scrape.go:1090
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
github.com/prometheus/[email protected]/scrape/scrape.go:1150
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/[email protected]/scrape/scrape.go:1036

@rakyll
Contributor

rakyll commented Feb 22, 2021

@bogdandrutu Can we reopen this?

@hossain-rayhan
Contributor

I found the cause of my failure case while scraping metrics from the metrics/cadvisor endpoint.

I enabled --log-level=DEBUG and it gave me the following insight: I was getting 403 Forbidden.

2021-02-24T19:18:01.859Z	warn	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1614194281848, "target_labels": "map[alpha_eksctl_io_cluster_name:eks-test-1 alpha_eksctl_io_instance_id:i-09892500d4bf9388b alpha_eksctl_io_nodegroup_name:ng-1-workers beta_kubernetes_io_arch:amd64 beta_kubernetes_io_instance_type:m5.xlarge beta_kubernetes_io_os:linux failure_domain_beta_kubernetes_io_region:us-east-2 failure_domain_beta_kubernetes_io_zone:us-east-2a instance:ip-192-168-173-241.us-east-2.compute.internal job:kubernetes-cadvisor kubernetes_io_arch:amd64 kubernetes_io_hostname:ip-192-168-173-241.us-east-2.compute.internal kubernetes_io_os:linux node_kubernetes_io_instance_type:m5.xlarge node_lifecycle:on-demand role:workers topology_kubernetes_io_region:us-east-2 topology_kubernetes_io_zone:us-east-2a]"}
2021-02-24T19:18:02.105Z	debug	scrape/scrape.go:1124	Scrape failed	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "kubernetes-cadvisor", "target": "https://192.168.125.115:10250/metrics/cadvisor", "err": "server returned HTTP status 403 Forbidden", "errVerbose": "server returned HTTP status 403 Forbidden\ngithub.com/prometheus/prometheus/scrape.(*targetScraper).scrape\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:641\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1112\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1036\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"}

Then I had to add permission for nodes/metrics in my ClusterRole, and finally it worked.

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role
rules:
  - apiGroups: [""]
    resources:
    - nodes
    - nodes/proxy
    - nodes/metrics
    - services
    - endpoints
    - pods
    verbs: ["get", "list", "watch"]

@vishalsaugat

vishalsaugat commented Mar 7, 2021

I am still facing this issue, even after adding nodes/metrics to the ClusterRole.


2021-03-07T21:31:01.625Z	WARN	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1615152661620, "target_labels": "map[alpha_eksctl_io_cluster_name:c _nodegroup_name:t3-s beta_kubernetes_io_arch:amd64 beta_kubernetes_io_instance_type:t3.small beta_kubernetes_io_os:linux eks_amazonaws_com_capacityType:ON_DEMAND eks_amazonaws_com_nodegroup:t3-small-nodegroup eks_amazonaws_com_nodegroup_image:ami-xx eks_amazonaws_com_sourceLaunchTemplateId:lt-x eks_amazonaws_com_sourceLaunchTemplateVersion:1 failure_domain_beta_kubernetes_io_region:us-east-2 failure_domain_beta_kubernetes_io_zone:us-east-2b instance:ip-xx-yy-zz.us-east-2.compute.internal job:kubernetes-nodes kubernetes_io_arch:amd64 kubernetes_io_hostname:ip-xx-yy-zz-235.us-east-2.compute.internal kubernetes_io_os:linux node_kubernetes_io_instance_type:t3.small topology_kubernetes_io_region:us-east-2 topology_kubernetes_io_zone:us-east-2b]"}

2021-03-07T21:31:03.364Z	WARN	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1615152663358, "target_labels": "map[instance:adot-collector.adot-col.svc:8888 job:kubernetes-service]"}

Can we reopen the issue?

@gizas

gizas commented Aug 9, 2022

Testing the same with the config below,

      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 10s
            static_configs:
            - targets: ['0.0.0.0:8888']
          - job_name: 'node'
            scrape_interval: 10s
            static_configs:
            - targets: ['0.0.0.0:9100']

using otel/opentelemetry-collector-contrib:latest, even after adding nodes/metrics to the ClusterRole.

Error:

2022-08-09T15:28:04.525Z        warn    internal/otlp_metricsbuilder.go:164     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "pipeline": "metrics", "scrape_timestamp": 1660058884524, "target_labels": "map[__name__:up instance:0.0.0.0:9100 job:node]"}

@ianrodrigues

For those using EKS with Terraform and templatefile to set exporters.awsemf.region dynamically: change $${1} and $${2} to $$${1} and $$${2} respectively, as templatefile tries to evaluate them.

hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this issue Apr 27, 2023
Bump go.uber.org/zap from 1.21.0 to 1.23.0 (open-telemetry#1909)

Bumps [go.uber.org/zap](https://github.com/uber-go/zap) from 1.21.0 to 1.23.0.
- [Release notes](https://github.com/uber-go/zap/releases)
- [Changelog](https://github.com/uber-go/zap/blob/master/CHANGELOG.md)
- [Commits](uber-go/zap@v1.21.0...v1.23.0)

---
updated-dependencies:
- dependency-name: go.uber.org/zap
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Troels51 pushed a commit to Troels51/opentelemetry-collector that referenced this issue Jul 5, 2024