Prometheus receiver stops scraping all targets when Kubernetes SD change or become unreachable #1909
Comments
We are seeing this issue as well and it's affecting all the workloads that want to export Prometheus metrics on Kubernetes. Having quickly reviewed the discovery and scraping packages from Prometheus, the usage of these packages seems to be as expected, but I quickly noticed some possible issues in the code. For example, we possibly write to an error channel that is already closed. See https://github.com/open-telemetry/opentelemetry-collector/blob/master/receiver/prometheusreceiver/metrics_receiver.go#L70-L103. I wonder if this section needs a thorough restructuring/review. I'm not very familiar with Prometheus' discovery manager and would appreciate some help. |
Is this issue specific to Kubernetes Endpoint objects, as in the config example? |
@nilebox This issue is affecting only Pods and Endpoints. (I don't believe the rollout restart changes the configurations of any other Kubernetes targets). |
@nilebox I don't have a fix for this at the moment and I'm not actively looking into it, FYI. Feel free to grab it if you have context on it. |
The Prometheus receiver is considered high-value by many participants in the OpenTelemetry metrics community. I suspect someone will pick this up soon, and I am happy to coordinate and discuss technical details. One thing also missing from this receiver is the Prometheus |
@alolita ^^^ |
@jmacd We have been working on this, but I'm still trying to figure out what exactly is causing the bug. Would love some help/guidance if you have the time. |
@jmacd we (@JasonXZLiu @alolita) will take a look at this. |
👍 for the issue, we are trying to use this but we are getting bit by the same error. |
Increasing the memory should help alleviate this problem. It may also help to limit the number/types of metrics that are being scraped in the relabel_configs and metric_relabel_configs. |
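As an illustration of the metric_relabel_configs approach mentioned above, here is a minimal sketch that drops a group of metrics by name before they are ingested (the metric name pattern is purely an example, not taken from this issue); rules like this go under the relevant job in the receiver's scrape_configs:
metric_relabel_configs:
  - source_labels: [__name__]      # match on the metric name
    regex: 'container_network_.*'  # example pattern; replace with the metrics you want to drop
    action: drop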
We are seeing the same problem too. |
It doesn't really seem to be related to the resources allocated to the Collector, at least in my case, because the Operator doesn't actually set any resource limits at all. |
@0902horn we have a control loop that greps the logs for the correct message then kills the pod. A horrible hack whilst we're waiting on a fix. |
So we've done some digging, and it looks like we found the main problem:
We need to access the target metadata on the Collector side in order to access the metric labels. We're currently looking into some solutions like adding a non-blocking API on Prometheus' side to access this metadata. |
I've been able to reproduce this issue consistently on EKS, but I wanted to give it a try on minikube to speed up my debugging cycle and I can't reproduce it there. The scraping errors I see consistently come from the kube-system namespace. Not sure if that's consistent with others' observations. |
We've consistently observed it on EKS as well. However, not from |
Anecdotally, I found it harder to trigger on GKE (but still possible) than EKS. |
With #2089, I can observe it in kube-system and default namespaces but only for Kubernetes components, not for jobs I deployed:
default/kubernetes, the kube-system namespace, and the collector itself are not scrapeable. The API server endpoint needs to be enabled for private access: https://docs.aws.amazon.com/eks/latest/userguide/cluster-endpoint.html#cluster-endpoint-access-console. Other resources need authorization. |
Talked about this at the Collector SIG meeting today; proposing to raise the priority to P1 since this issue is tightly coupled with metrics GA. @bogdandrutu @tigrannajaryan @alolita can join the Friday triage meeting to discuss if helpful. |
I work with @kohrapha on code analysis and testing, and we found the potential root cause. Problem:
We've added logging and proved that this is the scenario causing the deadlock. Solutions:
kohrapha@ is working on verifying fix 1 above and will send out the PR soon. |
The fix PR has been merged, and it works in our EKS test environment. @oktocat can probably verify whether the issue has been resolved. |
@hdj630 give me a few days to test and verify 👍 |
This should be fixed |
When is the next release scheduled? Would be nice to have a new one which includes this fix. |
FWIW, we're not observing the deadlocks with otelcol built from master including #2121 |
I think I am still experiencing the same issue. It's failing constantly on Amazon EKS.
2021-02-19T17:37:35.839Z WARN internal/metricsbuilder.go:104 Failed to scrape Prometheus endpoint {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1613756255838, "target_labels": "map[instance:ip-192-168-24-101.us-east-2.compute.internal job:kubernetes-cadvisor]"} |
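For context, a kubernetes-cadvisor job like the one named in that log line is typically configured along these lines (a sketch based on the standard Prometheus example configuration, not necessarily the exact config used above); it scrapes each node's cAdvisor endpoint through the API server proxy, which is why the nodes/proxy and nodes/metrics RBAC rules discussed below matter:
- job_name: 'kubernetes-cadvisor'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node                               # one target per node
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+) # carry node labels over as metric labels
    - target_label: __address__
      replacement: kubernetes.default.svc:443  # route the scrape through the API server
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor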
@bogdandrutu Can we reopen this? |
I found the cause for my failure case while scraping metrics from I enabled the
Then I had to add permission for these resources in a ClusterRole:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"] |
I am still facing this issue. Even after adding
Can we reopen the issue? |
Testing the same with otel/opentelemetry-collector-contrib:latest and the following config, even after adding nodes/metrics in the ClusterRole:
prometheus:
  config:
    scrape_configs:
      - job_name: 'otel-collector'
        scrape_interval: 10s
        static_configs:
          - targets: ['0.0.0.0:8888']
      - job_name: 'node'
        scrape_interval: 10s
        static_configs:
          - targets: ['0.0.0.0:9100']
Error:
|
For those using EKS with Terraform and |
Describe the bug
otel-collector running with the Prometheus receiver configured to scrape Prometheus-compatible endpoints discovered via kubernetes_sd_configs stops scraping when some service discovery endpoints change or become unreachable (which naturally happens during every deployment and subsequent rolling restart). The receiver seems to hit a deadlock somewhere while updating the SD target groups.
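For reference, a minimal sketch of a Prometheus receiver configuration of the kind described here (the job name, interval, and relabel rule are illustrative; the reporter's full config is in the gist linked below):
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'   # illustrative job name
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod                 # pod targets churn on every rolling restart
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true               # only scrape pods annotated prometheus.io/scrape=true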
Steps to reproduce
otel-collector config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
To trigger the issue, it's enough to initiate a rolling restart of one of the target deployments. When this happens, the collector debug logs show the following:
After this all Prometheus receiver scraping stops (or at least the Prometheus exporter endpoint is not updating).
What did you expect to see?
Prometheus receiver gracefully handling some targets becoming unavailable, as well as the changes in service discovery targets.
What did you see instead?
Prometheus receiver scraping stops functioning completely.
What version did you use?
From /debug/servicez:
What config did you use?
Config: (e.g. the yaml config file)
https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
Environment
Go version: go1.14.7
OS: linux
Architecture: amd64
Kubernetes: 1.17 on EKS
Additional context
The issue exists at least in 0.2.7, 0.8.0, 0.10.0 and the latest master.