
Improve error message for prometheus scrape failure #2364

Closed
yyyogev opened this issue Jan 13, 2021 · 14 comments
Labels
area:receiver, good first issue (Good for newcomers), priority:p3 (Lowest), release:after-ga, wontfix (This will not be worked on)

Comments

@yyyogev

yyyogev commented Jan 13, 2021

If a Prometheus receiver fails to scrape an endpoint, the output shows something like:

2021-01-12T21:11:10.496Z	warn	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1610485870495, "target_labels": "map[instance:aks-agentpool-30099824-vmss000002 job:cadvisor]"}

The reason for the failure isn't presented here, so it's almost impossible to figure out what is wrong.

I would suggest enriching the warn message with the actual reason.

@plazma-prizma

Hello, I'd like to pick this up as my first contribution to OpenTelemetry 🙂 Though I need some initial guidance; a quick search for metricsbuilder in this repository didn't help 😬

@hossain-rayhan
Contributor

I'm facing a similar issue and have no idea what's going wrong. In my case, it works when I scrape the Pods, but it has never worked when I try to scrape /metrics/cadvisor.

WARN internal/metricsbuilder.go:104 Failed to scrape Prometheus endpoint {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1614018163539, "target_labels": "map[alpha_eksctl_io_cluster_name:eks-test-1 alpha_eksctl_io_instance_id:i-0cd2e741621ad97d0 alpha_eksctl_io_nodegroup_name:ng-2-builders beta_kubernetes_io_arch:amd64 beta_kubernetes_io_instance_type:m5.2xlarge beta_kubernetes_io_os:linux failure_domain_beta_kubernetes_io_region:us-east-2 failure_domain_beta_kubernetes_io_zone:us-east-2a instance:ip-192-168-181-249.us-east-2.compute.internal job:kubernetes-cadvisor kubernetes_io_arch:amd64 kubernetes_io_hostname:ip-192-168-181-249.us-east-2.compute.internal kubernetes_io_os:linux node_kubernetes_io_instance_type:m5.2xlarge node_lifecycle:on-demand role:builders topology_kubernetes_io_region:us-east-2 topology_kubernetes_io_zone:us-east-2a]"}

Here is the full kubernetes_sd_config:

kubernetes_sd_configs:
  - role: node
tls_config:
  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  insecure_skip_verify: false
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  # Only for Kubernetes ^1.7.3.
  # See: https://github.com/prometheus/prometheus/issues/2916
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
metric_relabel_configs:
  - action: replace
    source_labels: [id]
    regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
    target_label: rkt_container_name
    replacement: '${2}-${1}'
  - action: replace
    source_labels: [id]
    regex: '^/system\.slice/(.+)\.service$'
    target_label: systemd_service_name
    replacement: '${1}'

@dashpole
Contributor

Can you try running with the command-line flag --log-level=DEBUG?
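For reference, launching the collector from a shell with that flag would look something like this (the binary name and config path below are placeholders, not taken from this thread):

    otelcol --config=/etc/otel-collector/config.yaml --log-level=DEBUG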

When I do that with an intentionally misconfigured collector, I see messages like:

2021-02-24T18:57:16.040Z	warn	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1614193036037, "target_labels": "map[beta_kubernetes_io_arch:amd64 beta_kubernetes_io_instance_type:e2-medium beta_kubernetes_io_os:linux cloud_google_com_gke_boot_disk:pd-standard cloud_google_com_gke_nodepool:default-pool cloud_google_com_gke_os_distribution:cos cloud_google_com_machine_family:e2 failure_domain_beta_kubernetes_io_region:us-central1 failure_domain_beta_kubernetes_io_zone:us-central1-c instance:gke-cluster-1-default-pool-746cd6d9-mziw job:kubernetes-cadvisor kubernetes_io_arch:amd64 kubernetes_io_hostname:gke-cluster-1-default-pool-746cd6d9-mziw kubernetes_io_os:linux node_kubernetes_io_instance_type:e2-medium topology_kubernetes_io_region:us-central1 topology_kubernetes_io_zone:us-central1-c]"}
2021-02-24T18:57:16.478Z	debug	scrape/scrape.go:1124	Scrape failed	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "kubernetes-cadvisor", "target": "https://10.128.15.217:10250/metrics/cadvisor/foo", "err": "server returned HTTP status 404 Not Found", "errVerbose": "server returned HTTP status 404 Not Found\ngithub.com/prometheus/prometheus/scrape.(*targetScraper).scrape\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:641\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1112\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1036\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"}

The debug log shows that my problem was a 404, which makes sense since I had put an incorrect metrics_path in my Prometheus config.

@hossain-rayhan
Contributor

@dashpole thanks, that gives more insight.

I am getting a 403 Forbidden, and I'm not sure which permission I'm missing.

2021-02-24T19:18:01.859Z	warn	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1614194281848, "target_labels": "map[alpha_eksctl_io_cluster_name:eks-test-1 alpha_eksctl_io_instance_id:i-09892500d4bf9388b alpha_eksctl_io_nodegroup_name:ng-1-workers beta_kubernetes_io_arch:amd64 beta_kubernetes_io_instance_type:m5.xlarge beta_kubernetes_io_os:linux failure_domain_beta_kubernetes_io_region:us-east-2 failure_domain_beta_kubernetes_io_zone:us-east-2a instance:ip-192-168-173-241.us-east-2.compute.internal job:kubernetes-cadvisor kubernetes_io_arch:amd64 kubernetes_io_hostname:ip-192-168-173-241.us-east-2.compute.internal kubernetes_io_os:linux node_kubernetes_io_instance_type:m5.xlarge node_lifecycle:on-demand role:workers topology_kubernetes_io_region:us-east-2 topology_kubernetes_io_zone:us-east-2a]"}
2021-02-24T19:18:02.105Z	debug	scrape/scrape.go:1124	Scrape failed	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "kubernetes-cadvisor", "target": "https://192.168.125.115:10250/metrics/cadvisor", "err": "server returned HTTP status 403 Forbidden", "errVerbose": "server returned HTTP status 403 Forbidden\ngithub.com/prometheus/prometheus/scrape.(*targetScraper).scrape\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:641\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1112\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1036\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"}

@hossain-rayhan
Contributor

hossain-rayhan commented Feb 24, 2021

I resolved my issue. I was missing the permission for nodes/metrics. After adding it to my ClusterRole, it worked for me.

One thing: maybe we can raise the level of the detailed failure reason from debug to warn.
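In case it helps anyone hitting the same 403: a minimal sketch of a ClusterRole that includes nodes/metrics might look like the one below. The role name and resource list are only illustrative; adjust them to your own RBAC setup and bind the role to the collector's service account.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-prometheus-scraper   # illustrative name
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics   # the resource that was missing in this case
    verbs: ["get", "list", "watch"]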

@dashpole
Contributor

The line in question, in scrape/scrape.go, is in the Prometheus server, so without forking it we can't modify the level. The same question was raised in upstream Prometheus here: prometheus/prometheus#2820 (comment), and it wasn't deemed appropriate to raise the verbosity.

@glaenen

glaenen commented Mar 2, 2021


I resolved my issue. I was missing the permission for node/metrics. After adding it to my ClusterRole, it worked for me.

One thing: maybe we can raise the level of the detailed failure reason from debug to warn.

Thanks for pointing me in the right direction.
Small correction: the missing permission is "nodes/metrics".

@naseemkullah
Member

@nilebox I see you've worked on the Prometheus receiver logging; would you have any idea how to wait for the subsequent debug log (coming from the Prometheus server) and put its err field into AddDataPoint's warn log?

@nilebox
Member

nilebox commented Mar 7, 2021

any idea how to wait for the subsequent debug log

@naseemkullah in func (w *zapToGokitLogAdapter) Log(keyvals ...interface{}) error, we get key-value pairs from go-kit (used by Prometheus) for each log message, which are parsed and forwarded to the zap logger.

You could change the code to catch specific errors there (e.g. based on a known error message) and convert them to warn-level log messages in zap.
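For illustration only, here is a rough, simplified sketch of that idea. The zapToGokitLogAdapter below is a stand-in for the real adapter (its actual fields differ), and matching on a "msg" key equal to "Scrape failed" is an assumption about the go-kit key/value layout:

package adapter // placeholder package name for this standalone sketch

import "go.uber.org/zap"

// Simplified stand-in for the collector's go-kit -> zap adapter.
type zapToGokitLogAdapter struct {
	logger *zap.Logger
}

func (w *zapToGokitLogAdapter) Log(keyvals ...interface{}) error {
	// go-kit passes flat key/value pairs; collect them so we can inspect "msg",
	// and build the zap fields we will forward either way.
	kv := make(map[string]interface{}, len(keyvals)/2)
	fields := make([]zap.Field, 0, len(keyvals)/2)
	for i := 0; i+1 < len(keyvals); i += 2 {
		key, ok := keyvals[i].(string)
		if !ok {
			continue
		}
		kv[key] = keyvals[i+1]
		fields = append(fields, zap.Any(key, keyvals[i+1]))
	}

	// If this looks like the Prometheus "Scrape failed" message, re-log it
	// (including its "err" field) at warn so the reason is visible without
	// --log-level=DEBUG.
	if msg, ok := kv["msg"].(string); ok && msg == "Scrape failed" {
		w.logger.Warn(msg, fields...)
		return nil
	}

	// Everything else keeps the current behavior: forward at debug level.
	w.logger.Debug("", fields...)
	return nil
}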

P.S. I'm no longer working on this project, so I won't be able to help you beyond this advice, unfortunately.

yyyogev closed this as completed Mar 8, 2021
yyyogev reopened this Mar 8, 2021
odeke-em added a commit to orijtech/opentelemetry-collector that referenced this issue Apr 10, 2021
…rn for easy display

This change transforms Prometheus-created .Debug level errors, such as
failed scrape message reasons, into a level that can be displayed to
collector users without them having to use --log-level=DEBUG.

In 2017, a Prometheus PR (prometheus/prometheus#3135)
added the failure-reason display at .Debug level.

This change ensures that a Prometheus log routed from, say, a scrape
failure that Prometheus originally logged as:

    2021-04-09T22:58:51.732-0700	debug	scrape/scrape.go:1127
    Scrape failed	{"kind": "receiver", "name": "prometheus",
    "scrape_pool": "otel-collector", "target": "http://0.0.0.0:9999/metrics",
    "err": "Get \"http://0.0.0.0:9999/metrics\": dial tcp 0.0.0.0:9999: connect: connection refused"}

will now get transformed to:

    2021-04-09T23:24:41.733-0700	warn	internal/metricsbuilder.go:104
    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus",
    "scrape_timestamp": 1618035881732, "target_labels": "map[instance:0.0.0.0:9999 job:otel-collector]"}

which will now be surfaced to users.

Fixes open-telemetry#2364
@odeke-em
Member

I've mailed out a fix for it in #2906

@anuraaga
Contributor

anuraaga commented May 8, 2021

AFAIK, logging detailed information at debug level is consistent with the rest of the collector, so I'm not sure this change is a good idea. I could be wrong, though. @bogdandrutu @tigrannajaryan could you clarify any best practices for logging in the collector and whether it makes sense to use a different default than Prometheus here?

@odeke-em
Member

odeke-em commented Aug 7, 2021

Yeah, from discussions in the Prometheus Working Group and elsewhere, we came to the conclusion that we shouldn't do this. @alolita @Aneurysm9 could you please help mark this as "Won't Fix" so we can close it? Thank you.

@ringerc

ringerc commented Nov 15, 2023

This error also appears to be emitted when the endpoint gets scraped successfully, but does not emit a metric named up. Or (possibly) if it's excluded by label filters.

@stoerig

stoerig commented Apr 14, 2024

This error also appears to be emitted when the endpoint gets scraped successfully, but does not emit a metric named up. Or (possibly) if it's excluded by label filters.

Is there any solution to this? I have tried filtering via params on the scrape request to the metrics endpoint, and filtering by ignoring the up metric, but nothing has solved the issue.

Does anyone know what can be done to fix this issue?

Don't know if it's relevant, but I am attempting to scrape an EventStoreDB node.

My attempted configs:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-job'
          static_configs:
            - targets: ['localhost:9090']  # Replace with your actual target
          params:
            match[]:
              - '{__name__!~"up"}'  # Exclude metrics with the name "up"

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-job'
          static_configs:
            - targets: ['localhost:9090']  # Replace with your actual target
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: 'up'
              action: drop
