
Improve error message for prometheus scrape failure #2364

Closed
yyyogev opened this issue Jan 13, 2021 · 14 comments
Labels
area:receiver, good first issue (Good for newcomers), priority:p3 (Lowest), release:after-ga, wontfix (This will not be worked on)

Comments

@yyyogev

yyyogev commented Jan 13, 2021

If a Prometheus receiver fails to scrape an endpoint, the output shows something like:

2021-01-12T21:11:10.496Z	warn	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1610485870495, "target_labels": "map[instance:aks-agentpool-30099824-vmss000002 job:cadvisor]"}

The reason for the failure isn't presented here, so it's almost impossible to figure out what is wrong.

I would suggest enriching the warn message with the actual reason.

@plazma-prizma

Hello, I'd like to pick this up as my first contribution to OpenTelemetry 🙂 Though I need some initial guidance; a quick search for metricsbuilder in this repository didn't help 😬

@hossain-rayhan
Contributor

I'm facing a similar issue and have no idea what's going wrong. In my case, it works when I scrape the Pods, but it has never worked when I try to scrape /metrics/cadvisor.

WARN internal/metricsbuilder.go:104 Failed to scrape Prometheus endpoint {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1614018163539, "target_labels": "map[alpha_eksctl_io_cluster_name:eks-test-1 alpha_eksctl_io_instance_id:i-0cd2e741621ad97d0 alpha_eksctl_io_nodegroup_name:ng-2-builders beta_kubernetes_io_arch:amd64 beta_kubernetes_io_instance_type:m5.2xlarge beta_kubernetes_io_os:linux failure_domain_beta_kubernetes_io_region:us-east-2 failure_domain_beta_kubernetes_io_zone:us-east-2a instance:ip-192-168-181-249.us-east-2.compute.internal job:kubernetes-cadvisor kubernetes_io_arch:amd64 kubernetes_io_hostname:ip-192-168-181-249.us-east-2.compute.internal kubernetes_io_os:linux node_kubernetes_io_instance_type:m5.2xlarge node_lifecycle:on-demand role:builders topology_kubernetes_io_region:us-east-2 topology_kubernetes_io_zone:us-east-2a]"}

Here is the full kubernetes_sd_config:

kubernetes_sd_configs:
  - role: node
tls_config:
  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  insecure_skip_verify: false
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  # Only for Kubernetes ^1.7.3.
  # See: https://github.com/prometheus/prometheus/issues/2916
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
metric_relabel_configs:
  - action: replace
    source_labels: [id]
    regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
    target_label: rkt_container_name
    replacement: '${2}-${1}'
  - action: replace
    source_labels: [id]
    regex: '^/system\.slice/(.+)\.service$'
    target_label: systemd_service_name
    replacement: '${1}'

@dashpole
Contributor

Can you try running with the command-line flag --log-level=DEBUG?
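For reference, launching the collector from a shell with that flag would look something like this (the binary name and config path below are placeholders, not taken from this thread):

    otelcol --config=/etc/otel-collector/config.yaml --log-level=DEBUG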

When I do that with an intentionally misconfigured collector, I see messages like:

2021-02-24T18:57:16.040Z	warn	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1614193036037, "target_labels": "map[beta_kubernetes_io_arch:amd64 beta_kubernetes_io_instance_type:e2-medium beta_kubernetes_io_os:linux cloud_google_com_gke_boot_disk:pd-standard cloud_google_com_gke_nodepool:default-pool cloud_google_com_gke_os_distribution:cos cloud_google_com_machine_family:e2 failure_domain_beta_kubernetes_io_region:us-central1 failure_domain_beta_kubernetes_io_zone:us-central1-c instance:gke-cluster-1-default-pool-746cd6d9-mziw job:kubernetes-cadvisor kubernetes_io_arch:amd64 kubernetes_io_hostname:gke-cluster-1-default-pool-746cd6d9-mziw kubernetes_io_os:linux node_kubernetes_io_instance_type:e2-medium topology_kubernetes_io_region:us-central1 topology_kubernetes_io_zone:us-central1-c]"}
2021-02-24T18:57:16.478Z	debug	scrape/scrape.go:1124	Scrape failed	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "kubernetes-cadvisor", "target": "https://10.128.15.217:10250/metrics/cadvisor/foo", "err": "server returned HTTP status 404 Not Found", "errVerbose": "server returned HTTP status 404 Not Found\ngithub.com/prometheus/prometheus/scrape.(*targetScraper).scrape\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:641\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1112\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1036\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"}

The debug log shows that my problem was a 404, which makes sense since I had put an incorrect metrics_path in my Prometheus config.

@hossain-rayhan
Contributor

@dashpole thanks, that gives more insight.

I am getting a 403 Forbidden, and I'm not sure which permission I'm missing.

2021-02-24T19:18:01.859Z	warn	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1614194281848, "target_labels": "map[alpha_eksctl_io_cluster_name:eks-test-1 alpha_eksctl_io_instance_id:i-09892500d4bf9388b alpha_eksctl_io_nodegroup_name:ng-1-workers beta_kubernetes_io_arch:amd64 beta_kubernetes_io_instance_type:m5.xlarge beta_kubernetes_io_os:linux failure_domain_beta_kubernetes_io_region:us-east-2 failure_domain_beta_kubernetes_io_zone:us-east-2a instance:ip-192-168-173-241.us-east-2.compute.internal job:kubernetes-cadvisor kubernetes_io_arch:amd64 kubernetes_io_hostname:ip-192-168-173-241.us-east-2.compute.internal kubernetes_io_os:linux node_kubernetes_io_instance_type:m5.xlarge node_lifecycle:on-demand role:workers topology_kubernetes_io_region:us-east-2 topology_kubernetes_io_zone:us-east-2a]"}
2021-02-24T19:18:02.105Z	debug	scrape/scrape.go:1124	Scrape failed	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "kubernetes-cadvisor", "target": "https://192.168.125.115:10250/metrics/cadvisor", "err": "server returned HTTP status 403 Forbidden", "errVerbose": "server returned HTTP status 403 Forbidden\ngithub.com/prometheus/prometheus/scrape.(*targetScraper).scrape\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:641\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1112\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\t/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/scrape/scrape.go:1036\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"}

@hossain-rayhan
Contributor

hossain-rayhan commented Feb 24, 2021

I resolved my issue. I was missing the permission for nodes/metrics. After adding it to my ClusterRole, it worked for me.

One thing: maybe we can raise the level of the detailed failure reason from debug to warn.
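In case it helps anyone hitting the same 403: a minimal sketch of a ClusterRole that includes nodes/metrics might look like the one below. The role name and resource list are only illustrative; adjust them to your own RBAC setup and bind the role to the collector's service account.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-prometheus-scraper   # illustrative name
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics   # the resource that was missing in this case
    verbs: ["get", "list", "watch"]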

@dashpole
Contributor

The line in question, in scrape/scrape.go, is in the Prometheus server, so without forking it we can't modify the level. The same question was raised in upstream Prometheus here: prometheus/prometheus#2820 (comment), and it wasn't deemed appropriate to raise the verbosity.

@glaenen

glaenen commented Mar 2, 2021


I resolved my issue. I was missing the permission for node/metrics. After adding it to my ClusterRole, it worked for me.

One thing: maybe we can raise the level of the detailed failure reason from debug to warn.

Thanks for pointing me in the right direction.
Small correction: the missing permission is "nodes/metrics".

@naseemkullah
Member

@nilebox I see you've worked on the Prometheus receiver logging; would you have any idea how to wait for the subsequent debug log (coming from the Prometheus server) and put its err field into AddDataPoint's warn log?

@nilebox
Member

nilebox commented Mar 7, 2021

any idea how to wait for the subsequent debug log

@naseemkullah in func (w *zapToGokitLogAdapter) Log(keyvals ...interface{}) error, we get key-value pairs from go-kit (used by Prometheus) for each log message, which are parsed and forwarded to the zap logger.

You could change the code to catch specific errors there (e.g. based on a known error message) and convert them to warn-level log messages in zap.
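For illustration only, here is a rough, simplified sketch of that idea. The zapToGokitLogAdapter below is a stand-in for the real adapter (its actual fields differ), and matching on a "msg" key equal to "Scrape failed" is an assumption about the go-kit key/value layout:

package adapter // placeholder package name for this standalone sketch

import "go.uber.org/zap"

// Simplified stand-in for the collector's go-kit -> zap adapter.
type zapToGokitLogAdapter struct {
	logger *zap.Logger
}

func (w *zapToGokitLogAdapter) Log(keyvals ...interface{}) error {
	// go-kit passes flat key/value pairs; collect them so we can inspect "msg",
	// and build the zap fields we will forward either way.
	kv := make(map[string]interface{}, len(keyvals)/2)
	fields := make([]zap.Field, 0, len(keyvals)/2)
	for i := 0; i+1 < len(keyvals); i += 2 {
		key, ok := keyvals[i].(string)
		if !ok {
			continue
		}
		kv[key] = keyvals[i+1]
		fields = append(fields, zap.Any(key, keyvals[i+1]))
	}

	// If this looks like the Prometheus "Scrape failed" message, re-log it
	// (including its "err" field) at warn so the reason is visible without
	// --log-level=DEBUG.
	if msg, ok := kv["msg"].(string); ok && msg == "Scrape failed" {
		w.logger.Warn(msg, fields...)
		return nil
	}

	// Everything else keeps the current behavior: forward at debug level.
	w.logger.Debug("", fields...)
	return nil
}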

P.S. I'm no longer working on this project, so I won't be able to help you beyond this advice, unfortunately.

yyyogev closed this as completed Mar 8, 2021
yyyogev reopened this Mar 8, 2021
odeke-em added a commit to orijtech/opentelemetry-collector that referenced this issue Apr 10, 2021
…rn for easy display

This change transforms Prometheus-created .Debug level errors, such as
failed scrape message reasons, into a level that can be displayed to
collector users without them having to use --log-level=DEBUG.

In 2017, a Prometheus PR (prometheus/prometheus#3135)
added the failure-reason display at .Debug level.

This change ensures that a Prometheus log routed from, say, a scrape
failure that Prometheus originally logged as:

    2021-04-09T22:58:51.732-0700	debug	scrape/scrape.go:1127
    Scrape failed	{"kind": "receiver", "name": "prometheus",
    "scrape_pool": "otel-collector", "target": "http://0.0.0.0:9999/metrics",
    "err": "Get \"http://0.0.0.0:9999/metrics\": dial tcp 0.0.0.0:9999: connect: connection refused"}

will now get transformed to:

    2021-04-09T23:24:41.733-0700	warn	internal/metricsbuilder.go:104
    Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus",
    "scrape_timestamp": 1618035881732, "target_labels": "map[instance:0.0.0.0:9999 job:otel-collector]"}

which will now be surfaced to users.

Fixes open-telemetry#2364
@odeke-em
Member

I've mailed out a fix for it in #2906

@anuraaga
Contributor

anuraaga commented May 8, 2021

AFAIK, logging detailed information at debug level is consistent with the rest of the collector, so I'm not sure this change is a good idea. I could be wrong, though. @bogdandrutu @tigrannajaryan could you clarify any best practices for logging in the collector and whether it makes sense to use a different default than Prometheus here?

@odeke-em
Member

odeke-em commented Aug 7, 2021

Yeah, from discussions in the Prometheus Working Group and elsewhere, we came to the conclusion that we shouldn't do this. @alolita @Aneurysm9 could you please help mark this as "Won't Fix" so we can close it? Thank you.

@ringerc

ringerc commented Nov 15, 2023

This error also appears to be emitted when the endpoint gets scraped successfully, but does not emit a metric named up. Or (possibly) if it's excluded by label filters.

@stoerig

stoerig commented Apr 14, 2024

This error also appears to be emitted when the endpoint gets scraped successfully, but does not emit a metric named up. Or (possibly) if it's excluded by label filters.

Is there any solution to this? I have tried filtering via params on the scrape request to the metrics endpoint, and filtering by ignoring the up metric, but nothing has solved the issue.

Does anyone know what can be done to fix this issue?

Don't know if it's relevant, but I am attempting to scrape an EventStoreDB node.

My attempted configs:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-job'
          static_configs:
            - targets: ['localhost:9090']  # Replace with your actual target
          params:
            match[]:
              - '{__name__!~"up"}'  # Exclude metrics with the name "up"

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-job'
          static_configs:
            - targets: ['localhost:9090']  # Replace with your actual target
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: 'up'
              action: drop
