Improve error message for prometheus scrape failure #2364
Comments
Hello, I'd like to pick this up as my first contribution to OpenTelemetry 🙂 Though I need some initial guidance; a quick search for |
Facing a similar issue. I have no idea what's going wrong. In my case, it works once when I scrape the
Here is the full kubernetes_sd_config:

kubernetes_sd_configs:
- role: node
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: false
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# Only for Kubernetes ^1.7.3.
# See: https://github.com/prometheus/prometheus/issues/2916
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
metric_relabel_configs:
- action: replace
source_labels: [id]
regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
target_label: rkt_container_name
replacement: '${2}-${1}'
- action: replace
source_labels: [id]
regex: '^/system\.slice/(.+)\.service$'
target_label: systemd_service_name
replacement: '${1}' |
Can you try running with the command line flag? When I do that with an intentionally misconfigured collector, I see messages like:
The debug log shows me that my problem was a 404, which is correct for me since I put in an incorrect metrics_path in my prometheus config. |
@dashpole thanks, that gives more insight. I am getting
|
I resolved my issue. I was missing permission for One thing: maybe we can upgrade the level from |
The line in question in scrape/scrape.go is in the Prometheus server, so without forking it we can't modify the level. The same question was raised in upstream Prometheus (prometheus/prometheus#2820 (comment)), and it wasn't deemed appropriate to raise the verbosity. |
Thanks for pointing me in the right direction. |
@nilebox I see you've worked on the prometheus receiver logging; would you have any idea how to wait for the subsequent debug log (coming from the prometheus server) and put its |
You may change the code to catch specific errors there (e.g. based on a known error message) and convert them to a warn-level log message in zap. P.S. I'm no longer working on this project, so I won't be able to help you beyond this advice, unfortunately. |
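For illustration, here is a minimal sketch of that advice, assuming the receiver can wrap the zap core it hands to the Prometheus scrape loop. The promoteCore type and the matched "Scrape failed" message are assumptions for the example, not code taken from the receiver:

package main

import (
	"strings"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// promoteCore wraps another zapcore.Core and bumps matching debug entries
// up to warn so they survive the default (info-level) filtering.
type promoteCore struct {
	zapcore.Core
	match string // substring of the debug message to promote, e.g. "Scrape failed"
}

func (c promoteCore) Check(ent zapcore.Entry, ce *zapcore.CheckedEntry) *zapcore.CheckedEntry {
	if ent.Level == zapcore.DebugLevel && strings.Contains(ent.Message, c.match) {
		ent.Level = zapcore.WarnLevel // re-level before the enabled check
	}
	if c.Enabled(ent.Level) {
		return ce.AddCore(ent, c)
	}
	return ce
}

func (c promoteCore) With(fields []zapcore.Field) zapcore.Core {
	return promoteCore{Core: c.Core.With(fields), match: c.match}
}

func main() {
	base, _ := zap.NewProduction() // info-level by default, so plain debug entries stay hidden
	logger := zap.New(promoteCore{Core: base.Core(), match: "Scrape failed"})

	// Normally invisible at the default level; surfaced as warn by the wrapper.
	logger.Debug("Scrape failed", zap.String("err", "connection refused"))
	// Unrelated debug entries are still dropped.
	logger.Debug("some other debug detail")
}

Only entries whose message matches are re-leveled, so unrelated debug noise stays hidden at the default level.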
…rn for easy display

This change transforms Prometheus-created .Debug-level errors, such as failed-scrape reasons, into a level that can be displayed to collector users without them having to use --log-level=DEBUG. In 2017, Prometheus PR prometheus/prometheus#3135 added the failure-reason output at .Debug level. With this change, a scrape failure that Prometheus originally logged as:

2021-04-09T22:58:51.732-0700 debug scrape/scrape.go:1127 Scrape failed {"kind": "receiver", "name": "prometheus", "scrape_pool": "otel-collector", "target": "http://0.0.0.0:9999/metrics", "err": "Get \"http://0.0.0.0:9999/metrics\": dial tcp 0.0.0.0:9999: connect: connection refused"}

will now be transformed to:

2021-04-09T23:24:41.733-0700 warn internal/metricsbuilder.go:104 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "scrape_timestamp": 1618035881732, "target_labels": "map[instance:0.0.0.0:9999 job:otel-collector]"}

which will be surfaced to users. Fixes open-telemetry#2364
I've mailed out a fix for it in #2906 |
AFAIK logging detailed information at debug level is consistent with the rest of the collector, so I'm not sure this change is a good idea. Could be wrong though. @bogdandrutu @tigrannajaryan could you clarify the best practices for logging in the collector, and whether it makes sense to use a different default than Prometheus here? |
Yeah, from discussions in the Prometheus Working Group and elsewhere, we came to the conclusion that we shouldn't do this. @alolita @Aneurysm9 could you please help mark this as "Won't Fix" so we can close it? Thank you. |
This error also appears to be emitted when the endpoint gets scraped successfully but does not emit a metric named up. |
Is there any solution to this? I have tried filtering with params on the scrape request to the metrics endpoint and filtering by ignoring the up metric, but nothing has solved the issue. Does anyone know what can be done to fix this? I don't know if it's relevant, but I am attempting to scrape an EventStoreDB node. My attempted configs:

receivers:
prometheus:
config:
scrape_configs:
- job_name: 'my-job'
static_configs:
- targets: ['localhost:9090'] # Replace with your actual target
params:
match[]:
        - '{__name__!~"up"}' # Exclude metrics with the name "up"

receivers:
prometheus:
config:
scrape_configs:
- job_name: 'my-job'
static_configs:
- targets: ['localhost:9090'] # Replace with your actual target
metric_relabel_configs:
- source_labels: [__name__]
regex: 'up'
action: drop |
If a prometheus receiver fails to scrape an endpoint, the output shows something like:
The reason for the failure isn't presented here, and it's almost impossible to figure out what is wrong.
I would suggest enriching the warn message with an actual reason.
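A rough sketch of what that enrichment could look like, assuming the receiver still has the scrape error and target labels available where it emits the warning; the values and field names below are illustrative stand-ins, not the receiver's actual code:

package main

import (
	"errors"

	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Illustrative stand-ins for what the receiver knows at scrape time.
	scrapeErr := errors.New(`Get "http://0.0.0.0:9999/metrics": dial tcp 0.0.0.0:9999: connect: connection refused`)
	targetLabels := map[string]string{"instance": "0.0.0.0:9999", "job": "otel-collector"}

	// The existing warn only names the target; attaching the error itself
	// (zap.Error) is the enrichment this issue asks for.
	logger.Warn("Failed to scrape Prometheus endpoint",
		zap.Any("target_labels", targetLabels),
		zap.Error(scrapeErr),
	)
}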