receiver/prometheus: propagate Prometheus.Debug error values into .Warn for easy display #2906

Conversation

odeke-em
Member

This change transforms Prometheus-created .Debug level errors, such as
failed scrape reasons, into a level that can be displayed to
collector users without them having to use --log-level=DEBUG.

In 2017, Prometheus PR prometheus/prometheus#3135
added these failure-reason messages at .Debug level.

This change ensures that a Prometheus log routed from, say, a scrape
failure that Prometheus originally logged as:

2021-04-09T22:58:51.732-0700	debug	scrape/scrape.go:1127
Scrape failed	{"kind": "receiver", "name": "prometheus",
"scrape_pool": "otel-collector", "target": "http://0.0.0.0:9999/metrics",
"err": "Get \"http://0.0.0.0:9999/metrics\": dial tcp 0.0.0.0:9999: connect: connection refused"}

will now get transformed to:

2021-04-09T23:24:41.733-0700	warn	internal/metricsbuilder.go:104
Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus",
"scrape_timestamp": 1618035881732, "target_labels": "map[instance:0.0.0.0:9999 job:otel-collector]"}

which is then surfaced to users.

Fixes #2364
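
For illustration, here is a minimal, self-contained sketch of the promotion mechanism. It is not the PR's actual diff: the adapter name is hypothetical, and it assumes Prometheus's go-kit style Log(keyvals ...interface{}) logging interface and the collector's zap logger.

```go
package main

import (
	"fmt"

	"go.uber.org/zap"
)

// gokitStyleLogger (hypothetical name) adapts go-kit style Log(keyvals...)
// calls, the interface Prometheus components log through, onto a zap logger.
type gokitStyleLogger struct {
	zl *zap.SugaredLogger
}

// Log promotes entries that arrive at debug level but carry an "err" key to
// warn, so scrape failures surface without --log-level=DEBUG. Everything
// else stays at debug.
func (g *gokitStyleLogger) Log(keyvals ...interface{}) error {
	isDebug, hasErr := false, false
	msg := ""
	for i := 0; i+1 < len(keyvals); i += 2 {
		switch fmt.Sprint(keyvals[i]) {
		case "level":
			isDebug = fmt.Sprint(keyvals[i+1]) == "debug"
		case "err":
			hasErr = true
		case "msg":
			msg = fmt.Sprint(keyvals[i+1])
		}
	}
	if isDebug && hasErr {
		g.zl.Warnw(msg, keyvals...) // error-carrying debug entry: promote
		return nil
	}
	g.zl.Debugw(msg, keyvals...)
	return nil
}

func main() {
	z, _ := zap.NewProduction() // info level by default: warn shows, debug does not
	defer z.Sync()
	adapter := &gokitStyleLogger{zl: z.Sugar()}
	// Roughly what Prometheus emits for the scrape failure shown above:
	adapter.Log("level", "debug", "msg", "Scrape failed",
		"target", "http://0.0.0.0:9999/metrics",
		"err", "dial tcp 0.0.0.0:9999: connect: connection refused")
}
```

Note that, as discussed in the review below, this promotes every error-carrying debug entry, not only scrape failures.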

@odeke-em odeke-em requested a review from a team April 10, 2021 06:33
@odeke-em odeke-em force-pushed the receiver-prometheus-transform-Prometheus-log.Debug-with-errors-into-level.Warn branch from 0462141 to 85585ce on April 10, 2021 06:33
@odeke-em
Member Author

Kindly cc-ing @bogdandrutu @Aneurysm9 @rakyll @alolita

@codecov

codecov bot commented Apr 10, 2021

Codecov Report

Merging #2906 (b9818ee) into main (a31ad5b) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #2906      +/-   ##
==========================================
+ Coverage   91.67%   91.69%   +0.01%     
==========================================
  Files         287      287              
  Lines       15250    15254       +4     
==========================================
+ Hits        13981    13987       +6     
+ Misses        866      863       -3     
- Partials      403      404       +1     
Impacted Files                                   Coverage Δ
receiver/prometheusreceiver/internal/logger.go   86.79% <100.00%> (+5.15%) ⬆️
testutil/testutil.go                             81.60% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Aneurysm9
Member

I'm not convinced that it is appropriate to rewrite the level from debug to warn. I share the concerns expressed upstream with regard to log volume. I think this approach also results in all debug-level log entries that carry an error becoming warn, which may be more than is intended.

@odeke-em odeke-em force-pushed the receiver-prometheus-transform-Prometheus-log.Debug-with-errors-into-level.Warn branch from 85585ce to b9818ee on April 12, 2021 18:35
@odeke-em
Member Author

Thank you for the response @Aneurysm9!

I'm not convinced that it is appropriate to rewrite the level from debug to warn. I share the concerns expressed upstream with regard to log volume. I think this approach also results in all debug-level log entries that carry an error becoming warn, which may be more than is intended.

I don’t see any other way to inform users of instance-related failures without opening the floodgates with --log-level=DEBUG. Shouldn’t debug-level entries with an error instead become WARN? How might you propose this gets fixed? Should we just tell users that there is nothing we can do? I’d be more inclined to ask Prometheus to make that a WARN, but that’s an uphill task of its own. Please let me know what you think.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions
Contributor

github-actions bot commented May 5, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@alolita
Member

alolita commented May 7, 2021

@anuraaga @open-telemetry/collector-approvers can you please review and unblock this PR?

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 15, 2021
@anuraaga
Contributor

Agree with @Aneurysm9 that we can't really convert debug logs like this, since we don't control what upstream logging does. A more targeted approach of identifying specific messages (e.g., auth failures) could be OK. In general, though, the collector doesn't log much at its default level; when something isn't working, we generally expect users to temporarily enable log-level=debug, fix the issue, and restore the log level.
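
For concreteness, a sketch of that more targeted alternative: promote only an allowlist of known messages. The message strings and function names below are illustrative assumptions, not an agreed-upon list.

```go
package main

import "fmt"

// promotable lists the specific Prometheus debug messages considered worth
// surfacing at warn; entries are illustrative, not exhaustive.
var promotable = map[string]bool{
	"Scrape failed": true,
}

// shouldPromote promotes only known, error-carrying messages instead of
// every debug entry that happens to carry an error.
func shouldPromote(msg string, hasErr bool) bool {
	return hasErr && promotable[msg]
}

func main() {
	fmt.Println(shouldPromote("Scrape failed", true))          // true: promoted to warn
	fmt.Println(shouldPromote("Unrelated debug detail", true)) // false: stays at debug
}
```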

@bogdandrutu
Member

Marking as waiting-for-author so @alolita can see that this is not blocked by reviewers.

@Aneurysm9 @rakyll @anuraaga please review and make a decision.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 27, 2021
@github-actions
Contributor

github-actions bot commented Jun 3, 2021

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jun 3, 2021
@odeke-em odeke-em deleted the receiver-prometheus-transform-Prometheus-log.Debug-with-errors-into-level.Warn branch July 29, 2021 23:26
Linked issue: Improve error message for prometheus scrape failure (#2364)