Change error handling when scraping metrics #551
@ndhanushkodi, interested in your thoughts on this. (See also the exporter guidelines: https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes)
The behavior looks quite nice to me. Just thinking out loud, could it make sense to have a metric similar to consul_metrics_merging_service_metrics_success for Envoy? But I think what you're doing here makes more sense, so that the user sees the Prometheus scrape fail with a 500/404 from Envoy rather than a successful metrics scrape when something is definitely configured wrong.
I think all of the changes you've listed make sense.
subcommand/consul-sidecar/command.go
envoyMetricsAddr = "http://127.0.0.1:19000/stats/prometheus"
// prometheusServiceMetricsSuccessKey is the key of the prometheus metrics used to
// indicate if service metrics were scraped successfully.
prometheusServiceMetricsSuccessKey = "consul_metrics_merging_service_metrics_success"
prometheusServiceMetricsSuccessKey = "consul_metrics_merging_service_metrics_success"
prometheusServiceMetricsSuccessKey = "consul_merged_service_metrics_success"
This was just the only idea I could come up with to shorten the name of this metric and still be specific, feel free to take it or not.
I think that metric is now effectively covered by us returning a 500 or a 200 depending on what happened. I'm sure there's a way to track that in Prometheus?
Looks good!
Will rebase changelog after approval.
Looks great! Thanks for fixing this!!
* If Envoy returns an error then also respond with a 500 in our merged metrics response so that Prometheus will know that we had an error, not that there are no metrics.
* If the service metrics return with a non-2xx status code then don't include the response body in the merged metrics. This will stop issues where users accidentally turn on metrics merging but they don't have an exporter and so their metrics endpoint returns 404. I could have responded with a 500 in this case in order to indicate that there is an error, however I think it's more likely that users are accidentally turning on metrics merging and the error indication is accomplished via a new metric (see below).
* Append a new metric that indicates the success of the service scraping. This can be used for alerting by users since the response code of the service metrics response is discarded:
  * success: consul_metrics_merging_service_metrics_success 1
  * fail: consul_metrics_merging_service_metrics_success 0
* Modify logging to use key/value pairs
* Fixes #546
Fixes some issues with metrics merging. Namely, if there were any errors getting metrics from Envoy, we'd return a 200 with empty metrics instead of a 500, and if the service returned an error, e.g. 400, we'd just stick the body of its response in our metrics output, which would break Prometheus.