-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated Prometheus to scrape more MCM metrics #948
Updated Prometheus to scrape more MCM metrics #948
Conversation
The following is the list of new metrics that Prometheus scrape's from MCM - mcm_machine_controller_frozen - mcm_machine_items_total - mcm_machineset_items_total - mcm_machine_set_failed_machines - mcm_machinedeployment_items_total - mcm_machine_deployment_failed_machines - mcm_cloud_api_requests_total - mcm_cloud_api_requests_failed_total - mcm_scrape_failure_total
I shall be creating a subsequent PR to display these metrics onto the dashboard. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for integrating the mcm metrics! One small comment:
mcm_machine_set_failed_machines
, mcm_machine_deployment_failed_machines
, mcm_cloud_api_requests_failed_total
, and mcm_scrape_failure_total
should not be missing if there are no failed machines, deployments, etc. They should return 0
if there are no failures. Missing metrics can lead to issues with alerting.
Hi @wyb1 , Right now I think they do return blank values. If that's the case, i should perhaps avoid metrics that return blank values for now. And probably integrate them later once i have a new MCM release with corrected values for those? |
And also is this an issue only for alerting and not for Prometheus scraping? If that's the case, I can update MCM (with a new release) to return 0 before I integrate this into alerting? WDYT? |
@prashanth26 Prometheus won't have an issue scraping the metrics, but it can cause problems with alerting or if you create dashboards (either the query will break or the dashboard will have "gaps" in the data). In general Prometheus best practice is to avoid missing metrics whenever possible. See this blog post. |
@wyb1 Think you are right. But for now they want to have the ability to query for the whitelisted metrics. As Prashanth has already announced to provide a subsequent PR with a dashboard, then there will be indeed a need to integrate these metrics to have proper dashboards and alerts :) Thank you. |
Okay, then we can merge and once any dashboards/alerts are based on these metrics they should no longer be "missing". |
@wyb1 Thanks for your feedback. I shall make the suggested changes by you at the MCM before integrating the dashboard/alerting. In the meanwhile, these currently exposed metrics should help us with watching the cloud provider failures/rate-limits sort of issues, which is the reason for me creating this PR. |
What this PR does / why we need it:
The following is the list of new metrics that Prometheus scrape's from MCM
Which issue(s) this PR fixes:
Partially gardener/machine-controller-manager#211
Special notes for your reviewer:
I can see that a couple of MCM metrics are already scraped by the Prometheus. So I have tried to add a few more. This is my first time trying to work with metrics, so please double check that I am doing the right thing.
Kindly take a look at the metrics currently MCM exposes here - https://github.com/gardener/machine-controller-manager/blob/master/pkg/metrics/metrics.go.
Release note: