Updated Prometheus to scrape more MCM metrics #948

prashanth26 · 2019-04-23T10:23:58Z

What this PR does / why we need it:
The following is the list of new metrics that Prometheus scrape's from MCM

mcm_machine_controller_frozen
mcm_machine_items_total
mcm_machineset_items_total
mcm_machine_set_failed_machines
mcm_machinedeployment_items_total
mcm_machine_deployment_failed_machines
mcm_cloud_api_requests_total
mcm_cloud_api_requests_failed_total
mcm_scrape_failure_total

Which issue(s) this PR fixes:
Partially gardener/machine-controller-manager#211

Special notes for your reviewer:
I can see that a couple of MCM metrics are already scraped by the Prometheus. So I have tried to add a few more. This is my first time trying to work with metrics, so please double check that I am doing the right thing.

Kindly take a look at the metrics currently MCM exposes here - https://github.com/gardener/machine-controller-manager/blob/master/pkg/metrics/metrics.go.

Release note:

MCM metrics are exposed to the Prometheus

The following is the list of new metrics that Prometheus scrape's from MCM - mcm_machine_controller_frozen - mcm_machine_items_total - mcm_machineset_items_total - mcm_machine_set_failed_machines - mcm_machinedeployment_items_total - mcm_machine_deployment_failed_machines - mcm_cloud_api_requests_total - mcm_cloud_api_requests_failed_total - mcm_scrape_failure_total

prashanth26 · 2019-04-23T10:38:16Z

I shall be creating a subsequent PR to display these metrics onto the dashboard.

rfranzke

/lgtm

dkistner

/lgtm

wyb1

Thanks for integrating the mcm metrics! One small comment:
mcm_machine_set_failed_machines, mcm_machine_deployment_failed_machines, mcm_cloud_api_requests_failed_total, and mcm_scrape_failure_total should not be missing if there are no failed machines, deployments, etc. They should return 0 if there are no failures. Missing metrics can lead to issues with alerting.

prashanth26 · 2019-04-23T10:50:07Z

Hi @wyb1 ,

Right now I think they do return blank values. If that's the case, i should perhaps avoid metrics that return blank values for now. And probably integrate them later once i have a new MCM release with corrected values for those?

prashanth26 · 2019-04-23T11:00:57Z

And also is this an issue only for alerting and not for Prometheus scraping? If that's the case, I can update MCM (with a new release) to return 0 before I integrate this into alerting? WDYT?

wyb1 · 2019-04-23T11:06:30Z

@prashanth26 Prometheus won't have an issue scraping the metrics, but it can cause problems with alerting or if you create dashboards (either the query will break or the dashboard will have "gaps" in the data). In general Prometheus best practice is to avoid missing metrics whenever possible. See this blog post.

dkistner · 2019-04-23T11:10:49Z

@wyb1 Think you are right. But for now they want to have the ability to query for the whitelisted metrics. As Prashanth has already announced to provide a subsequent PR with a dashboard, then there will be indeed a need to integrate these metrics to have proper dashboards and alerts :) Thank you.

wyb1 · 2019-04-23T11:12:28Z

Okay, then we can merge and once any dashboards/alerts are based on these metrics they should no longer be "missing".

prashanth26 · 2019-04-23T11:14:11Z

@wyb1 Thanks for your feedback. I shall make the suggested changes by you at the MCM before integrating the dashboard/alerting.

In the meanwhile, these currently exposed metrics should help us with watching the cloud provider failures/rate-limits sort of issues, which is the reason for me creating this PR.

prashanth26 requested a review from a team as a code owner April 23, 2019 10:23

gardener-robot-ci-1 added reviewed/ok-to-test and removed reviewed/ok-to-test labels Apr 23, 2019

rfranzke approved these changes Apr 23, 2019

View reviewed changes

dkistner approved these changes Apr 23, 2019

View reviewed changes

wyb1 reviewed Apr 23, 2019

View reviewed changes

rfranzke merged commit b9bea3a into gardener:master Apr 23, 2019

prashanth26 deleted the enhancement/expose-mcm-metrics-to-prometheus branch April 24, 2019 05:51

prashanth26 mentioned this pull request Apr 24, 2019

Improve Monitoring/Alerting/Metrics gardener/machine-controller-manager#211

Open

7 tasks

ialidzhikov pushed a commit to ialidzhikov/gardener that referenced this pull request Aug 1, 2024

register dashboard doc component (gardener#948)

3eab68e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated Prometheus to scrape more MCM metrics #948

Updated Prometheus to scrape more MCM metrics #948

prashanth26 commented Apr 23, 2019 •

edited by rfranzke

Loading

prashanth26 commented Apr 23, 2019

rfranzke left a comment

dkistner left a comment

wyb1 left a comment

prashanth26 commented Apr 23, 2019

prashanth26 commented Apr 23, 2019 •

edited

Loading

wyb1 commented Apr 23, 2019

dkistner commented Apr 23, 2019

wyb1 commented Apr 23, 2019

prashanth26 commented Apr 23, 2019 •

edited

Loading

Updated Prometheus to scrape more MCM metrics #948

Updated Prometheus to scrape more MCM metrics #948

Conversation

prashanth26 commented Apr 23, 2019 • edited by rfranzke Loading

prashanth26 commented Apr 23, 2019

rfranzke left a comment

Choose a reason for hiding this comment

dkistner left a comment

Choose a reason for hiding this comment

wyb1 left a comment

Choose a reason for hiding this comment

prashanth26 commented Apr 23, 2019

prashanth26 commented Apr 23, 2019 • edited Loading

wyb1 commented Apr 23, 2019

dkistner commented Apr 23, 2019

wyb1 commented Apr 23, 2019

prashanth26 commented Apr 23, 2019 • edited Loading

prashanth26 commented Apr 23, 2019 •

edited by rfranzke

Loading

prashanth26 commented Apr 23, 2019 •

edited

Loading

prashanth26 commented Apr 23, 2019 •

edited

Loading