Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated Prometheus to scrape more MCM metrics #948

Conversation

prashanth26
Copy link

@prashanth26 prashanth26 commented Apr 23, 2019

What this PR does / why we need it:
The following is the list of new metrics that Prometheus scrape's from MCM

  • mcm_machine_controller_frozen
  • mcm_machine_items_total
  • mcm_machineset_items_total
  • mcm_machine_set_failed_machines
  • mcm_machinedeployment_items_total
  • mcm_machine_deployment_failed_machines
  • mcm_cloud_api_requests_total
  • mcm_cloud_api_requests_failed_total
  • mcm_scrape_failure_total

Which issue(s) this PR fixes:
Partially gardener/machine-controller-manager#211

Special notes for your reviewer:
I can see that a couple of MCM metrics are already scraped by the Prometheus. So I have tried to add a few more. This is my first time trying to work with metrics, so please double check that I am doing the right thing.

Kindly take a look at the metrics currently MCM exposes here - https://github.com/gardener/machine-controller-manager/blob/master/pkg/metrics/metrics.go.

Release note:

MCM metrics are exposed to the Prometheus

The following is the list of new metrics that Prometheus scrape's from MCM
  - mcm_machine_controller_frozen
  - mcm_machine_items_total
  - mcm_machineset_items_total
  - mcm_machine_set_failed_machines
  - mcm_machinedeployment_items_total
  - mcm_machine_deployment_failed_machines
  - mcm_cloud_api_requests_total
  - mcm_cloud_api_requests_failed_total
  - mcm_scrape_failure_total
@prashanth26
Copy link
Author

I shall be creating a subsequent PR to display these metrics onto the dashboard.

Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Member

@dkistner dkistner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Contributor

@wyb1 wyb1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for integrating the mcm metrics! One small comment:
mcm_machine_set_failed_machines, mcm_machine_deployment_failed_machines, mcm_cloud_api_requests_failed_total, and mcm_scrape_failure_total should not be missing if there are no failed machines, deployments, etc. They should return 0 if there are no failures. Missing metrics can lead to issues with alerting.

@prashanth26
Copy link
Author

Hi @wyb1 ,

Right now I think they do return blank values. If that's the case, i should perhaps avoid metrics that return blank values for now. And probably integrate them later once i have a new MCM release with corrected values for those?

@prashanth26
Copy link
Author

prashanth26 commented Apr 23, 2019

And also is this an issue only for alerting and not for Prometheus scraping? If that's the case, I can update MCM (with a new release) to return 0 before I integrate this into alerting? WDYT?

@wyb1
Copy link
Contributor

wyb1 commented Apr 23, 2019

@prashanth26 Prometheus won't have an issue scraping the metrics, but it can cause problems with alerting or if you create dashboards (either the query will break or the dashboard will have "gaps" in the data). In general Prometheus best practice is to avoid missing metrics whenever possible. See this blog post.

@dkistner
Copy link
Member

@wyb1 Think you are right. But for now they want to have the ability to query for the whitelisted metrics. As Prashanth has already announced to provide a subsequent PR with a dashboard, then there will be indeed a need to integrate these metrics to have proper dashboards and alerts :) Thank you.

@wyb1
Copy link
Contributor

wyb1 commented Apr 23, 2019

Okay, then we can merge and once any dashboards/alerts are based on these metrics they should no longer be "missing".

@prashanth26
Copy link
Author

prashanth26 commented Apr 23, 2019

@wyb1 Thanks for your feedback. I shall make the suggested changes by you at the MCM before integrating the dashboard/alerting.

In the meanwhile, these currently exposed metrics should help us with watching the cloud provider failures/rate-limits sort of issues, which is the reason for me creating this PR.

@rfranzke rfranzke merged commit b9bea3a into gardener:master Apr 23, 2019
@prashanth26 prashanth26 deleted the enhancement/expose-mcm-metrics-to-prometheus branch April 24, 2019 05:51
ialidzhikov pushed a commit to ialidzhikov/gardener that referenced this pull request Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants