Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metric to track Unhealthy Stale machine removal count #808

Merged
merged 6 commits into from
Jun 6, 2023

Conversation

jguipi
Copy link
Contributor

@jguipi jguipi commented Apr 17, 2023

What this PR does / why we need it:
There are two kinds of stale machine which machineSet controller collects

  • stale due to unhealthiness
  • stale due to creation timeout

In this metric , we are interested to expose the count of stale due to unhealthiness machine which get terminated between two scrapings by prometheus.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

I have tested the code for different scenarios like:

  • deleting a machine obj directly doesn't increase count
  • health timeout machines in different machineDeployments don't cause race and update the counter properly
  • restarting mcm resets counter back to 0
  • stale machines due to creation timeout collection doesn't increase the counter

Release note:

Added a new metric that will allow to get the number of stale (due to unhealthiness) machines  that are getting terminated

@jguipi jguipi requested a review from a team as a code owner April 17, 2023 20:32
@gardener-robot gardener-robot added needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Apr 17, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 17, 2023
@jguipi
Copy link
Contributor Author

jguipi commented Apr 17, 2023

/area monitoring
/kind enhancement

Here's a capture of the exported metric visibility :

image

cc.: @dmahmalatsap @istvanballok @rickardsjp @himanshu-kun @rishabh-11

@gardener-robot gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension labels Apr 17, 2023
@jguipi jguipi changed the title Stale machine count Add Stale machine count to the mcm metrics Apr 17, 2023
@himanshu-kun
Copy link
Contributor

/assign

Copy link
Contributor

@himanshu-kun himanshu-kun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are many worker go-routines which reconcile the machineSets , and the current metric is per machineSet
So I don't feel this is a thread safe way of keeping count , I'll have to discuss internally and research and only then I can comment on that.
otherwise I have requested some changes.

pkg/controller/machineset.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Jun 1, 2023
@himanshu-kun himanshu-kun force-pushed the stale-machine-count branch from 771fe34 to 824c3a2 Compare June 5, 2023 16:38
@gardener-robot gardener-robot added size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) and removed size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Jun 5, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 5, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 5, 2023
@himanshu-kun himanshu-kun changed the title Add Stale machine count to the mcm metrics Add metric to track Unhealthy Stale machine removal count Jun 5, 2023
@elankath
Copy link
Contributor

elankath commented Jun 5, 2023

nice work.

@rishabh-11 rishabh-11 self-assigned this Jun 5, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 5, 2023
@dmahmalatsap
Copy link

Thanks for the refactor and for keeping us in mind in the process !

@himanshu-kun himanshu-kun removed their assignment Jun 6, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 6, 2023
Copy link
Contributor

@rishabh-11 rishabh-11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. thanks for the changes.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes needs/review Needs review labels Jun 6, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 6, 2023
Copy link
Contributor

@himanshu-kun himanshu-kun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
approving my own contribution to the PR :')

@himanshu-kun himanshu-kun merged commit 7a70e80 into gardener:master Jun 6, 2023
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 6, 2023
@himanshu-kun
Copy link
Contributor

Note: Since this metric is expose by MCM core, so it doesn't require to be vendored in mcm-providers to start exposing the metrics and will be available in next MCM release (v0.50.0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants