Improve Monitoring/Alerting/Metrics #211

Open
7 tasks
PadmaB opened this issue Jan 24, 2019 · 7 comments
Labels
area/monitoring Monitoring (including availability monitoring and alerting) related effort/1m Effort for issue is around 1 month kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age) needs/planning Needs (more) planning with other MCM maintainers platform/all priority/2 Priority (lower number equals higher priority) topology/seed Affects Seed clusters

Comments

@PadmaB
Contributor

PadmaB commented Jan 24, 2019

Story

As a provider, I want timely alerts raised based on the metrics so that I can make informed decisions.

Motivation

Acceptance Criteria

  • Define alerts for the above situations to take required action

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit tests are provided: Have you written automated unit tests?
  • Integration tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide about ops-relevant changes?
  • User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

Possible metrics to add (rough work)

  • We could provide metrics on the number of machines in each status, so that filtering on status is possible (if not already exposed).
  • Metrics on the time a machine takes to join could be added; this would show the overall average joining time on any provider.
  • When MCM scaled up or down, and when CA did.
  • Metrics that could help solve typical DoD issues, such as a node not joining.
  • How long each resource (VM, disk) took to create, especially on Azure.
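As a rough illustration of the first two ideas above, the following Python sketch derives a per-status machine count and an average join time from a list of machine records. The `Machine` record, its fields, and the status names are hypothetical and not taken from MCM; an actual implementation would presumably export such values as Prometheus metrics from the controller.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Machine:
    # Hypothetical record; field names are illustrative, not MCM's API.
    name: str
    status: str                         # e.g. "Running", "Pending", "Failed"
    created_at: float                   # seconds since epoch
    joined_at: Optional[float] = None   # None until the node has joined


def machines_by_status(machines):
    """Count machines per status, e.g. to back a gauge labelled by status."""
    return Counter(m.status for m in machines)


def average_join_seconds(machines):
    """Mean time from creation to node join; None if no machine has joined."""
    durations = [m.joined_at - m.created_at
                 for m in machines if m.joined_at is not None]
    return mean(durations) if durations else None


machines = [
    Machine("m1", "Running", created_at=0.0, joined_at=120.0),
    Machine("m2", "Running", created_at=10.0, joined_at=190.0),
    Machine("m3", "Pending", created_at=20.0),
]
counts = machines_by_status(machines)
avg_join = average_join_seconds(machines)
```

Machines that have not joined yet are excluded from the average here; whether they should instead count towards a separate "not joined" metric is an open design choice.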
@prashanth26 prashanth26 added kind/enhancement Enhancement, improvement, extension platform/all area/monitoring Monitoring (including availability monitoring and alerting) related size/s Size of pull request is small (see gardener-robot robot/bots/size.py) topology/seed Affects Seed clusters labels Feb 4, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Apr 6, 2019
@prashanth26
Contributor

prashanth26 commented Apr 24, 2019

For now, I have tried to expose at least a few crucial metrics to the Gardener Prometheus. See gardener/gardener#948.

However, before we try to create a dashboard and raise alerts, we need to enhance the metrics so that they always return a value instead of being blank (for example mcm_cloud_api_requests_failed_total, mcm_cloud_api_requests_total, and mcm_machine_deployment_failed_machines). See gardener/gardener#948 (comment).
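One common way to avoid blank series is to initialise every expected label combination to zero at start-up, so that a scrape returns a value even before the first event occurs. The sketch below shows the idea with a toy counter in plain Python. The `LabelledCounter` class and the `provider` label with its values are assumptions made for illustration; this is neither the Prometheus client API nor MCM's actual label set.

```python
class LabelledCounter:
    """Toy counter keyed by one label value; not the Prometheus client API."""

    def __init__(self, name, label, known_values=()):
        self.name = name
        self.label = label
        # Pre-create every expected series at 0 so a scrape is never empty.
        self.values = {v: 0 for v in known_values}

    def inc(self, label_value, amount=1):
        self.values[label_value] = self.values.get(label_value, 0) + amount

    def expose(self):
        """Render the series in Prometheus text exposition style."""
        return [f'{self.name}{{{self.label}="{v}"}} {n}'
                for v, n in sorted(self.values.items())]


# Without pre-initialisation nothing is exposed until the first failure.
lazy = LabelledCounter("mcm_cloud_api_requests_failed_total", "provider")

# With pre-initialisation each known (hypothetical) provider reports 0.
eager = LabelledCounter("mcm_cloud_api_requests_failed_total", "provider",
                        known_values=("aws", "azure"))
before = eager.expose()
eager.inc("azure")
after = eager.expose()
```

With the series always present, an alert expression can tell "no failures" (value 0) apart from "metric missing", which is what a dashboard or alert rule needs.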

@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jun 25, 2019
@gardener-robot-ci-2 gardener-robot-ci-2 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Oct 4, 2019
@ghost ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 3, 2019
@ghost ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Feb 2, 2020
@ghost ghost added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Apr 3, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jun 2, 2020
@prashanth26 prashanth26 removed the lifecycle/rotten Nobody worked on this for 12 months (final aging stage) label Aug 13, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 13, 2020
@prashanth26
Contributor

/touch
/priority critical

@gardener-robot gardener-robot added priority/critical Needs to be resolved soon, because it impacts users negatively and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Oct 29, 2020
@hardikdr hardikdr changed the title Raise appropriate alerts based on the MCM metrics Improve monitoring and alerting for worker-machines. Nov 6, 2020
@hardikdr hardikdr added this to the 2021-Q2 milestone Nov 6, 2020
@prashanth26 prashanth26 added priority/3 Priority (lower number equals higher priority) effort/1m Effort for issue is around 1 month and removed priority/2 Priority (lower number equals higher priority) effort/2d Effort for issue is around 2 days labels Mar 30, 2021
@prashanth26
Contributor

prashanth26 commented Mar 30, 2021

Adding feedback from #549, #528

@amshuman-kr amshuman-kr modified the milestones: 2021-Q2, 2021-Q3 Jun 7, 2021
@himanshu-kun himanshu-kun modified the milestones: 2021-Q3, 2022-Q2 Jan 6, 2022
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Nov 14, 2022
@himanshu-kun himanshu-kun added priority/2 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers and removed lifecycle/stale Nobody worked on this for 6 months (will further age) priority/3 Priority (lower number equals higher priority) labels Feb 20, 2023
@himanshu-kun himanshu-kun removed this from the 2022-Q2 milestone Feb 20, 2023
@elankath
Contributor

elankath commented Feb 20, 2023

We need to introduce metrics for the following cases:

@elankath elankath changed the title Improve Monitoring and Alerting for Worker Machines. Improve Monitoring/Alerting/Metrics Feb 21, 2023
@gardener-robot

@elankath You have mentioned internal references in the public. Please check.

@himanshu-kun himanshu-kun pinned this issue Apr 5, 2023
@himanshu-kun himanshu-kun unpinned this issue Nov 24, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 24, 2024
10 participants