Add additional monitoring rules to the PrometheusRule #791

Open: gnunn1 wants to merge 5 commits into base: master

gnunn1 (Collaborator) commented Oct 15, 2024

What type of PR is this?
/kind enhancement

What does this PR do / why we need it:

This PR provides additional alerting rules. Specifically, it captures the following situations:

  1. Application is in an Unknown Sync state (critical)
  2. Application is in a Degraded Health state (critical)
  3. Application has been Progressing for more than 10 minutes (warning)
  4. Application is in a Health state that isn't Healthy, Degraded, Suspended or Progressing (warning). Note we exclude Degraded and Progressing since they are captured by other rules.
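
As a rough illustration of the four situations above, here is a minimal Go sketch that renders them as PromQL expressions over Argo CD's `argocd_app_info` metric. It is not the code in this PR: the `alertRule` type, the `buildRules` helper, the alert names, and the `5m` durations are all hypothetical. Only the 10 minute window and the severities come from the description.

```go
package main

import "fmt"

// alertRule is a simplified, hypothetical stand-in for a Prometheus alerting rule.
type alertRule struct {
	Alert    string // alert name (illustrative, not taken from the PR)
	Expr     string // PromQL expression
	For      string // how long the condition must hold before the alert fires
	Severity string
}

// buildRules returns one rule per situation listed above, scoped to a namespace.
func buildRules(namespace string) []alertRule {
	// sel builds an argocd_app_info selector for the namespace plus one label matcher.
	sel := func(matcher string) string {
		return fmt.Sprintf(`argocd_app_info{namespace=%q,%s} > 0`, namespace, matcher)
	}
	return []alertRule{
		// 1. Unknown sync state (critical). The 5m duration is an assumption.
		{Alert: "ArgoCDSyncUnknownAlert", Expr: sel(`sync_status="Unknown"`), For: "5m", Severity: "critical"},
		// 2. Degraded health state (critical). The 5m duration is an assumption.
		{Alert: "ArgoCDHealthDegradedAlert", Expr: sel(`health_status="Degraded"`), For: "5m", Severity: "critical"},
		// 3. Progressing for more than 10 minutes (warning).
		{Alert: "ArgoCDHealthProgressingAlert", Expr: sel(`health_status="Progressing"`), For: "10m", Severity: "warning"},
		// 4. Any other health state (warning); states covered elsewhere are excluded.
		{Alert: "ArgoCDHealthOtherAlert", Expr: sel(`health_status!~"Healthy|Degraded|Suspended|Progressing"`), For: "5m", Severity: "warning"},
	}
}

func main() {
	for _, r := range buildRules("openshift-gitops") {
		fmt.Printf("%s [%s, for %s]\n  %s\n", r.Alert, r.Severity, r.For, r.Expr)
	}
}
```

The negative regex matcher in the fourth rule is what keeps it from overlapping with the Degraded and Progressing rules.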

This helps users better monitor Argo CD Applications using the built-in OpenShift monitoring stack.

A couple of additional comments:

  1. The rule for Progressing for more than 10 minutes might ruffle some feathers, since the health check for Subscriptions leaves them in a Progressing state rather than Suspended. I'm working on adjusting the health check upstream, but it's not there yet. Note that the alert can be silenced if customers find it annoying; we could also lower the severity to info.

  2. I chose to make Unknown for Sync State critical since it means the Application is not syncing properly. However, if folks feel this is too high, it can be dropped down to warning. If we do this, it can be combined with the ArgoCDSyncAlert since they would share the same severity.

  3. I wanted to rename ArgoCDSyncAlert to ArgoCDOutOfSyncAlert, but realized that customers may have monitoring and configuration that depend on this name, so I have left it unchanged.

Have you updated the necessary documentation?

  • Documentation update is required by this PR.
  • Documentation has been updated.

The documentation does not mention specific alerts AFAIK so I do not feel like it needs to be covered. However this should be included in the release notes.

Which issue(s) this PR fixes:

https://issues.redhat.com/browse/GITOPS-4873

Test acceptance criteria:

  • [*] Unit Test
  • E2E Test

Updated the unit tests. However, I wonder whether the approach could be improved by parameterizing the MonitoringRules so that the code and the unit tests share the same definitions.
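
One possible shape for that parameterization, sketched in Go under stated assumptions: a single table of rule definitions that both the reconciler and the unit tests iterate over. The `ruleSpec` type, `monitoringRuleSpecs`, `exprFor`, and `verifyRules` are hypothetical names, and the expressions and `5m` duration are illustrative rather than copied from the operator.

```go
package main

import "fmt"

// ruleSpec is a hypothetical, namespace-independent definition of one alert rule.
type ruleSpec struct {
	Alert    string
	Matcher  string // label matcher applied to argocd_app_info
	For      string
	Severity string
}

// monitoringRuleSpecs would be the single source of truth shared by the
// reconciler and the tests. Only two entries are shown; values are illustrative.
var monitoringRuleSpecs = []ruleSpec{
	{Alert: "ArgoCDSyncAlert", Matcher: `sync_status="OutOfSync"`, For: "5m", Severity: "warning"},
	{Alert: "ArgoCDHealthProgressingAlert", Matcher: `health_status="Progressing"`, For: "10m", Severity: "warning"},
}

// exprFor renders the PromQL expression for a spec in a given namespace.
func exprFor(spec ruleSpec, namespace string) string {
	return fmt.Sprintf(`argocd_app_info{namespace=%q,%s} > 0`, namespace, spec.Matcher)
}

// verifyRules is what a unit test could call: it checks that every shared
// definition appears in the reconciled rules (alert name -> expression).
func verifyRules(got map[string]string, namespace string) error {
	for _, spec := range monitoringRuleSpecs {
		expr, ok := got[spec.Alert]
		if !ok {
			return fmt.Errorf("missing rule %s", spec.Alert)
		}
		if want := exprFor(spec, namespace); expr != want {
			return fmt.Errorf("rule %s: got %q, want %q", spec.Alert, expr, want)
		}
	}
	return nil
}

func main() {
	// Simulate the rules a reconciler would have produced for one namespace.
	got := map[string]string{}
	for _, spec := range monitoringRuleSpecs {
		got[spec.Alert] = exprFor(spec, "openshift-gitops")
	}
	fmt.Println("verify error:", verifyRules(got, "openshift-gitops"))
}
```

With this layout, adding a rule means adding one table entry, and the tests pick it up without restating the expression.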

How to test changes / Special notes to the reviewer:

Deploy applications with bad sync and health statuses and verify that OpenShift alerts fire after the alert duration expires.

@openshift-ci openshift-ci bot added the kind/enhancement New feature or request label Oct 15, 2024
@openshift-ci openshift-ci bot requested review from trdoyle81 and wtam2018 October 15, 2024 21:53

openshift-ci bot commented Oct 15, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign svghadi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gnunn1 (Collaborator, Author) commented Oct 16, 2024

/retest

openshift-ci bot commented Nov 26, 2024

@gnunn1: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/v4.17-e2e | b49ab92 | link | true | /test v4.17-e2e |
| ci/prow/v4.17-kuttl-parallel | b49ab92 | link | true | /test v4.17-kuttl-parallel |
| ci/prow/v4.17-kuttl-sequential | b49ab92 | link | true | /test v4.17-kuttl-sequential |

