OSD-22337 route some customer alerts to null receiver #342

Open
wants to merge 2 commits into
base: master

tkong-redhat commented Oct 1, 2024

Some customer-defined alerts are picked up by CAMO and routed to Red Hat PagerDuty because the customer set labels that name Red Hat managed namespaces.

This PR adds route rules that redirect those alerts to the null receiver so that Red Hat SRE is not paged.

The rules have been tested on a stg cluster.
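
For readers unfamiliar with the routing behaviour, here is a minimal, self-contained sketch of why an alertname route placed ahead of a namespace route keeps the alert away from the paging receiver. The `route` type, the `pickReceiver` helper, the receiver names, and the `openshift-monitoring` label value are all assumptions made for illustration; this is not the operator's actual code, and it models only exact-match routes where the first matching sibling wins.

```go
package main

import "fmt"

// route is a simplified, hypothetical stand-in for an Alertmanager route.
// It is not the operator's real type.
type route struct {
	Receiver string            // receiver name (illustrative values below)
	Match    map[string]string // label name -> exact value required
}

// matches reports whether the alert carries every label value the route requires.
// A label missing from the alert is treated as the empty string.
func (r route) matches(labels map[string]string) bool {
	for name, want := range r.Match {
		if labels[name] != want {
			return false
		}
	}
	return true
}

// pickReceiver walks sibling routes in order and returns the receiver of the
// first match, falling back to the given default when nothing matches.
func pickReceiver(routes []route, labels map[string]string, fallback string) string {
	for _, r := range routes {
		if r.matches(labels) {
			return r.Receiver
		}
	}
	return fallback
}

func main() {
	routes := []route{
		// The null route has to come before the namespace route to win.
		{Receiver: "null", Match: map[string]string{"alertname": "memory-request"}},
		{Receiver: "pagerduty", Match: map[string]string{"namespace": "openshift-monitoring"}},
	}
	// A customer alert that happens to carry a managed-namespace label.
	customerAlert := map[string]string{
		"alertname": "memory-request",
		"namespace": "openshift-monitoring",
	}
	fmt.Println(pickReceiver(routes, customerAlert, "default")) // prints: null
}
```

Swapping the two routes in this sketch would send the same alert to the paging receiver, which is the situation the PR description reports.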

Contributor

openshift-ci bot commented Oct 1, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tkong-redhat
Once this PR has been reviewed and has the lgtm label, please assign mmazur for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Oct 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.19%. Comparing base (cdb7cd9) to head (328c19f).
Report is 26 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #342      +/-   ##
==========================================
+ Coverage   66.01%   66.19%   +0.18%     
==========================================
  Files           7        7              
  Lines         915      920       +5     
==========================================
+ Hits          604      609       +5     
  Misses        288      288              
  Partials       23       23              

| Files with missing lines | Coverage Δ |
| --- | --- |
| controllers/secret_controller.go | 91.37% <100.00%> (+0.07%) ⬆️ |

@dustman9000
Member

/hold for discussion

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2024
@@ -457,6 +457,13 @@ func createSubroutes(namespaceList []string, receiver receiverType) *alertmanage
  // Route ClusterOperatorDown for insights to null receiver https://issues.redhat.com/browse/OSD-19800
  // Also needs to be silenced for FedRAMP until its made available in the environment https://issues.redhat.com/browse/OSD-13685
  {Receiver: receiverNull, Match: map[string]string{"alertname": "ClusterOperatorDown", "name": "insights"}},
+ // Route some customer defined alerts to null receiver
+ // https://issues.redhat.com/browse/OSD-22337
+ {Receiver: receiverNull, Match: map[string]string{"alertname": "memory-request"}},
Member

Filtering on alertname is not a great long-term solution. Is there a better way to identify these "user defined" alerts?

Author

This is a short-term solution. There is no easy way to identify those "user defined" alerts. We have had several discussions about this, and our findings are documented here: https://docs.google.com/document/d/1OscbdlZ-aBuwY7YKJsU5URyDvfS9A4-VV4yBhpowfy4/edit#heading=h.i7a6b1h441g

Author

We should use better labels to distinguish Red Hat defined alerts from user-defined alerts, but that requires more discussion and would affect nearly all existing alerts. That is why we are putting a short-term solution here.

Copy link

> This is not a great long term solution to filter on alertname, is there a better way to identify these "user defined" alerts?

Agreed. This PR's current logic relies on matching specific alertname values to identify user-defined alerts and route them to the null receiver. However, if an alert is renamed, or a new customer-defined alert is not explicitly added to this list, it will bypass the null receiver.

Could we match on a label such as user_alert=true, or something similar, instead of relying on alert names?
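
To make the trade-off concrete, the sketch below compares an alertname-based null route with a label-based one for an alert that has been renamed. Everything here is hypothetical: the `route` type and helpers are simplified stand-ins rather than the operator's code, `memory-request-v2` is an invented name, and the `user_alert` label is only the suggestion from this comment. The sketch assumes user-defined alerts would actually carry that label.

```go
package main

import "fmt"

// route is a hypothetical, simplified route: a receiver plus exact label matches.
type route struct {
	Receiver string
	Match    map[string]string
}

// matchesAll reports whether labels contains every name/value pair in match.
// A label missing from the alert is treated as the empty string.
func matchesAll(match, labels map[string]string) bool {
	for name, want := range match {
		if labels[name] != want {
			return false
		}
	}
	return true
}

// routedToNull reports whether any route sends the alert to the "null" receiver.
func routedToNull(routes []route, labels map[string]string) bool {
	for _, r := range routes {
		if r.Receiver == "null" && matchesAll(r.Match, labels) {
			return true
		}
	}
	return false
}

func main() {
	// Approach in this PR: one null route per known alert name.
	byName := []route{
		{Receiver: "null", Match: map[string]string{"alertname": "memory-request"}},
	}
	// Suggested alternative: a single null route keyed on an assumed label.
	byLabel := []route{
		{Receiver: "null", Match: map[string]string{"user_alert": "true"}},
	}

	// A customer renames the alert; the name-based route no longer matches it.
	renamed := map[string]string{"alertname": "memory-request-v2", "user_alert": "true"}

	fmt.Println(routedToNull(byName, renamed))  // prints: false
	fmt.Println(routedToNull(byLabel, renamed)) // prints: true
}
```

The label-based route only helps if something guarantees the label is present on user-defined alerts, which is the agreement the thread says does not exist yet.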

Author

TBH, I would prefer the long-term solution of adding some common labels so that we could easily distinguish user-defined alerts from Red Hat defined alerts. That would be lovely and would make our lives easier. Unfortunately, we don't have this kind of definition in any of our docs, so there are no standard labels to use ATM. It would take a lot of effort to discuss with the BU and wider teams which labels should be used and to reach an agreement.

Contributor

openshift-ci bot commented Dec 9, 2024

@tkong-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/test | 328c19f | link | true | /test test |
| ci/prow/validate | 328c19f | link | true | /test validate |
| ci/prow/coverage | 328c19f | link | true | /test coverage |
| ci/prow/lint | 328c19f | link | true | /test lint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2024
@openshift-merge-robot
Contributor

PR needs rebase.

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
4 participants