Disable Alerts for Prometheus Remote Writes #146

arjunrn (Contributor) commented Mar 10, 2021

Remote writes of metrics data to Observatorium can occasionally fail due to an outage, which triggers an alert in-cluster. To prevent alert fatigue, let's disable the alerts related to remote writes.

Signed-off-by: Arjun Naik <[email protected]>
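
As a rough illustration of what disabling these alerts could look like, here is a minimal Go sketch. It is not the code from this PR. It assumes the alerts are silenced by routing them to a receiver that notifies nobody. The `Route` type, the `null` and `pagerduty` receiver names, and the helper functions are hypothetical. The three remote-write alert names come from the upstream Prometheus alerting rules, and which alerts this PR actually touches is not stated in the conversation.

```go
package main

import "fmt"

// Route is a hypothetical, simplified stand-in for an Alertmanager route:
// alerts whose name equals AlertName are delivered to Receiver.
type Route struct {
	AlertName string
	Receiver  string
}

// remoteWriteAlerts lists alert names assumed for this sketch. They are taken
// from the upstream Prometheus alerting rules, not from this PR's diff.
var remoteWriteAlerts = []string{
	"PrometheusRemoteStorageFailures",
	"PrometheusRemoteWriteBehind",
	"PrometheusRemoteWriteDesiredShards",
}

// nullReceiver is an assumed name for a receiver with no notification targets.
const nullReceiver = "null"

// remoteWriteRoutes builds one route per remote-write alert, each pointing
// at the null receiver so the alert is dropped instead of paging anyone.
func remoteWriteRoutes() []Route {
	routes := make([]Route, 0, len(remoteWriteAlerts))
	for _, name := range remoteWriteAlerts {
		routes = append(routes, Route{AlertName: name, Receiver: nullReceiver})
	}
	return routes
}

// receiverFor returns the receiver of the first route matching alertName,
// or fallback when no route matches.
func receiverFor(alertName string, routes []Route, fallback string) string {
	for _, r := range routes {
		if r.AlertName == alertName {
			return r.Receiver
		}
	}
	return fallback
}

func main() {
	routes := remoteWriteRoutes()
	// "pagerduty" is a hypothetical default receiver for all other alerts.
	for _, name := range []string{"PrometheusRemoteWriteBehind", "KubeNodeNotReady"} {
		fmt.Printf("%s -> %s\n", name, receiverFor(name, routes, "pagerduty"))
	}
}
```

Keeping the silenced alert names in one list would make the change easy to undo: removing the entries restores normal routing for those alerts.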
codecov bot commented Mar 10, 2021

Codecov Report

Merging #146 (6adad5e) into master (284374c) will increase coverage by 0.22%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #146      +/-   ##
==========================================
+ Coverage   64.23%   64.46%   +0.22%     
==========================================
  Files           8        8              
  Lines         467      470       +3     
==========================================
+ Hits          300      303       +3     
  Misses        153      153              
  Partials       14       14              

Impacted Files | Coverage Δ
pkg/controller/secret/secret_controller.go | 88.88% <100.00%> (+0.11%) ⬆️

jharrington22 (Contributor) left a comment

/lgtm

openshift-ci-robot added the lgtm label Mar 11, 2021

jharrington22 (Contributor) commented

@cblecker @lisa @jewzaam this PR silences the Prometheus alerts for remote writes to OSD's Observatorium. These alerts haven't been seen before because remote write is not enabled on clusters. As the Observatorium team is scaling up to handle the load, we want to silence these alerts initially so that we don't fatigue SRE. We have a plan to re-enable these alerts: https://issues.redhat.com/browse/OSD-6709

jewzaam (Member) commented Mar 11, 2021

Before removing the hold, please add a card that tracks undoing this change, with the condition that Observatorium has an SLA. At a minimum, we care about these metrics for reporting the status of clusters via OCM. Once the Observatorium service is better supported, we need to care about when it's offline, at least so we can ship those alerts to the Observatorium team.

/hold
/approve

openshift-ci-robot added the do-not-merge/hold label Mar 11, 2021

openshift-ci-robot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arjunrn, jewzaam, jharrington22

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label Mar 11, 2021

jewzaam (Member) commented Mar 11, 2021

Guess @jharrington22 provided what I asked for right when I asked. Thanks! Removing hold. :shipit:

/hold cancel

openshift-ci-robot removed the do-not-merge/hold label Mar 11, 2021
openshift-merge-robot merged commit f9dde14 into openshift:master Mar 11, 2021
arjunrn deleted the disable-prometheus-remote-write-alerts branch March 11, 2021 15:30
Labels: approved, lgtm