Missing proper runbooks to some prometheus alerts #2377

Open
2 tasks
HaoruiPeng opened this issue Dec 18, 2024 · 0 comments
Labels
app/prometheus Prometheus - Metrics Collection kind/improvement Improvement of existing features, e.g. code cleanup or optimizations.

Description

As a follow-up to #2374: some alerts installed in Welkin have runbook URLs in their annotations that link to non-existent runbooks. Some of these alerts were taken over from upstream and some were created for Welkin.

The alerts taken over from upstream keep the upstream runbook URLs, and upstream is simply waiting for contributions to those runbooks, as mentioned in prometheus-operator/kube-prometheus#1535 and povilasv/coredns-mixin#15.
These links can be kept and will eventually work if someone contributes the runbooks upstream, or we can create the runbooks ourselves if necessary.

For the alerts we created for Welkin, we should identify whether they need runbooks, and then either create the runbook or remove the runbook URL to avoid further confusion.
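
One way to find alerts whose `runbook_url` points at a missing page is to check each URL and collect the alerts where the check fails. The snippet below is only a rough sketch, not how Welkin does it: it assumes the rules are already loaded as Prometheus-style dictionaries with `alert` and `annotations` keys, and the helper names `url_exists` and `alerts_with_dead_runbooks` are made up for illustration.

```python
import urllib.request
from typing import Callable, Iterable, List, Mapping


def url_exists(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to `url` answers with a status below 400."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts; ValueError covers malformed URLs.
        return False


def alerts_with_dead_runbooks(
    rules: Iterable[Mapping],
    check: Callable[[str], bool] = url_exists,
) -> List[str]:
    """Return the names of alerting rules whose runbook_url fails `check`."""
    dead = []
    for rule in rules:
        name = rule.get("alert")  # recording rules have no "alert" key and are skipped
        url = rule.get("annotations", {}).get("runbook_url")
        if name and url and not check(url):
            dead.append(name)
    return dead
```

Passing a custom `check` makes it possible to run the same logic offline, for example against a cached list of known runbook pages. Some servers do not answer HEAD requests properly, so a real check may need to fall back to GET.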

Additional context

Here is a list of alerts with this issue, in 3 types:
1. Prometheus official alerts:
KubeJobNotCompleted
PrometheusScrapeBodySizeLimitHit
PrometheusScrapeSampleLimitHit

2. Alerts that are not official Prometheus alerts but have an upstream (all DNS alerts):
CorednsDown
CorednsLatencyHigh
CorednsErrorsHigh
CorednsErrorsHigh
CorednsForwardLatencyHigh
CorednsForwardErrorsHigh
CorednsForwardErrorsHigh
CorednsForwardHealthcheckFailureCount
CorednsForwardHealthcheckBrokenCount
CorednsPanicCount
Node-Local-DnsDown
Node-Local-DnsLatencyHigh
Node-Local-DnsErrorsHigh
Node-Local-DnsErrorsHigh
Node-Local-DnsForwardLatencyHigh
Node-Local-DnsForwardErrorsHigh
Node-Local-DnsForwardErrorsHigh
Node-Local-DnsForwardHealthcheckFailureCount
Node-Local-DnsForwardHealthcheckBrokenCount
Node-Local-DnsPanicCount

3. Welkin alerts without upstream:
LessKubeletsThanNodes
HarborBackupHaveFailed24Hours
HarborBackupHaveFailed48Hours
OpenSearchBackupHaveFailed24Hours
OpenSearchBackupHaveFailed48Hours
OpenSearchSnapshotHaveFailed24Hours
OpenSearchSnapshotHaveFailed48Hours
VeleroBackupHaveFailed24Hours
VeleroBackupHaveFailed48Hours

Definition of done

  • Decide whether runbooks are necessary for the alerts listed above, and remove the runbook_url from the Welkin alerts (type 3) that don't need one.
  • Create runbooks where necessary, and change the URLs to point to the newly created runbooks.
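
For the first task, removing the `runbook_url` annotation from selected alerts could look roughly like the sketch below. It again assumes rules are available as Prometheus-style dictionaries; the function name `strip_runbook_url` is hypothetical, and how the rules are loaded from and written back to the Welkin rule files is left out.

```python
import copy
from typing import Iterable, List, Mapping, Set


def strip_runbook_url(rules: Iterable[Mapping], alert_names: Set[str]) -> List[dict]:
    """Return a copy of `rules` with runbook_url removed from the named alerts."""
    cleaned = [copy.deepcopy(dict(rule)) for rule in rules]
    for rule in cleaned:
        if rule.get("alert") in alert_names:
            # pop with a default so alerts without the annotation are left alone
            rule.get("annotations", {}).pop("runbook_url", None)
    return cleaned


# Example with one of the type 3 alerts from the list above;
# the URL is a placeholder, not the real annotation value.
example_rules = [
    {
        "alert": "LessKubeletsThanNodes",
        "annotations": {
            "summary": "placeholder summary",
            "runbook_url": "https://example.invalid/runbook",
        },
    },
]
updated = strip_runbook_url(example_rules, {"LessKubeletsThanNodes"})
```

The function works on a deep copy, so the input rules stay unchanged and the result can be compared against the original before anything is written back.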
@HaoruiPeng HaoruiPeng added kind/improvement Improvement of existing features, e.g. code cleanup or optimizations. app/prometheus Prometheus - Metrics Collection labels Dec 18, 2024