Missing proper runbooks to some prometheus alerts #2377

HaoruiPeng · 2024-12-18T16:29:23Z

Description

As a follow-up to #2374: some alerts installed in Welkin have runbook URLs in their annotations that point to non-existent runbooks. Some of these alerts come from upstream and some were created for Welkin.

For the alerts that come from upstream, the upstream runbook URLs are used as-is; upstream is simply waiting for contributions to the runbooks, as mentioned in prometheus-operator/kube-prometheus#1535 and povilasv/coredns-mixin#15.
These links can be kept and will eventually work if someone contributes the runbooks upstream, or we can create the runbooks ourselves if necessary.

For the alerts we created for Welkin, we should identify whether they need runbooks. We should then either create the runbook or remove the runbook URL to avoid further confusion.

Additional context

Here is a list of the affected alerts, grouped into three types:
1. Official Prometheus alerts:
KubeJobNotCompleted
PrometheusScrapeBodySizeLimitHit
PrometheusScrapeSampleLimitHit

2. Alerts that are not official Prometheus alerts but have an upstream (all DNS alerts):
CorednsDown
CorednsLatencyHigh
CorednsErrorsHigh
CorednsErrorsHigh
CorednsForwardLatencyHigh
CorednsForwardErrorsHigh
CorednsForwardErrorsHigh
CorednsForwardHealthcheckFailureCount
CorednsForwardHealthcheckBrokenCount
CorednsPanicCount
Node-Local-DnsDown
Node-Local-DnsLatencyHigh
Node-Local-DnsErrorsHigh
Node-Local-DnsErrorsHigh
Node-Local-DnsForwardLatencyHigh
Node-Local-DnsForwardErrorsHigh
Node-Local-DnsForwardErrorsHigh
Node-Local-DnsForwardHealthcheckFailureCount
Node-Local-DnsForwardHealthcheckBrokenCount
Node-Local-DnsPanicCount

3. Welkin alerts without upstream:
LessKubeletsThanNodes
HarborBackupHaveFailed24Hours
HarborBackupHaveFailed48Hours
OpenSearchBackupHaveFailed24Hours
OpenSearchBackupHaveFailed48Hours
OpenSearchSnapshotHaveFailed24Hours
OpenSearchSnapshotHaveFailed48Hours
VeleroBackupHaveFailed24Hours
VeleroBackupHaveFailed48Hours

Definition of done

Decide whether runbooks are necessary for the alerts listed above, and remove the runbook_url from the Welkin alerts (type 3) that do not need one.
Create runbooks where necessary, and update the URLs to point to the newly created runbooks.
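
To make the first step easier to repeat, the audit could be scripted. Below is a minimal sketch, not something that exists in Welkin today. It assumes the rules are available as the JSON returned by the Prometheus HTTP API endpoint `/api/v1/rules`, and that a runbook counts as "existing" when its URL answers a HEAD request with a status below 400. The function names (`collect_runbook_urls`, `url_resolves`, `find_broken_runbooks`) are hypothetical. Note that a plain HTTP check cannot detect a missing `#anchor` inside an existing page, so links of that kind would still need a manual look.

```python
import urllib.request
from typing import Callable, Dict, List, Tuple


def collect_runbook_urls(rules_response: dict) -> Dict[str, List[str]]:
    """Map each runbook_url to the names of the alerting rules that use it.

    `rules_response` is assumed to have the shape of the Prometheus
    `/api/v1/rules` response: data.groups[].rules[].
    """
    urls: Dict[str, List[str]] = {}
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") != "alerting":
                continue  # recording rules carry no runbook annotations
            url = rule.get("annotations", {}).get("runbook_url")
            if url:
                urls.setdefault(url, []).append(rule["name"])
    return urls


def url_resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to `url` succeeds (assumed criterion)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # OSError covers HTTP errors, DNS failures and timeouts;
        # ValueError covers malformed URLs.
        return False


def find_broken_runbooks(
    rules_response: dict,
    check: Callable[[str], bool] = url_resolves,
) -> List[Tuple[str, str]]:
    """Return sorted, de-duplicated (alert name, runbook_url) pairs that fail `check`."""
    broken = set()
    for url, alerts in collect_runbook_urls(rules_response).items():
        if not check(url):  # each distinct URL is checked once
            broken.update((alert, url) for alert in alerts)
    return sorted(broken)
```

Passing the parsed JSON from `/api/v1/rules` to `find_broken_runbooks` would give the list of alerts to review. Alerts that exist at several severities under the same name and URL are reported once.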

HaoruiPeng added the kind/improvement and app/prometheus labels on Dec 18, 2024
