Missing proper runbooks for some Prometheus alerts #2377
Labels
app/prometheus
Prometheus - Metrics Collection
kind/improvement
Improvement of existing features, e.g. code cleanup or optimizations.
Description
As a follow-up to #2374: some alerts installed in Welkin have URLs in their annotations that link to non-existing runbooks. Some of these alerts were taken from upstream and some were created for Welkin.
For the alerts taken from upstream, the same runbook URLs are used, and upstream is simply waiting for contributions to the runbooks, as mentioned in prometheus-operator/kube-prometheus#1535 and povilasv/coredns-mixin#15.
The links can be kept and will eventually work if someone contributes the runbooks upstream, or we can create the runbooks ourselves if necessary.
For alerts created by us for Welkin, we should identify whether they need runbooks, then either create the runbook or remove the runbook URL to avoid further confusion.
Additional context
Here is a list of the alerts with this issue, in three groups:
1. Prometheus official alerts:
KubeJobNotCompleted
PrometheusScrapeBodySizeLimitHit
PrometheusScrapeSampleLimitHit
2. Alerts that are not official Prometheus alerts but have an upstream (all DNS alerts):
CorednsDown
CorednsLatencyHigh
CorednsErrorsHigh
CorednsErrorsHigh
CorednsForwardLatencyHigh
CorednsForwardErrorsHigh
CorednsForwardErrorsHigh
CorednsForwardHealthcheckFailureCount
CorednsForwardHealthcheckBrokenCount
CorednsPanicCount
Node-Local-DnsDown
Node-Local-DnsLatencyHigh
Node-Local-DnsErrorsHigh
Node-Local-DnsErrorsHigh
Node-Local-DnsForwardLatencyHigh
Node-Local-DnsForwardErrorsHigh
Node-Local-DnsForwardErrorsHigh
Node-Local-DnsForwardHealthcheckFailureCount
Node-Local-DnsForwardHealthcheckBrokenCount
Node-Local-DnsPanicCount
3. Welkin alerts without upstream:
LessKubeletsThanNodes
HarborBackupHaveFailed24Hours
HarborBackupHaveFailed48Hours
OpenSearchBackupHaveFailed24Hours
OpenSearchBackupHaveFailed48Hours
OpenSearchSnapshotHaveFailed24Hours
OpenSearchSnapshotHaveFailed48Hours
VeleroBackupHaveFailed24Hours
VeleroBackupHaveFailed48Hours
Definition of done