Fixes and improvements to kube-stack-prometheus alerts #2386

anders-elastisys · 2025-01-03T12:07:43Z

Warning

This is a public repository, ensure not to disclose personal data beyond what is necessary for interacting with this pull request, nor business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

kind/admin-change
kind/dev-change
kind/security
[kind/adr](set-me)

What does this PR do / why do we need this PR?

While looking through some of the alerts and comparing them to upstream, I noticed some deviations:

Some expressions in our alerts are not computed per cluster, which is what we want.
The alertmanager alerts we have did not have the runbook URL configured; this PR adds it.
We have alerts for kube-state-metrics, but the metrics used in the queries for these alerts were missing because the self-monitoring option for kube-state-metrics is disabled by default. This PR enables it. With it enabled, the issue encountered in this PR fires the kube-state-metrics alerts, which it did not do previously. The metrics are exposed on port 8081, so the network policies needed to be updated.
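
As a rough sketch of the kube-state-metrics change, assuming the kube-prometheus-stack chart passes values to the kube-state-metrics subchart under the `kube-state-metrics` key and that the subchart exposes a `selfMonitor` block, enabling the telemetry metrics and allowing scrapes on port 8081 might look like the following. The NetworkPolicy name, namespace, and pod labels are hypothetical and are not taken from this repository, which may generate its network policies differently.

```yaml
# Helm values (assumed layout of the kube-state-metrics subchart values)
kube-state-metrics:
  selfMonitor:
    enabled: true          # exposes kube-state-metrics' own telemetry metrics
---
# Illustrative NetworkPolicy; name, namespace and labels are hypothetical
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ksm-telemetry
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics   # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus   # assumed pod label
      ports:
        - protocol: TCP
          port: 8081       # telemetry port mentioned in the description
```

Without the extra port in the policy, Prometheus could reach the regular metrics endpoint but not the telemetry endpoint, so the alert queries would still return no data.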

Fixes #

Information to reviewers

Checklist

Xartos · 2025-01-03T15:48:25Z

helmfile.d/charts/prometheus-alerts/templates/alerts/kube-state-metrics.yaml

@@ -23,9 +23,9 @@ spec:
        runbook_url: {{ .Values.defaultRules.runbookUrl }}kube-state-metrics/kubestatemetricslisterrors
        summary: kube-state-metrics is experiencing errors in list operations.
      expr: |-
-        (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m]))
+        (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) by (cluster)


Question: Won't we lose a lot of information when summing by cluster? Or maybe none of these series actually tell us anything other than that something is wrong with the cluster?
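
For context on the trade-off: in PromQL, `sum(...)` without a `by` clause collapses every label into a single series, so the removed line already discarded all labels, and adding `by (cluster)` keeps one more label than before. If finer detail were wanted, a variant could group on an additional label. The sketch below is hypothetical and not part of this PR; it assumes the metric carries a `resource` label, and the denominator and the `0.01` threshold are assumed to follow the upstream rule since they are not visible in the diff above.

```yaml
# Hypothetical rule fragment for comparison only
- alert: KubeStateMetricsListErrors
  expr: |-
    (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) by (cluster, resource)
      /
    sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m])) by (cluster, resource))
    > 0.01
```

Grouping by `resource` as well would produce one alert per resource type and cluster instead of one per cluster, which is more specific but also noisier.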

anders-elastisys added 4 commits January 3, 2025 12:42

apps sc: add runbookurls for alertmanager alerts

0c28075

apps sc: update and fix kubernetes-resources alerts

a34da82

apps sc: update kube-state-metrics alerts

9204e09

apps: enable kube-state-metrics metrics

0221a01

anders-elastisys requested review from a team as code owners January 3, 2025 12:07

Xartos reviewed Jan 3, 2025
