Alert when mimir components are restarting too often across all pipelines to avoid high storage cost #1093

Conversation
LGTM
```yaml
        description: '{{`Mimir containers are restarting too often.`}}'
      expr: |
        increase(
          kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}[1h]
```
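The diff cuts off mid-expression, so for context here is roughly what a complete rule of this shape looks like. This is only a sketch: the alert name, threshold, `for` duration, and labels are assumptions, not the values from this PR.

```yaml
# Sketch only: alert name, threshold, "for", and labels are illustrative assumptions.
- alert: MimirComponentRestartingTooOften
  annotations:
    description: '{{`Mimir containers are restarting too often.`}}'
  expr: |
    increase(
      kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}[1h]
    ) > 5
  for: 5m
  labels:
    severity: page
```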
I just ran this rule on grizzly, and it would page because prometheus-buddy is restarting a lot...
Then that's something we have to fix :)
Can we scope this alert to only the pods that access S3?
I took the buddy out of the equation and created a PM to investigate what is happening: https://github.com/giantswarm/giantswarm/issues/30403
Considering this is a BH-only alert (at least for now), I would prefer to keep it broad in case we add new components with S3 access, and reduce the scope later on. But I'm open to suggestions from @hervenicol and @QuantumEnigmaa.
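If we do decide to narrow the scope later, one option is to whitelist the S3-backed components in the selector. A sketch of what that could look like; the container list is an assumption, not something agreed in this thread:

```yaml
# Hypothetical scoping to containers that talk to S3 (the list is illustrative).
expr: |
  increase(
    kube_pod_container_status_restarts_total{
      cluster_type="management_cluster",
      namespace="mimir",
      container=~"ingester|store-gateway|compactor|ruler"
    }[1h]
  ) > 5
```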
```
@@ -86,3 +82,28 @@ tests:
          description: "Mimir ruler is failing to process PrometheusRules."
      - alertname: MimirRulerEventsFailed
        eval_time: 160m
  - interval: 1m
    input_series:
      - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}'
```
Suggested change:

```diff
-      - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}'
+      - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container!="prometheus"}'
```
I think we need to set a container that is not prometheus 😅
But then we probably need another input with container=prometheus
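For reference, `input_series` entries in promtool unit tests need concrete label values, so a negative matcher like `container!="prometheus"` cannot be used there. A minimal sketch with two concrete inputs; the container name and sample values are assumptions:

```yaml
- interval: 1m
  input_series:
    # A Mimir component whose restarts should trigger the alert (name is illustrative).
    - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container="ingester"}'
      values: '0+5x180'
    # The prometheus container that the alert is expected to ignore.
    - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container="prometheus"}'
      values: '0+5x180'
```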
It's fixed in a later version :)
Force-pushed from 7752ee2 to 8632b4d.
Before adding a new alerting rule to this repository, you should consider creating SLO rules instead. SLOs help you both increase the quality of your monitoring and reduce alert noise.

Towards: #1090

This PR duplicates #1090 for the Mimir components.
Checklist
… (oncall-kaas-cloud GitHub group).