Alert when mimir components are restarting too often across all pipe… #1093

Merged: QuentinBisson merged 3 commits into master from add-cloud-protection-alert-for-mimir on Mar 28, 2024

Conversation

QuentinBisson
Contributor

…lines to avoid high storage cost

Before adding a new alerting rule to this repository, you should consider creating an SLO rule instead.
SLOs help you both increase the quality of your monitoring and reduce alert noise.


Towards: #1090

This PR duplicates #1090 for mimir components

Checklist

@QuentinBisson QuentinBisson self-assigned this Mar 28, 2024
@QuentinBisson QuentinBisson requested a review from a team as a code owner March 28, 2024 12:34
Contributor

@QuantumEnigmaa QuantumEnigmaa left a comment

LGTM

description: '{{`Mimir containers are restarting too often.`}}'
expr: |
increase(
kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}[1h]
Contributor

I just ran that rule on grizzly and it would page because prometheus-buddy is restarting a lot...

Contributor Author

Then that's something we have to fix :)

Contributor

Can we scope this alert to only the pods that make requests to S3?

Contributor Author

@QuentinBisson QuentinBisson Mar 28, 2024

I took the buddy out of the equation and created a PM to investigate what is happening: https://github.com/giantswarm/giantswarm/issues/30403

Contributor Author

@QuentinBisson QuentinBisson Mar 28, 2024

Considering this is a BH-only alert (at least for now), I would prefer that we keep it broad for now, so that it still covers any new components with S3 access that we add, and reduce the scope later on. I'm open to suggestions from @hervenicol and @QuantumEnigmaa, though.
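
For readers following the thread, here is a minimal sketch of what a broad restart alert with the buddy's `prometheus` container excluded could look like. It is not the merged rule: the group name, alert name, the `> 5` threshold, and the `severity: page` label are all assumptions made for illustration. Only the metric, the `cluster_type` and `namespace` matchers, the `1h` window, the `container!="prometheus"` matcher, and the description text come from this conversation. The description is written as plain YAML here, without the Helm escaping used in the repository.

```yaml
# Hypothetical sketch, not the rule merged in this PR.
# Group name, alert name, threshold, and severity label are assumptions.
groups:
  - name: mimir-restarts-example
    rules:
      - alert: MimirRestartingTooOftenExample
        annotations:
          description: Mimir containers are restarting too often.
        # Count restarts per container over the last hour in the mimir
        # namespace, ignoring the container named "prometheus".
        expr: |
          increase(
            kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container!="prometheus"}[1h]
          ) > 5
        labels:
          severity: page
```

Keeping only a negative matcher means any new container added to the `mimir` namespace is covered automatically, which matches the "keep it broad, narrow it later" preference stated above.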

@@ -86,3 +82,28 @@ tests:
description: "Mimir ruler is failing to process PrometheusRules."
- alertname: MimirRulerEventsFailed
eval_time: 160m
- interval: 1m
input_series:
- series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}'
Contributor

Suggested change
- series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}'
- series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container!="prometheus"}'

Contributor Author

I think we need to set a container that is not prometheus 😅

Contributor Author

But then we probably need another input with container=prometheus

Contributor Author

It's fixed in a later version :)
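
The point made in this thread (a test input series needs a concrete `container` value, and a second series with `container="prometheus"` is needed to prove the exclusion works) can be sketched as a `promtool test rules` file. This is not the test that was merged. It assumes a hypothetical rule file `mimir-restarts-example.rules.yml` defining an alert `MimirRestartingTooOftenExample` with no `for` clause, a `severity: page` label, and an expression of the form `increase(kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container!="prometheus"}[1h]) > 5`. The `ingester` container name and the sample values are likewise illustrative.

```yaml
# Hypothetical sketch of a promtool unit test, not the merged test.
# File name, alert name, container names, and values are assumptions.
rule_files:
  - mimir-restarts-example.rules.yml

tests:
  - interval: 1m
    input_series:
      # A Mimir container restarting once per minute: should trigger the alert.
      - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container="ingester"}'
        values: "0+1x120"
      # The excluded container restarting just as often: should NOT trigger it.
      - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container="prometheus"}'
        values: "0+1x120"
    alert_rule_test:
      - alertname: MimirRestartingTooOftenExample
        eval_time: 90m
        # Exactly one alert is expected, for the non-excluded container only.
        exp_alerts:
          - exp_labels:
              cluster_type: management_cluster
              namespace: mimir
              container: ingester
              severity: page
            exp_annotations:
              description: Mimir containers are restarting too often.
```

Because `exp_alerts` must match the full set of firing alerts, the test would fail if the `prometheus` series also produced an alert, which is what makes the second input series useful.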

@QuentinBisson QuentinBisson force-pushed the add-cloud-protection-alert-for-mimir branch from 7752ee2 to 8632b4d Compare March 28, 2024 15:47
@QuentinBisson QuentinBisson merged commit 893f0b8 into master Mar 28, 2024
5 checks passed
@QuentinBisson QuentinBisson deleted the add-cloud-protection-alert-for-mimir branch March 28, 2024 16:00