Alert when mimir components are restarting too often across all pipelines to avoid high storage cost #1093

Conversation
LGTM
```yaml
        description: '{{`Mimir containers are restarting too often.`}}'
      expr: |
        increase(
          kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}[1h]
```
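The diff cuts off mid-expression, so for context here is roughly what a complete rule of this shape looks like. This is only a sketch: the alert name, threshold, `for` duration, and labels are assumptions, not the values from this PR.

```yaml
# Sketch only: alert name, threshold, "for", and labels are illustrative assumptions.
- alert: MimirComponentRestartingTooOften
  annotations:
    description: '{{`Mimir containers are restarting too often.`}}'
  expr: |
    increase(
      kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}[1h]
    ) > 5
  for: 5m
  labels:
    severity: page
```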
I just ran this rule on grizzly, and it would page because prometheus-buddy is restarting a lot...
Then that's something we have to fix :)
Can we scope this alert to only the pods that access S3?
I took the buddy out of the equation and created a PM to investigate what is happening: https://github.com/giantswarm/giantswarm/issues/30403
Considering this is a BH-only alert (at least for now), I would prefer to keep it broad in case we add new components with S3 access, and reduce the scope later on. But I'm open to suggestions from @hervenicol and @QuantumEnigmaa.
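If we do decide to narrow the scope later, one option is to whitelist the S3-backed components in the selector. A sketch of what that could look like; the container list is an assumption, not something agreed in this thread:

```yaml
# Hypothetical scoping to containers that talk to S3 (the list is illustrative).
expr: |
  increase(
    kube_pod_container_status_restarts_total{
      cluster_type="management_cluster",
      namespace="mimir",
      container=~"ingester|store-gateway|compactor|ruler"
    }[1h]
  ) > 5
```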
```
@@ -86,3 +82,28 @@ tests:
          description: "Mimir ruler is failing to process PrometheusRules."
      - alertname: MimirRulerEventsFailed
        eval_time: 160m
  - interval: 1m
    input_series:
      - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}'
```
Suggested change:

```diff
-      - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir"}'
+      - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container!="prometheus"}'
```
I think we need to set a container that is not prometheus 😅
But then we probably need another input with container=prometheus
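For reference, `input_series` entries in promtool unit tests need concrete label values, so a negative matcher like `container!="prometheus"` cannot be used there. A minimal sketch with two concrete inputs; the container name and sample values are assumptions:

```yaml
- interval: 1m
  input_series:
    # A Mimir component whose restarts should trigger the alert (name is illustrative).
    - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container="ingester"}'
      values: '0+5x180'
    # The prometheus container that the alert is expected to ignore.
    - series: 'kube_pod_container_status_restarts_total{cluster_type="management_cluster", namespace="mimir", container="prometheus"}'
      values: '0+5x180'
```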
It's fixed in a later version :)
Force-pushed from 7752ee2 to 8632b4d.
Before adding a new alerting rule to this repository, you should consider creating SLO rules instead. SLOs help you both increase the quality of your monitoring and reduce alert noise.

Towards: #1090

This PR duplicates #1090 for the Mimir components.
Checklist
… (oncall-kaas-cloud GitHub group).