-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve mimir compactor failures detection #1425
Conversation
e0567a4
to
572d413
Compare
@@ -158,8 +158,7 @@ spec: | |||
dashboard: 09a5c49e9cdb2f2b24c6d184574a07fd/mimir-compactor-resources | |||
description: 'Mimir compactor has been failing its compactions for 2 hours.' | |||
opsrecipe: mimir#mimircompactorfailedcompaction | |||
# Query is based on the following upstream mixin alerting rule : https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/alerts.yaml#L858 | |||
expr: sum(increase(cortex_compactor_runs_failed_total{reason!="shutdown"}[2h])) by (cluster_id, installation, namespace, pipeline, provider) > 2 | |||
expr: min by (cluster_id, installation, namespace, provider, pipeline) (time() - (cortex_compactor_last_successful_run_timestamp_seconds)) > 60 * 60 * 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why you removed the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's not based on the upstream mixin alerting rule anymore. So the comment does not make sense anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, i'm tired 😅
Anyway, as you say, the alert does not fire if the value is 0 and that's why they need to have the since startup
alert in both Loki and Mimir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh, I finally remember this 0 condition and why the unit tests work! Because time()
starts at 0 for the tests! 🤦
I'll fix it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes that time function is tricky
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Towards: https://github.com/giantswarm/giantswarm/issues/32027
This PR improves Mimir compaction issues detection. See main issue for details.
Checklist
oncall-kaas-cloud
GitHub group).