Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve mimir compactor failures detection #1425

Merged
merged 2 commits into from
Nov 18, 2024

Conversation

hervenicol
Copy link
Contributor

Towards: https://github.com/giantswarm/giantswarm/issues/32027

This PR improves Mimir compaction issues detection. See main issue for details.

Checklist

@hervenicol hervenicol self-assigned this Nov 15, 2024
@hervenicol hervenicol requested a review from a team as a code owner November 15, 2024 16:51
@hervenicol hervenicol force-pushed the mimir-compactor-improvement branch from e0567a4 to 572d413 Compare November 15, 2024 17:47
@@ -158,8 +158,7 @@ spec:
dashboard: 09a5c49e9cdb2f2b24c6d184574a07fd/mimir-compactor-resources
description: 'Mimir compactor has been failing its compactions for 2 hours.'
opsrecipe: mimir#mimircompactorfailedcompaction
# Query is based on the following upstream mixin alerting rule : https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/alerts.yaml#L858
expr: sum(increase(cortex_compactor_runs_failed_total{reason!="shutdown"}[2h])) by (cluster_id, installation, namespace, pipeline, provider) > 2
expr: min by (cluster_id, installation, namespace, provider, pipeline) (time() - (cortex_compactor_last_successful_run_timestamp_seconds)) > 60 * 60 * 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why you removed the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's not based on the upstream mixin alerting rule anymore. So the comment does not make sense anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, i'm tired 😅
Anyway, as you say, the alert does not fire if the value is 0 and that's why they need to have the since startup alert in both Loki and Mimir

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, I finally remember this 0 condition and why the unit tests work! Because time() starts at 0 for the tests! 🤦
I'll fix it then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed 😁

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes that time function is tricky

Copy link
Contributor

@QuentinBisson QuentinBisson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hervenicol hervenicol merged commit 4dec85e into main Nov 18, 2024
7 checks passed
@hervenicol hervenicol deleted the mimir-compactor-improvement branch November 18, 2024 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants