fix: label mismatch for alertmanager_notifications_failed_total #3599
d8744f2
to
37d0dac
Compare
Thanks for fixing it!
@@ -44,7 +44,7 @@
 (
 rate(alertmanager_notifications_failed_total{%(alertmanagerSelector)s}[5m])
 /
- rate(alertmanager_notifications_total{%(alertmanagerSelector)s}[5m])
+ on (integration) group_left rate(alertmanager_notifications_total{%(alertmanagerSelector)s}[5m])
I'd suggest ignoring the `reason` label, since it's the only label that differs between the two metrics. This slightly changes the alert condition when there are multiple reasons (e.g. the sum across all reasons might be above the threshold while each individual reason is below it). But the current threshold value is low enough that it shouldn't be an issue in practice.
- on (integration) group_left rate(alertmanager_notifications_total{%(alertmanagerSelector)s}[5m])
+ ignoring (reason) group_left rate(alertmanager_notifications_total{%(alertmanagerSelector)s}[5m])
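To make the trade-off in this suggestion concrete, here is a small arithmetic sketch in Python. The rates, the `reason` values, and the 1% threshold are all made-up illustration values, not taken from the mixin. It shows a case where no single per-reason failure ratio crosses the threshold even though the combined ratio does.

```python
# Hypothetical per-second failure rates for one integration, split by reason.
# All numbers and reason names are illustrative assumptions.
failed_by_reason = {"clientError": 0.006, "serverError": 0.007}
total = 1.0        # hypothetical rate of all notifications
threshold = 0.01   # assumed 1% threshold, for illustration only

# With `ignoring (reason) group_left`, each reason gets its own ratio.
per_reason = {reason: rate / total for reason, rate in failed_by_reason.items()}

# Summing over reasons first gives a single ratio per integration.
summed = sum(failed_by_reason.values()) / total

per_reason_fires = any(ratio > threshold for ratio in per_reason.values())
summed_fires = summed > threshold

print(per_reason_fires, summed_fires)  # prints: False True
```

With a threshold this low relative to typical failure bursts, the gap between the two conditions is expected to matter rarely, which is the point made in the comment above.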
Thanks for your suggestion. Fixed.
37d0dac
to
f5c69b9
Compare
Signed-off-by: chengzw <[email protected]>
f5c69b9
to
aff09c2
Compare
LGTM
Since Alertmanager 0.26.0, the `alertmanager_notifications_failed_total` metric carries a new `reason` label that indicates the type of error in the alert delivery.
As a result, the original alert rules are broken, because the labels no longer match between the `alertmanager_notifications_failed_total` and `alertmanager_notifications_total` metrics.
Prometheus requires samples to have exactly the same labels in order to be matched together when performing calculations (docs).
Use the `ignoring` vector matching keyword to ignore the new `reason` label, so that series with different label sets can still be matched.
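The matching behaviour described above can be modelled with a short Python sketch. This is a simplified model, not Prometheus code: `match_key` and `divide` are hypothetical helpers, the sample values and `reason` values are invented, and the model skips details such as metric-name handling and the errors Prometheus raises on ambiguous matches.

```python
def match_key(labels, ignoring=()):
    """Build the label signature used for matching, minus any ignored labels."""
    return frozenset((k, v) for k, v in labels.items() if k not in ignoring)


def divide(lhs, rhs, ignoring=()):
    """Divide each left-hand sample by the right-hand sample with the same key.

    Several left-hand samples may share one right-hand sample, which is a
    rough model of `group_left`. Unmatched left-hand samples are dropped.
    """
    rhs_index = {match_key(labels, ignoring): value for labels, value in rhs}
    result = []
    for labels, value in lhs:
        key = match_key(labels, ignoring)
        if key in rhs_index:
            result.append((labels, value / rhs_index[key]))
    return result


# Hypothetical samples: the failed metric has `reason`, the total does not.
failed = [
    ({"integration": "webhook", "reason": "clientError"}, 2.0),
    ({"integration": "webhook", "reason": "serverError"}, 1.0),
]
total = [({"integration": "webhook"}, 100.0)]

strict = divide(failed, total)                          # exact label match
relaxed = divide(failed, total, ignoring=("reason",))   # ignoring (reason)

print(len(strict))   # prints: 0
print(len(relaxed))  # prints: 2
```

With exact matching, no pair of series shares a label set, so the division yields nothing and the alert can never fire. Ignoring `reason` lets each failed series find its total, producing one ratio per reason.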