Alertmanager sending duplicate notifications after 'resolved' notification when running with multiple replicas #4008
Comments
Hi! 👋 There are a number of situations where this can happen. Please share debug-level logs for both Alertmanager servers from the time this happened, and we can help work out what caused it. |
Hey @grobinson-grafana!!
So, the above logs are from the moment a fresh alert was fired. 30 minutes later, this alert was resolved. BUT, at the very moment we receive the 'resolved' message, it throws another 'firing' (which should not happen because data ingestion has been stopped totally). Now, this duplicate/extra 'fired' alert gets resolved instantaneously too. Overall, there's one duplicate/extra pair of 'fired'/'resolved' alerts. |
Hi! 👋 What's happening here is an unfortunate side effect of how high availability works, and I'm afraid I don't think there is anything you can do about it. The sequence of firing, resolved, firing, resolved notifications can occur in rare cases when Prometheus sends the resolved alert to Alertmanager around the same time as the next flush. When this happens, some Alertmanager replicas can see the alert as resolved while others can still see it as active. This is what happened here. We can see that Alertmanager 3 received the resolved alert from Prometheus. At the same time, it flushed the alert, as it had been 5 minutes since the last flush (your
However, if we look at Alertmanager 2, we can see that it flushed the alert 26ms before it received the resolved alert from Prometheus. Alertmanager 2 still thinks the alert is firing:
What happens next is Alertmanager 3 sends the resolved notification, and gossips that it has sent a resolved notification to Alertmanagers 1 and 2:
Alertmanager 2 waits 15 seconds (
Alertmanager 1 waits 30 seconds (
Alertmanager 1 then sends the resolved notification, and gossips to Alertmanagers 2 and 3 that a resolved notification was sent:
The next time Alertmanager 2 flushes it will see that the alert is resolved. And since the last notification sent was a resolved notification, it will do nothing. I hope this helps! |
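The sequence described above can be replayed with a toy model. This is a minimal sketch and not Alertmanager's actual code: it assumes each replica compares its own view of the alert at flush time with the last notification recorded in the shared (gossiped) notification log, and that replicas act in position order, each waiting position × 15 seconds, which matches the 15 and 30 second waits mentioned above. The names `Replica`, `replay_flush`, and `PEER_WAIT_SECONDS` are hypothetical.

```python
from dataclasses import dataclass

# Assumed per-position wait, chosen to match the 15s and 30s waits in the thread.
PEER_WAIT_SECONDS = 15


@dataclass
class Replica:
    name: str
    position: int  # position 0 acts first
    view: str      # alert state this replica saw when it flushed


def replay_flush(replicas, last_sent):
    """Return the notifications sent during one flush, in order.

    Toy rule: a replica notifies only if its own view differs from the
    last notification recorded in the shared log when its wait ends.
    """
    sent = []
    for replica in sorted(replicas, key=lambda r: r.position):
        wait = replica.position * PEER_WAIT_SECONDS
        if replica.view != last_sent:
            sent.append((wait, replica.name, replica.view))
            last_sent = replica.view  # gossiped to the other replicas
    return sent


replicas = [
    Replica("Alertmanager 3", position=0, view="resolved"),
    # Flushed just before the resolved alert arrived, so it still sees "firing".
    Replica("Alertmanager 2", position=1, view="firing"),
    Replica("Alertmanager 1", position=2, view="resolved"),
]

for wait, name, state in replay_flush(replicas, last_sent="firing"):
    print(f"t+{wait:>2}s {name} sends {state}")
```

In this model the original firing notification is followed by resolved, firing, resolved, which is the extra pair reported in this issue. If all three replicas had seen the alert as resolved at flush time, only the first replica would have sent anything.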
Thanks @grobinson-grafana for the helpful explanation of the root cause. |
It might help. You'll need to test it to find out, I'm afraid. Remember though, 5m and 7m align every 35 minutes (at minutes 35, 70, 105, and so on), so there will still be overlap between evaluations and flushes. Another option is to disable resolved notifications, as sometimes resolved notifications can create a lot of noise and even flap (as is the case here). However, this also depends on how critical resolved notifications are for your monitoring. |
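To check when two periodic schedules coincide, it is enough to compute their least common multiple. The sketch below assumes both schedules start at minute 0 and run without jitter, which real evaluations and flushes will not do exactly. The function name `shared_minutes` is hypothetical.

```python
from math import gcd


def shared_minutes(interval_a_m, interval_b_m, horizon_m):
    """List the minutes, up to horizon_m, at which both schedules fire together.

    Assumes both schedules start at minute 0 and never drift.
    """
    period = interval_a_m * interval_b_m // gcd(interval_a_m, interval_b_m)
    return list(range(period, horizon_m + 1, period))


# 5m and 7m schedules over the first 140 minutes.
print(shared_minutes(5, 7, 140))
```

Coprime intervals push the overlaps further apart than intervals that share a factor, but they cannot remove them, so this only reduces how often the race can occur.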
Hey @grobinson-grafana!! I am trying to debug the logs where 2 pairs (fired and resolved) of duplicates were notified. I have a few doubts:
I just want to pinpoint for myself where exactly in the logs the race between the flush and the alert being received occurred. |
Neither, after the notification is sent.
The peers are arranged into positions, such as
Starts when the flush happens. |
Hey @grobinson-grafana!! I was debugging a fresh set of logs (attached) to pinpoint the race condition. I observed that there is more going on than just a race between the flushing of alerts and a new alert being received. I have tried to capture the flow of events in each Alertmanager instance as per the new logs. Here, at timestamp This is not just the case with the new logs but also with the older ones (the ones attached in an older comment). At timestamp
Isn't this a bigger issue than just a rare race condition? Or am I missing something here? |
Hi! 👋 Yes, missing |
Hi @grobinson-grafana!!
|
Hi! 👋
|
Hey @grobinson-grafana!! We changed the VM Alert config and made sure that each replica of Alertmanager gets the alert from upstream.
Now, subsequent flushes for AM1 are at
AM3 logs:
AM2 logs:
AM1 logs:
Do you suspect that something wrong with the cluster is causing this?
|
|
Sure, I'll check the mentioned issue.
Thanks a lot for clarification @grobinson-grafana! I misunderstood the So, I believe, we can conclude that the root cause is a race condition where flushing, in one or more instances of the AM, occurs just milliseconds before they receive a 'resolved' alert, causing a disagreement among the peers of the cluster and consequently duplicate notifications. Also, unfortunately, we have no fix for this. |
Hey @grobinson-grafana!
|
Hi there, just checking in to see if there are any updates, fixes or workarounds regarding this issue. :) |
What did you do?
I have VM (VictoriaMetrics) Alert running with Alertmanager, with 2 replicas each. I am ingesting the metric disk_usage with a value above the threshold. As soon as I receive an email from Alertmanager (which is after 20 minutes in my case, see the files below), I stop the data ingestion, which stops the alert breach as well.
What did you expect to see?
Alertmanager should send one 'fired' notification and then one 'resolved' notification after I stop data ingestion, because the breach has stopped.
What did you see instead? Under which circumstances?
Alertmanager sends a 'fired' notification, but when the ingestion/breach is stopped, it sends a 'resolved' notification along with another 'fired' notification. The 'resolved' email for this duplicate 'fired' email comes either instantaneously or in the next group_interval. Sometimes I also see an arbitrary number of duplicate 'fired'/'resolved' email pairs even after the breach has stopped.
This unexpected behaviour is seen only when Alertmanager is running with multiple (>=2) replicas.
Environment
v.0.27.0