Feature request: provide a way to acknowledge a firing alert #1860
Comments
Human responses to alerts are out of scope for the Alertmanager; this is better handled by a system such as PagerDuty. The Alertmanager is just about delivering Prometheus-generated notifications.
Is there any technical limitation that prevents auto-expiring silences from being implemented?
Auto-expiring silences are not wise and would be challenging to implement, see #1057. See also https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
Also, this would make creating silences in advance impossible as they'd be auto-deleted.
Cross posting from the above linked email thread:
This is turning Alertmanager into an incident response platform, when its purpose is to group, deduplicate, and send the notification to the user's incident response provider of choice (PagerDuty, Opsgenie, webhook, etc.). To me, acknowledging that an alert has been received and is currently being addressed makes the most sense at the end of this incident chain (Prometheus -> Alertmanager -> provider), rather than having two places to do it that could be out of sync. (from the email chain)
If they are small issues that have been deemed not worthy of paging (i.e. being routed to pagerduty), a user creating the silence and writing in the comment metadata that they're working on it, and then deleting it when finished, seems appropriate. Just because an alert has stopped firing (and in this scenario, expires its silence), doesn't mean that the situation has resolved. Auto-expiring silences could lead to duplicate work more easily than the engineer responsible creating and manually expiring the silence when the alert has been resolved.
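For what it's worth, that manual workflow can be expressed with amtool; a minimal sketch, assuming amtool is already pointed at your Alertmanager (the matcher, comment, and duration values are illustrative):

```shell
# Create a silence whose comment records who is working on the problem.
amtool silence add alertname="MyFailingJob" --comment="ACK: jdoe is working on this" --duration="4h"

# List silences to find its ID, then expire it once the work is actually done.
amtool silence query
amtool silence expire <silence-id>
```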
I think we already support that with silences, and as stated above, I think making them auto-expire would end up being problematic.
I agree that "acknowledging" is not the right thing to do. On the other hand, I do think that automatically-expiring alerts have their place and are useful.
The concern about creating silences in advance can easily be solved.
That is true, but there are many situations where, when a specific alert has stopped firing, it does mean the situation is resolved. I would not make this behavior the default.
I don't want to do things manually that a computer can do for me. Sometimes, I'm not even awake when the situation resolves – say, a job that is failing because a dependency produced garbage data. I'm re-running the dependency, and I expect that once it finishes, the failing job will recover. If that happens, and later it fails again, I want to know immediately, because something else has happened. I would also want some threshold after which I get notified anyway, even if it never recovered, but I do not want to stay up until 4am to manually expire a silence. In #1057, @brian-brazil, you said this would not work.
Could you please elaborate on how exactly this would not work, where manually expiring silences does?
That's a matter of setting an appropriate expiry time on the silence.
I can't remember offhand, but it was probably something to do with a network partition. What happens if one side deletes a silence and the other doesn't?
What would happen if I pressed "expire now" on one side of the partition?
I don't always know, by a factor of 2-4, how long this will take. I don't want to have to do a whole lot of math either, if I could rather say "whenever it's done is the right time, let me know if it's still an issue tomorrow morning".
That's only if all silences auto-expire rather than only those with a flag.
Would it be possible to have the ability to set extra annotations from Alertmanager itself that would be added to the firing alert? That way someone could add a note that persists only as long as the alert keeps firing.
We do this right now, and it sort of works. You silence something for a few hours and it typically is enough. But once in a while you miss an issue reappearing after you think you fixed it, or you set too long an expiry time and you forget to unsilence.
I'd personally set it to tomorrow morning if it could wait, rather than risking waking myself up again.
The key word being "personally". I think this feature would not prevent you from following your style, but it allows others to use a different one that maybe works better for that specific circumstance. As @prymitive said, this can lead to missing new events.
Pre-created silences would work as they do now if the auto-expiry only triggers on the N>0 -> N=0 transition of an alert group, or if the silence remembers that it has silenced at least one alert in the past.
That's additional state to manage. What happens if a Prometheus restarts, and takes long enough that the alerts resolve?
If that's the case, then you need to adjust the `resolve_timeout` anyway.
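For context, `resolve_timeout` lives in the `global` section of the Alertmanager configuration; a minimal sketch (the value shown is just the documented default, adjust to taste):

```yaml
# alertmanager.yml (excerpt)
global:
  # How long Alertmanager waits before treating an alert that has stopped
  # arriving as resolved, when the sender does not set an end time itself.
  resolve_timeout: 5m
```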
That's not relevant here, as Prometheus is what chooses the end time, which is a few evaluation intervals, so we could easily be talking less than a minute. Alerts flapping is normal, and we should be robust to it.
It doesn't happen frequently in my experience; I've never seen alerts flapping because of some miscommunication between Prometheus and Alertmanager, it's only a problem when Prometheus is down due to a bad config on restart. And if that's the case, is that really a blocker for this? It sounds like an unrelated problem.
User bug reports indicate otherwise.
It'd affect the reliability of any such solution, as when things would get unsilenced is not predictable.
If there are users who hit flapping issues then they already have a problem; reliability doesn't get any worse than it already is for them. So should that really be a blocker? There's always a corner case for everything.
I wrote a tiny daemon that keeps extending silences as long as there are alerts matching them. When an alert fires I'll silence it for 10 minutes with …
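For anyone curious what such a daemon boils down to, here is a rough sketch of the idea, assuming a single non-clustered Alertmanager exposing the v2 HTTP API at localhost:9093; the `ACK!` comment prefix, the extension window, and the poll interval are all made-up values, and the exact API fields may need adjusting:

```python
import datetime
import time

import requests

AM_URL = "http://localhost:9093/api/v2"    # assumed Alertmanager address
MAGIC_PREFIX = "ACK!"                      # only manage silences whose comment starts with this
EXTEND_BY = datetime.timedelta(minutes=10) # keep managed silences ~10 minutes ahead


def extend_active_silences():
    # Which silences are currently suppressing at least one firing alert?
    alerts = requests.get(f"{AM_URL}/alerts").json()
    silencing_ids = {sid for a in alerts for sid in a["status"]["silencedBy"]}

    for s in requests.get(f"{AM_URL}/silences").json():
        if s["status"]["state"] != "active":
            continue
        if not s["comment"].startswith(MAGIC_PREFIX):
            continue  # not one of "ours"
        if s["id"] not in silencing_ids:
            continue  # no matching alerts any more; let it run out on its own

        # Re-posting a silence with its existing id updates it in place,
        # so pushing endsAt forward keeps it alive while the alert fires.
        new_end = (datetime.datetime.now(datetime.timezone.utc) + EXTEND_BY).isoformat()
        payload = {
            "id": s["id"],
            "matchers": s["matchers"],
            "startsAt": s["startsAt"],
            "endsAt": new_end,
            "createdBy": s["createdBy"],
            "comment": s["comment"],
        }
        requests.post(f"{AM_URL}/silences", json=payload).raise_for_status()


if __name__ == "__main__":
    while True:
        extend_active_silences()
        time.sleep(60)
```

Once no alert matches a managed silence any more, the daemon simply stops extending it and the short `endsAt` does the expiring, which gives the "auto-expire on resolve" behaviour without Alertmanager itself having to track extra state.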
How do you deal with clustering? Does one Alertmanager get to decide to resolve the silence for all the Alertmanagers? In such a way that silences would go away if one Alertmanager is partitioned from one Prometheus server but not the other Alertmanager?
I don't deal with clustering at all.
That was a question about how Alertmanager could implement this, not a question about kthxbye :) The issue with not dealing with it is how to debug why a notification is un-silenced in a clustered setup.
My bad, I thought you were responding to my comment.
I have a proposal regarding this topic:
The first one allows for quicker creation of unlimited silences, which would not be resolved by time but by manual or automatic actions. The second one lets a silence automatically expire when at least one of the silenced alerts is resolved. What are your thoughts about this?
I very much like the idea of this option (although it wouldn't be perfect for all cases). We tend to use long-term silences and periodically delete them to achieve this.
When a whole team tries to take action on multiple alerts during an incident, it requires a lot of communication effort to coordinate who is dealing with which alert.
Some systems (like PagerDuty) provide a way to acknowledge an alert and assign it to a specific person (usually the on-call person), but that requires routing every alert via such a system. Also, during an incident people often try to help and volunteer to handle some of the alerts, so the usual routing of alerts might not cover that.
It would be very useful to have some ability to mark an alert as "I'm working on that". This was discussed on the mailing list and one of the proposed solutions was to support auto-expiring silences (once an alert is resolved the silence is automatically expired regardless of its `endsAt` value), which was previously suggested in #1057 but not accepted.