Feature request: provide a way to acknowledge a firing alert #1860
Comments
Human responses to alerts are out of scope for the Alertmanager; this is better handled by a system such as PagerDuty. The Alertmanager is just about delivering Prometheus-generated notifications.
Is there any technical limitation that prevents auto-expiring silences from being implemented?
Auto-expiring silences are not wise and would be challenging to implement, see #1057. See also https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
Also, this would make creating silences in advance impossible as they'd be auto-deleted.
Cross posting from the above linked email thread:
This is turning Alertmanager into an incident response platform, when its purpose is to group, deduplicate, and send the notification to the user's incident response provider of choice (PagerDuty, Opsgenie, webhook, etc.). To me, acknowledging that an alert has been received and is currently being addressed makes the most sense at the end of this incident chain (Prometheus -> Alertmanager -> provider), rather than having two places to do it that could be out of sync. (from the email chain)
If they are small issues that have been deemed not worthy of paging (i.e. being routed to pagerduty), a user creating the silence and writing in the comment metadata that they're working on it, and then deleting it when finished, seems appropriate. Just because an alert has stopped firing (and in this scenario, expires its silence), doesn't mean that the situation has resolved. Auto-expiring silences could lead to duplicate work more easily than the engineer responsible creating and manually expiring the silence when the alert has been resolved.
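For what it's worth, that manual workflow can be expressed with amtool; a minimal sketch, assuming amtool is already pointed at your Alertmanager (the matcher, comment, and duration values are illustrative):

```shell
# Create a silence whose comment records who is working on the problem.
amtool silence add alertname="MyFailingJob" --comment="ACK: jdoe is working on this" --duration="4h"

# List silences to find its ID, then expire it once the work is actually done.
amtool silence query
amtool silence expire <silence-id>
```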
I think we already support that with silences, and as stated above, I think making them auto-expire would end up being problematic.
I agree that "acknowledging" is not the right thing to do. On the other hand, I do think that automatically-expiring alerts have their place and are useful.
The concern about creating silences in advance can easily be solved.
That is true, but there are many situations where, when a specific alert has stopped firing, it does mean the situation is resolved. I would not make this behavior the default.
I don't want to do things manually that a computer can do for me. Sometimes, I'm not even awake when the situation resolves – say, a job that is failing because a dependency produced garbage data. I'm re-running the dependency, and I expect that once it finishes, the failing job will recover. If that happens, and later it fails again, I want to know immediately, because something else has happened. I would also want some threshold after which I get notified anyway, even if it never recovered, but I do not want to stay up until 4am to manually expire a silence. In #1057, @brian-brazil, you said this would not work.
Could you please elaborate on how exactly this would not work, where manually expiring silences does?
That's a matter of setting an appropriate expiry time on the silence.
I can't remember offhand, but it was probably something to do with a network partition. What happens if one side deletes a silence and the other doesn't?
What would happen if I pressed "expire now" on one side of the partition?
I don't always know, by a factor of 2-4, how long this will take. I don't want to have to do a whole lot of math either, if I could rather say "whenever it's done is the right time, let me know if it's still an issue tomorrow morning".
That's only if all silences auto-expire rather than only those with a flag.
Would it be possible to have the ability to set extra annotations from Alertmanager itself that would be added to the firing alert? That way someone could add a note that persists only as long as the alert keeps firing.
We do this right now, and it sort of works. You silence something for a few hours and it typically is enough. But once in a while you miss an issue reappearing after you think you fixed it, or you set too long an expiry time and you forget to unsilence.
I'd personally set it to tomorrow morning if it could wait, rather than risking waking myself up again.
The key word being "personally". I think this feature would not prevent you from following your style, but it allows others to use a different one that maybe works better for that specific circumstance. As @prymitive said, this can lead to missing new events.
Pre-created silences would work as they do now if the auto-expiry only triggers on the N>0 -> N=0 transition of an alert group, or if the silence remembers that it has silenced at least one alert in the past.
That's additional state to manage. What happens if a Prometheus restarts, and takes long enough that the alerts resolve?
If that's the case, then you need to adjust the `resolve_timeout` anyway.
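For context, `resolve_timeout` lives in the `global` section of the Alertmanager configuration; a minimal sketch (the value shown is just the documented default, adjust to taste):

```yaml
# alertmanager.yml (excerpt)
global:
  # How long Alertmanager waits before treating an alert that has stopped
  # arriving as resolved, when the sender does not set an end time itself.
  resolve_timeout: 5m
```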
That's not relevant here, as Prometheus is what chooses the end time, which is a few evaluation intervals, so we could easily be talking less than a minute. Alerts flapping is normal, and we should be robust to it.
It doesn't happen frequently in my experience; I've never seen alerts flapping because of some miscommunication between Prometheus and Alertmanager, it's only a problem when Prometheus is down due to a bad config on restart. And if that's the case, is that really a blocker for this? It sounds like an unrelated problem.
User bug reports indicate otherwise.
It'd affect the reliability of any such solution, as when things would get unsilenced is not predictable.
If there are users who hit flapping issues then they already have a problem; reliability doesn't get any worse than it already is for them. So should that really be a blocker? There's always a corner case for everything.
I wrote a tiny daemon that keeps extending silences as long as there are alerts matching them. When an alert fires I'll silence it for 10 minutes with …
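For anyone curious what such a daemon boils down to, here is a rough sketch of the idea, assuming a single non-clustered Alertmanager exposing the v2 HTTP API at localhost:9093; the `ACK!` comment prefix, the extension window, and the poll interval are all made-up values, and the exact API fields may need adjusting:

```python
import datetime
import time

import requests

AM_URL = "http://localhost:9093/api/v2"    # assumed Alertmanager address
MAGIC_PREFIX = "ACK!"                      # only manage silences whose comment starts with this
EXTEND_BY = datetime.timedelta(minutes=10) # keep managed silences ~10 minutes ahead


def extend_active_silences():
    # Which silences are currently suppressing at least one firing alert?
    alerts = requests.get(f"{AM_URL}/alerts").json()
    silencing_ids = {sid for a in alerts for sid in a["status"]["silencedBy"]}

    for s in requests.get(f"{AM_URL}/silences").json():
        if s["status"]["state"] != "active":
            continue
        if not s["comment"].startswith(MAGIC_PREFIX):
            continue  # not one of "ours"
        if s["id"] not in silencing_ids:
            continue  # no matching alerts any more; let it run out on its own

        # Re-posting a silence with its existing id updates it in place,
        # so pushing endsAt forward keeps it alive while the alert fires.
        new_end = (datetime.datetime.now(datetime.timezone.utc) + EXTEND_BY).isoformat()
        payload = {
            "id": s["id"],
            "matchers": s["matchers"],
            "startsAt": s["startsAt"],
            "endsAt": new_end,
            "createdBy": s["createdBy"],
            "comment": s["comment"],
        }
        requests.post(f"{AM_URL}/silences", json=payload).raise_for_status()


if __name__ == "__main__":
    while True:
        extend_active_silences()
        time.sleep(60)
```

Once no alert matches a managed silence any more, the daemon simply stops extending it and the short `endsAt` does the expiring, which gives the "auto-expire on resolve" behaviour without Alertmanager itself having to track extra state.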
How do you deal with clustering? Does one Alertmanager get to decide to resolve the silence for all the Alertmanagers? In such a way that silences would go away if one Alertmanager is partitioned from one Prometheus server but not the other Alertmanager?
I don't deal with clustering at all.
That was a question about how Alertmanager could implement this, not a question about kthxbye :) The issue with not dealing with it is how to debug why a notification is un-silenced in a clustered setup.
My bad, I thought you were responding to my comment.
I have a proposal regarding this topic:
The first one allows for quicker creation of unlimited silences, which would not be resolved by time but by manual or automatic actions. The second one lets a silence automatically expire when at least one of the silenced alerts is resolved. What are your thoughts about this?
I very much like the idea of this option (although it wouldn't be perfect for all cases). We tend to use long-term silences and periodically delete them to achieve this.
When a whole team tries to take action on multiple alerts during an incident, it requires a lot of communication effort to coordinate who is dealing with which alert.
Some systems (like PagerDuty) provide a way to acknowledge an alert and assign it to a specific person (usually the on-call person), but that requires routing every alert via such a system. Also, during an incident people often try to help and volunteer to handle some of the alerts, so the usual routing of alerts might not cover that.
It would be very useful to have some ability to mark an alert as "I'm working on that". This was discussed on the mailing list and one of the proposed solutions was to support auto-expiring silences (once an alert is resolved the silence is automatically expired regardless of its `endsAt` value), which was previously suggested in #1057 but not accepted.