Feature request: provide a way to acknowledge a firing alert #1860

Open
prymitive opened this issue Apr 26, 2019 · 27 comments

Comments

@prymitive
Contributor

When a whole team tries to act on multiple alerts during an incident, it takes a lot of communication effort to coordinate who is dealing with which alert.
Some systems (like PD) provide a way to acknowledge an alert and assign it to a specific person (usually the on-call person), but that requires routing every alert through such a system. Also, during an incident people often volunteer to handle some of the alerts, so the usual routing of alerts might not cover that.
It would be very useful to have some ability to mark an alert as "I'm working on that". This was discussed on the mailing list and one of the proposed solutions was to support auto-expiring silences (once alert is resolved silence is automatically expired regardless of endsAt value), which was previously suggested on #1057 but not accepted.

@brian-brazil
Contributor

Human responses to alerts are out of scope for the Alertmanager; this is better handled by a system such as PagerDuty. The Alertmanager is just about delivering Prometheus-generated notifications.

@prymitive
Contributor Author

Is there any technical limitation that prevents auto-expiring silences from being implemented?
I think that auto-expiring silences are useful as a standalone feature and would be a good enough solution to the acknowledgement problem. Are those also out of scope, or just the acknowledgement?

@brian-brazil
Contributor

Auto-expiring silences are not wise and would be challenging to implement, see #1057.

See also https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

@brian-brazil
Contributor

Also, this would make creating silences in advance impossible as they'd be auto-deleted.

@stuartnelson3
Contributor

Cross posting from the above linked email thread:

The only problem with using silences as a replacement is that if you already have plenty of silences (for broken hardware or other issues that take time to resolve) it becomes tricky to find all alerts that got acked but still require an action.
Silences that expire when alert is resolved sound very useful.

This is turning Alertmanager into an incident response platform, when its purpose is to group, deduplicate, and send the notification to the user's incident response provider of choice (PagerDuty, Opsgenie, webhook, etc). To me, acknowledging that an alert has been received and is currently being addressed makes the most sense at the end of this incident chain (prometheus->alertmanager->provider), rather than having two places to do it that could be out of sync.

(from the email chain)

Sometimes there's a flood of small issues and it's hard to tell who's fixing what just by looking at alerts.

If they are small issues that have been deemed not worthy of paging (i.e. being routed to pagerduty), a user creating the silence and writing in the comment metadata that they're working on it, and then deleting it when finished, seems appropriate. Just because an alert has stopped firing (and in this scenario, expires its silence), doesn't mean that the situation has resolved. Auto-expiring silences could lead to duplicate work more easily than the engineer responsible creating and manually expiring the silence when the alert has been resolved.

It would be very useful to have some ability to mark an alert as "I'm working on that". This was discussed on the mailing list and one of the proposed solutions was to support auto-expiring silences (once alert is resolved silence is automatically expired regardless of endsAt value), which was previously suggested on #1057 but not accepted.

I think we already support that with silences, and as stated above, I think making them auto-expire would end up being problematic.
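
The silence-as-acknowledgement workflow described above can be sketched as follows. This is a minimal illustration, assuming the Alertmanager v2 HTTP API; the helper name `build_silence`, the alert labels, and the comment text are hypothetical, and the sketch only builds the request body rather than sending it.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, created_by, comment, now, duration):
    """Return a JSON-serialisable silence body (hypothetical helper)."""

    def rfc3339(ts):
        # Alertmanager expects RFC 3339 timestamps; normalise to UTC.
        return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in sorted(matchers.items())
        ],
        "startsAt": rfc3339(now),
        "endsAt": rfc3339(now + duration),
        "createdBy": created_by,
        # The comment carries the "I'm working on this" note.
        "comment": comment,
    }


now = datetime(2019, 4, 26, 10, 0, tzinfo=timezone.utc)
silence = build_silence(
    {"alertname": "DiskFull", "instance": "db1:9100"},  # hypothetical labels
    created_by="alice",
    comment="Working on this, cleaning up old files",
    now=now,
    duration=timedelta(hours=2),
)
print(json.dumps(silence, indent=2))
```

A body like this would be sent with `POST /api/v2/silences`; the response contains a `silenceID`, and `DELETE /api/v2/silence/{silenceID}` expires the silence manually once the work is done.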

@matthiasr

I agree that "acknowledging" is not the right thing to do. On the other hand, I do think that automatically expiring silences have their place and are useful.

this would make creating silences in advance impossible

this can easily be solved.

Just because an alert has stopped firing, doesn't mean that the situation has resolved

That is true, but there are many situations where a specific alert no longer firing does mean the situation has resolved. I would not make this behavior the default.

manually expiring the silence when the alert has been resolved

I don't want to do things manually that a computer can do for me. Sometimes, I'm not even awake when the situation resolves – say, a job that is failing because a dependency produced garbage data. I'm re-running the dependency, and I expect that once it finishes, the failing job will recover. If that happens, and later it fails again, I want to know immediately, because something else has happened. I would also want some threshold when I get notified anyway, even if it never recovered, but I do not want to stay up until 4am to manually expire a silence.

In #1057 @brian-brazil you said

[this] wouldn't work with AM clustering

Could you please elaborate on how exactly this would not work, where manually expiring silences does?

@brian-brazil
Contributor

I would also want some threshold when I get notified anyway, even if it never recovered, but I do not want to stay up until 4am to manually expire a silence.

That's a matter of setting an appropriate expiry time on the silence.

Could you please elaborate on how exactly this would not work, where manually expiring silences does?

I can't remember offhand, but it was probably something to do with a network partition. What happens if one side deletes a silence and the other doesn't?

@matthiasr

What would happen if I pressed "expire now" on one side of the partition?

@matthiasr

a matter of setting an appropriate expiry time

I don't always know, by a factor of 2-4, how long this will take. I don't want to have to do a whole lot of math either, if I could rather say "whenever it's done is the right time, let me know if it's still an issue tomorrow morning".

@prymitive
Contributor Author

Also, this would make creating silences in advance impossible as they'd be auto-deleted.

That's only if all silences auto expire rather than only those with a flag autoExpire: true.

Would it be possible to have ability to set extra annotations from Alertmanager itself that would be added to the firing alert? That way someone could add a note that persists only as long as the alert keeps firing.

@prymitive
Contributor Author

prymitive commented Apr 26, 2019

a matter of setting an appropriate expiry time

We do this right now, and it sorta works. You silence something for a few hours and that is typically enough. But once in a while you miss an issue reappearing after you think you fixed it, or you set too long an expiry time and forget to unsilence.

@brian-brazil
Contributor

let me know if it's still an issue tomorrow morning".

I'd personally set it to tomorrow morning if it could wait, rather than risking waking myself up again.
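
The fixed-deadline approach ("silence it until tomorrow morning") amounts to computing an `endsAt` value up front. A minimal sketch, assuming 09:00 counts as "morning"; the helper name `next_morning` is hypothetical:

```python
from datetime import datetime, time, timedelta, timezone


def next_morning(now, hour=9):
    """Return the next occurrence of hour:00 strictly after `now`."""
    candidate = datetime.combine(now.date(), time(hour), tzinfo=now.tzinfo)
    if candidate <= now:
        # That time has already passed today, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


now = datetime(2019, 4, 26, 23, 30, tzinfo=timezone.utc)
ends_at = next_morning(now)
print(ends_at.isoformat())
```

The result can be used as the silence's `endsAt`. It does not cover the case raised in the thread where the alert resolves early and then fires again before the deadline.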

@matthiasr

the key word being "personally". I think this feature would not prevent you from following your style, but it allows others to use a different one that maybe works better for that specific circumstance.

As @prymitive said, this can lead to missing new events.

@matthiasr

Pre-creating silences would work as it does now if the auto-expiry only triggers on the N>0 -> N=0 transition of an alert group, or remembers that it has silenced at least one alert in the past.
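
The transition rule above, combined with the opt-in flag suggested earlier in the thread, could look roughly like this. It is only a sketch of the proposal: the class and all of its fields are hypothetical, and nothing like it exists in Alertmanager.

```python
from dataclasses import dataclass


@dataclass
class AutoExpiringSilence:
    """Hypothetical silence state for the proposed behaviour."""

    auto_expire: bool = False  # opt-in flag; plain silences are unaffected
    has_matched: bool = False  # remembers that at least one alert was silenced
    expired: bool = False

    def observe(self, matching_alert_count: int) -> None:
        """Update state given how many firing alerts currently match."""
        if self.expired or not self.auto_expire:
            return
        if matching_alert_count > 0:
            self.has_matched = True
        elif self.has_matched:
            # N>0 -> N=0 transition: every silenced alert has resolved.
            self.expired = True


pre_created = AutoExpiringSilence(auto_expire=True)
pre_created.observe(0)  # created in advance, nothing is firing yet
print(pre_created.expired)  # False: it has not silenced anything so far
pre_created.observe(3)  # alerts start firing and are silenced
pre_created.observe(0)  # all of them resolve
print(pre_created.expired)  # True
```

Note that `has_matched` is extra state: a real implementation would have to persist it and keep it consistent across cluster members.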

@brian-brazil
Contributor

That's additional state to manage. What happens if a Prometheus restarts, and takes long enough that the alerts resolve?

@matthiasr

matthiasr commented Apr 26, 2019 via email

@brian-brazil
Contributor

If that's the case, then you need to adjust the resolve_timeout anyway.

That's not relevant here as Prometheus is what chooses the end time, which is a few evaluation intervals so we could be easily talking less than a minute. Alerts flapping is normal, and we should be robust to it.

@prymitive
Contributor Author

Alerts flapping is normal

Doesn't happen frequently in my experience, as in I've never seen alerts that are flapping because of some miscommunication between Prometheus and Alertmanager; it's only a problem when Prometheus is down due to a restart with a bad config. And if that's the case then is that really a blocker for this as it sounds like an unrelated problem (?).

@brian-brazil
Contributor

I've never seen alerts that are flapping because of some miscommunication between Prometheus and Alertmanager

User bug reports indicate otherwise.

And if that's the case then is that really a blocker for this as it sounds like an unrelated problem

It'd affect the reliability of any such solution, as when things would get unsilenced is not predictable.

@prymitive
Contributor Author

If there are users who hit flapping issues then they already have a problem; reliability doesn't get any worse than it already is for them. So should that really be a blocker? There's always a corner case for everything.

@simonpasquier changed the title from "Feature request: provide a way to acknowlage a firing alert" to "Feature request: provide a way to acknowledge a firing alert" on May 24, 2019

@prymitive
Contributor Author

I wrote a tiny daemon that keeps extending silences as long as there are alerts matching them.
This gives me pretty much what I want from acknowledgements.

When an alert fires I silence it for 10 minutes with the comment "ACK! working on this". The daemon then keeps checking all silences whose comment starts with ACK!: if a silence still matches alerts and would expire soon, the daemon extends it by 15 minutes; if it no longer matches any alerts, the daemon lets it expire.

https://github.com/prymitive/kthxbye
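
The extend-or-let-expire rule described in this comment can be sketched as a single decision function. This is not kthxbye's actual code: the function name, the 5-minute "expires soon" threshold, and the choice to extend from the current `endsAt` rather than from "now" are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

ACK_PREFIX = "ACK!"
EXPIRES_SOON = timedelta(minutes=5)  # assumed threshold, not from the thread
EXTEND_BY = timedelta(minutes=15)


def new_ends_at(comment, ends_at, matching_alerts, now):
    """Return the extended endsAt, or None to leave the silence untouched."""
    if not comment.startswith(ACK_PREFIX):
        return None  # not an acknowledgement silence
    if matching_alerts == 0:
        return None  # nothing is firing any more: let it expire
    if ends_at - now > EXPIRES_SOON:
        return None  # not close enough to expiry yet
    return ends_at + EXTEND_BY


now = datetime(2019, 5, 24, 12, 0, tzinfo=timezone.utc)
print(new_ends_at("ACK! working on this", now + timedelta(minutes=3), 2, now))
print(new_ends_at("ACK! working on this", now + timedelta(minutes=3), 0, now))
```

A daemon built around this would run the check periodically against the silences and alerts returned by the Alertmanager API and update any silence for which a new end time is returned.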

@roidelapluie
Member

How do you deal with clustering? Does one Alertmanager get to decide to resolve the silence for all the Alertmanagers? In such a way that silences would go away if one Alertmanager is partitioned from one of the Prometheus servers but the other AM is not?

@prymitive
Contributor Author

I don't deal with clustering at all.
All I need is an Alertmanager API URL; whether that's a single instance or a cluster doesn't really matter.
If it's a cluster and it's in a split-brain state then you'll have all the problems of a split brain across your entire stack, for every component that uses Alertmanager; kthxbye isn't in any way special here.

@roidelapluie
Member

That was a question about how alertmanager could implement this, not a question about kthxbye :)

The issue with not dealing with it is how to debug why a notification is un-silenced in a clustered setup.

@prymitive
Contributor Author

My bad, I thought you were responding to my comment.

@margau

margau commented Oct 15, 2021

I have a proposal regarding this topic:

  • Add an "Infinite Duration" Switch to the "New Silence" interface
  • Add an "Expire on resolve" option to silences

The first one allows quicker creation of unlimited silences, which are ended not by time but by manual or automatic actions.

The second one lets a silence expire automatically when at least one of the silenced alerts is resolved.
This could provide the "working on it" silence functionality, while also making sure that a recurring alert is caught, because the silence expired when the alert resolved the first time.
Obviously this does not work for flapping alerts, but flapping instances/values could be handled in the alert rule definition in a way that the alert itself is not flapping.

What are the thoughts about this?

@danpoltawski

Add an "Expire on resolve" option to silences

Very much like the idea of this option (although it wouldn't be perfect for all cases). We tend to use long-term silences and periodically delete them to do this.
