Race condition with inhibited rules #2229
Comments
That's correct: alerts aren't replicated between Alertmanager instances and this is by design. The replication of notification logs is what you're seeing in the logs. Are you sure that your Prometheus is sending alerts to both Alertmanager pods?
Hi @simonpasquier, sorry for the delay.
This is my Prometheus configuration:
global:
scrape_interval: 1m
scrape_timeout: 10s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- prometheus-alertmanager-0.prometheus-alertmanager-headless:9093
- prometheus-alertmanager-1.prometheus-alertmanager-headless:9093
- prometheus-alertmanager-headless:9093
scheme: http
timeout: 10s
api_version: v1
rule_files:
- /etc/config/rules
- /etc/config/alerts
This is the prometheus-alertmanager-headless service:
apiVersion: v1
kind: Service
metadata:
labels:
app: prometheus
component: alertmanager
name: prometheus-alertmanager-headless
spec:
clusterIP: None
ports:
- name: http
port: 80
protocol: TCP
targetPort: 9093
- name: meshpeer
port: 6783
protocol: TCP
targetPort: 6783
selector:
app: prometheus
component: alertmanager
sessionAffinity: None
type: ClusterIP
I don't think I understand this. Are you referring to this line?
What puzzles me is that I see this line only for the HeartBeat alert, but not for the other one.
I'm not sure why you've configured
I'm also not sure about that, since I did not build it. If you think that is causing the issue, we might try to change it. I can assure you that the alerts are being received, since part of the problem is that they arrive before the inhibition alert. Regards.
Somehow I read your initial report too quickly and missed the fact that the "issue" happens only when an Alertmanager instance was recreated... What most probably happens is that Prometheus evaluates the inhibited alert (a) before the inhibiting alert (b), and the new Alertmanager receives (a) before (b). As I said earlier, alerts aren't replicated across Alertmanagers, only notification logs, which aren't at play for inhibitions. Are alerts (a) and (b) declared in different rule groups?
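A minimal sketch of what "declared in the same rule group" would look like (the alert names and expressions are illustrative; only the scheduler/schedulername labels come from the inhibit rule shown later in this thread): Prometheus evaluates all rules of a group in the same cycle, so the inhibiting alert is sent to Alertmanager together with, or before, the alerts it should inhibit.
groups:
  - name: business-hours-inhibition
    rules:
      # Hypothetical inhibiting alert (b): fires between 16:00 and 18:00 UTC.
      - alert: BusinessHoursWindow
        expr: hour() >= 16 and hour() < 18
        labels:
          scheduler: DONOTFIRE
      # Hypothetical inhibited alert (a): carries the label matched by the
      # target side of the inhibit rule, so it stays silent while (b) fires.
      - alert: InstanceUnreachable
        expr: up == 0
        for: 5m
        labels:
          schedulername: DONOTFIRE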
We have various groups configured:
route:
receiver: default-receiver
group_by:
- cluster
- service
- deployment
- replicaset
- alertname
- objectid
- alertid
- resourceid
routes:
- receiver: blackhole
match:
severity: blackhole
- receiver: blackhole
(...)
group_wait: 2m
group_interval: 5m
repeat_interval: 1w
And the alerts have different alertname, alertid and resourceid. Is this what you meant by different rule groups? Is there a way to "wait for all alerts to arrive before starting to fire"? The (a) and (b) alerts both arrive at some point; it is the order in which they arrive that is difficult to control. Thanks for all the help!
You can increase the group_wait.
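For context, a rough sketch of that suggestion (the 5m value is only an example, not a recommendation from this comment): group_wait delays the first notification for a newly created alert group, which gives an inhibiting alert that is still on its way time to reach a freshly restarted instance before anything is sent.
route:
  receiver: default-receiver
  # The first notification for a new group is held back for group_wait.
  # If the inhibiting alert arrives within this window, the inhibited
  # alert is muted instead of notified.
  group_wait: 5m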
Hi, thanks @simonpasquier, let me try that and get back to you. Thanks!
Hi, I tried what you said, but it didn't work. I think something I'm doing might be wrong, since the documentation states:
I set mine to 5m:
route:
receiver: default-receiver
group_by:
- cluster
- service
- deployment
- replicaset
- alertname
- objectid
- alertid
- resourceid
routes:
- receiver: blackhole
match:
severity: blackhole
- receiver: blackhole
(...)
group_wait: 5m
group_interval: 5m
repeat_interval: 1w
But the Alertmanager doesn't wait 5 minutes to send alerts:
Correct me if I'm wrong, but I thought the Alertmanager should wait the group_wait period before sending the first notification for a group. Thanks a lot.
There's something off with your configuration. I'd check the configuration from the Status page in the UI and make sure that the group_wait value is set where you expect it.
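In case it helps, a sketch of how route inheritance works (receiver names reused from this thread purely for illustration): group_wait set on the top-level route is inherited by every child route that does not override it, and the settings that apply to an alert come from the deepest matching route.
route:
  receiver: default-receiver
  group_wait: 5m            # inherited by the child routes below
  routes:
    - match:
        severity: blackhole
      receiver: blackhole    # inherits group_wait: 5m
    - match:
        app: opsgenie
      receiver: opsgenie
      group_wait: 30s        # an explicit value here overrides the inherited 5m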
Hello, thanks for the feedback. I was not sure about what you meant, so I ended up setting a group_wait for each of them (this is from the status page):
routes:
- receiver: blackhole
match:
severity: blackhole
- receiver: blackhole
match_re:
support: disable
- receiver: default-receiver-test
match:
severity: test
- receiver: opsgenie
match:
app: opsgenie
group_wait: 5m
group_interval: 5m
repeat_interval: 1m
group_wait: 5m
group_interval: 5m
repeat_interval: 1w
The alert we receive (and shouldn't receive) goes through the opsgenie receiver. As you can see, both the group_wait and group_interval are set to 5m. These are the logs for the latest test. I think I noticed something else on this occasion: we received two alerts for the same alertid "DONOTFIRE". Those alerts are for instances that become unreachable, and in this case both of them were down.
I don't understand what could be happening; we can confirm that in normal circumstances this works. While both Alertmanagers are running, alerts are inhibited correctly, but when one of them goes down, it does this. It is also weird to see, since the inhibition alert now arrives before the alert that we don't want, yet it still doesn't work. Could you take a look at this? Maybe with this info you can figure something out. Regards.
Hi, sorry for the long post, here is an update. I messed up the routes, thinking this alert went to the opsgenie receiver, and I was wrong: this alert was going to the "test" one. My apologies. I changed the receiver and it is working now with the 5-minute group_wait. Tomorrow I will apply this to the other Alertmanagers we have. I think this can be closed now.
@AugerC no problem! thanks for the heads-up :)
Hi, I'm sorry to reopen the issue; I think the group_wait is still not being respected. This is my alertmanager.yaml:
receivers:
- name: default-receiver
<opsgenie configs>
- name: default-receiver-test
<opsgenie configs>
- name: opsgenie
webhook_configs:
<opsgenie heartbeat>
- name: blackhole
route:
group_wait: 5m
group_interval: 5m
receiver: default-receiver
repeat_interval: 168h
group_by: ['cluster', 'service', 'deployment', 'replicaset', 'alertname', 'objectid', 'alertid', 'resourceid']
routes:
- match:
severity: blackhole
receiver: blackhole
continue: false
- match_re:
support: disable
receiver: blackhole
continue: false
- match:
severity: test
receiver: default-receiver-test
group_interval: 5m
group_wait: 5m
continue: false
- match:
app: opsgenie
receiver: opsgenie
group_interval: 1m
repeat_interval: 1m
continue: false
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
- source_match:
scheduler: DONOTFIRE
target_match:
      schedulername: DONOTFIRE
The Alertmanager flags:
The service prometheus-alertmanager-headless:
When I kill the Alertmanagers, the alerts that are constantly firing in Prometheus do not respect the 5m period.
Startup log (alertmanager-0):
I don't understand what is happening in the log trace; I thought that if I set group_wait: 5m, every group that is made should wait 5m to send its alerts to the receiver. Am I wrong? This time I can confirm that the alerts we receive are going through the route with the 5m group_wait, since that receiver goes to a special team on Opsgenie. Let me know if you need more information. Sorry again for the trouble. Regards.
Can you share the full logs?
Hi, what do you mean? Those are the full logs for the alertmanager-0 pod. I omitted hardly any lines; I just changed the names of things (like the alert name). Do you need the logs for the other Alertmanager? They were both starting up at the same time, but I think alertmanager-0 is the one sending the alerts to Opsgenie. Thanks a lot.
Hi, Just to update you, we tested upgrading everything to the latest version:
The problem persists. In this example I deleted just alertmanager-0. Below are the logs from the restarted Alertmanager:
I can confirm that inhibition was in fact working (I saw the inhibited alerts on the Alertmanager web panel), but the 5-minute group_wait does not work (it seems that the Alertmanager does not wait 5m to send alerts). What I noticed this time is that the other Alertmanager (alertmanager-1) doesn't seem to rejoin the cluster correctly; it shows this message over and over (alerts are sent to Opsgenie at T11:11:56):
It seems that alertmanager-1, the one I did not restart, has trouble rejoining the cluster. I don't know if this is related, but I wanted to let you know in case it helps. Thanks a lot. Regards.
Is this issue forgotten? I've encountered it in 0.22.0. Same story: the order of alerts received from Prometheus after a restart decides whether an alert is inhibited in time (before the notification is sent). It sounds logical, and group_wait should technically prevent this kind of situation. But, as mentioned, group_wait is not respected :(
I never solved it, actually. The problem persisted and we changed the way we dismissed alerts. I hope this can be resolved someday somehow.
I made this PR to address the issue: #3167. We are currently using it and it has resolved the problem for us.
I think @simonpasquier is 💯. When using Prometheus with Alertmanager, the inhibiting rule must be evaluated before the rule it inhibits. You should be able to achieve this in Prometheus by adhering to the following rules:
This should mean that alerts from the inhibiting rule are sent to the Alertmanager either before or in the same request as alerts from any other rules it inhibits. There can be occasions where this doesn't happen, for example when the outbound queue of alerts waiting to be sent to Alertmanager is full and so the oldest alerts are dropped. I know that inhibiting is actually related to alerts rather than rules, but I find it helps to think about them as rules.
Example 1
This example shows an inhibiting rule that needs to inhibit rules across a single group.
groups:
- name: Inhibit example 1
rules:
- alert: Inhibiting rule
expr: 1
for: 0s
labels:
inhibit: "true"
annotations:
summary: "This is an inhibiting rule"
- alert: Inhibited rule
expr: 1
for: 5m
labels:
inhibited: "true"
annotations:
summary: "This is an inhibited rule"
Example 2
This example shows an inhibiting rule that needs to inhibit rules across multiple groups. Here the inhibiting rule is duplicated for each intended group. This is OK because Alertmanager deduplicates incoming alerts, even if the alerts come from different groups in Prometheus.
groups:
- name: Inhibit example 1
rules:
- alert: Inhibiting rule
expr: 1
for: 0s
labels:
inhibit: "true"
annotations:
summary: "This is an inhibiting rule"
- alert: Inhibited rule
expr: 1
for: 5m
labels:
inhibited: "true"
annotations:
summary: "This is an inhibited rule"
- name: Inhibit example 2
rules:
- alert: Inhibiting rule
expr: 1
for: 0s
labels:
inhibit: "true"
annotations:
summary: "This is an inhibiting rule. It is a duplicate of the original from Inhibit example 1"
- alert: Inhibited rule 2
expr: 1
for: 5m
labels:
inhibited: "true"
annotations:
summary: "This is another inhibited rule"
Hi,
I would like to ask about a problem we are seeing regarding the inhibition of rules. Let me explain.
We have a Kubernetes cluster configured with 1 Prometheus server and 2 Alertmanagers in cluster mode (HA). The cluster uses preemptible GKE instances.
We have a Prometheus rule that fires at certain hours (between 16 and 18). We want other rules with certain labels not to be sent to Opsgenie while that rule is firing.
What did you do?
When the Alertmanager gets killed by a Kubernetes node failure, notifications are sent for inhibited alerts.
What did you expect to see?
Inhibited alerts stay inhibited.
What did you see instead? Under which circumstances?
A notification from Alertmanager to Opsgenie for an inhibited alert. It seems that the Alertmanager receives the inhibited alert before the inhibiting one. Looking through the log files, we see that the Alertmanager doesn't "learn" the firing alerts from the other member of the cluster; instead, they are received from Prometheus some time later. Does gossip take priority?
I think the inhibition rule is correctly configured, since it works while both Alertmanagers are active.
We tried setting group_interval or group_wait to a bigger number, but it doesn't seem to make any difference.
Environment
Logs from the alertmanager that got killed.
As you can see, at the 14:15:17 mark, the alert for the opsgenie heartbeat (always firing) is learnt from the other peer.
I removed other firing alerts that were not relevant. I changed the alert names; what we have configured in Prometheus is: