
Race condition with inhibited rules #2229

Open
gammore opened this issue Apr 7, 2020 · 21 comments

@gammore

gammore commented Apr 7, 2020

Hi,

I would like to ask about a problem we are seeing regarding the inhibition of rules. Let me explain.

We have a kubernetes cluster configured with 1 prometheus server and 2 alertmanagers in cluster mode (HA). The cluster uses preemptible GKE instances.

We have a Prometheus rule that fires at certain hours (between 16:00 and 18:00). While that rule is firing, we want alerts from other rules with certain labels not to be sent to Opsgenie.

What did you do?
When the alertmanager gets killed by a Kubernetes node failure, notifications are sent for inhibited alerts.

What did you expect to see?
Inhibited alerts stay inhibited.

What did you see instead? Under which circumstances?
A notification from alertmanager to Opsgenie for an inhibited alert. It seems that the alertmanager receives the inhibited alert before the inhibiting one. Looking through the log files, we see that the alertmanager doesn't "learn" the firing alerts from the other member of the cluster; instead, they are received from Prometheus some time later. Does gossip take priority?

I think the inhibition rule is configured correctly, since inhibition works while both alertmanagers are active.

We tried setting group_interval or group_wait to a larger value, but it doesn't seem to make any difference.

Environment

  • System information:

Linux 4.14.137+ x86_64

  • Alertmanager version:

alertmanager, version 0.19.0 (branch: HEAD, revision: 7aa5d19)
build user: root@587d0268f963
build date: 20190903-15:01:40
go version: go1.12.8

  • Prometheus version:

prometheus, version 2.13.1 (branch: HEAD, revision: 6f92ce56053866194ae5937012c1bec40f1dd1d9)
build user: root@88e419aa1676
build date: 20191017-13:15:01
go version: go1.13.1

  • Alertmanager configuration file:
receivers:
  - name: default-receiver
    opsgenie_configs:
	(...)
  - name: blackhole
route:
  group_wait: 120s
  group_interval: 5m
  receiver: default-receiver
  repeat_interval: 168h
  group_by: ['cluster', 'service', 'deployment', 'replicaset', 'alertname', 'objectid', 'alertid', 'resourceid']
  routes:
    #Alerts with severity set to blackhole are not sent to Opsgenie
    #Usually used for prometheus/alertmanager tricks (e.g. a build dependency for some proxy exporters)
    - match:
        severity: blackhole
      receiver: blackhole
      continue: false
    #If support is set to disable, don't send a notification
    - match_re:
        support: disable
      receiver: blackhole
      continue: false
    - match:
        severity: test
      receiver: default-receiver-test
      continue: false
    - match:
        app: opsgenie
      receiver: opsgenie
      repeat_interval: 1m
      group_interval: 1m
      continue: false
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'cluster', 'service']
# This is the inhibition rule; the QuietHours alert fires between 16:00 and 18:00
- source_match:
    alertname: QuietHours
  target_match:
    type: "quiethours"
  • Alertmanager deployment file:
- name: prometheus-alertmanager
  image: "prom/alertmanager:v0.19.0"
  imagePullPolicy: "IfNotPresent"
  args:
    - --config.file=/etc/config/alertmanager.yml
    - --storage.path=/data
    - --log.level=debug
    - --cluster.settle-timeout=2m
    - --web.external-url=(...)
    - --cluster.listen-address=0.0.0.0:11677
    - --cluster.peer=<alertmanager0-kubernetes dns>
    - --cluster.peer=<alertmanager1-kubernetes dns>
  ports:
    - containerPort: 9093
  • Logs:
    Logs from the alertmanager that got killed.

As you can see, at the 14:15:17 mark, the alert for the opsgenie heartbeat (always firing) is learnt from the other peer.

level=info ts=2020-04-01T14:14:54.223Z caller=main.go:217 msg="Starting Alertmanager" version="(version=0.19.0, branch=HEAD, revision=7aa5d19fea3f58e3d27dbdeb0f2883037168914a)"
level=info ts=2020-04-01T14:14:54.223Z caller=main.go:218 build_context="(go=go1.12.8, user=root@587d0268f963, date=20190903-15:01:40)"
level=debug ts=2020-04-01T14:14:54.422Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=10.171.19.26:19604,10.171.16.110:19604
level=info ts=2020-04-01T14:14:54.423Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=10.171.16.110 port=19604
level=debug ts=2020-04-01T14:14:54.622Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01E4V1GK6QA4W11YX2RFCZ0MDP addr=10.171.16.110:19604
level=debug ts=2020-04-01T14:14:54.726Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 14:14:54 [DEBUG] memberlist: Initiating push/pull sync with: 10.171.19.26:19604\n"
level=debug ts=2020-04-01T14:14:54.728Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01E4SZ9ADGPQYACVAAXY1JVRM1 addr=10.171.19.26:19604
level=debug ts=2020-04-01T14:14:54.728Z caller=cluster.go:470 component=cluster msg="peer rejoined" peer=01E4SZ9ADGPQYACVAAXY1JVRM1
level=debug ts=2020-04-01T14:14:54.922Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 14:14:54 [DEBUG] memberlist: Stream connection from=10.171.16.110:58314\n"
level=debug ts=2020-04-01T14:14:54.922Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 14:14:54 [DEBUG] memberlist: Initiating push/pull sync with: 10.171.16.110:19604\n"
level=debug ts=2020-04-01T14:14:54.923Z caller=cluster.go:233 component=cluster msg="joined cluster" peers=2
level=info ts=2020-04-01T14:14:55.022Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2020-04-01T14:14:57.023Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.000956057s
level=debug ts=2020-04-01T14:14:59.222Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=4.199898351s
level=debug ts=2020-04-01T14:15:01.522Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=6.499954464s
level=info ts=2020-04-01T14:15:01.922Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/config/alertmanager.yml
level=info ts=2020-04-01T14:15:02.222Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/config/alertmanager.yml
level=warn ts=2020-04-01T14:15:02.623Z caller=main.go:421 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route={}
level=warn ts=2020-04-01T14:15:02.623Z caller=main.go:421 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{severity=\"blackhole\"}"
level=warn ts=2020-04-01T14:15:02.623Z caller=main.go:421 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{support=~\"^(?:disable)$\"}"
level=warn ts=2020-04-01T14:15:02.623Z caller=main.go:421 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{severity=\"test\"}"
level=info ts=2020-04-01T14:15:02.722Z caller=main.go:466 msg=Listening address=:9093
level=debug ts=2020-04-01T14:15:03.522Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=8.500316247s
level=info ts=2020-04-01T14:15:05.524Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.501884671s

# Opsgenie heartbeat alert learnt from peer
level=debug ts=2020-04-01T14:15:17.122Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{app=\\\"opsgenie\\\"}:{alertname=\\\"Prometheus OpsGenie HeartBeat\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"webhook\" > timestamp:<seconds:1585750516 nanos:935480882 > firing_alerts:11023415806717871155 > expires_at:<seconds:1586182516 nanos:935480882 > "

# Inhibited alert gets logged but flushed
level=debug ts=2020-04-01T14:15:26.230Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert="This alert should not fire [4c6d7e9][active]"
level=debug ts=2020-04-01T14:15:26.230Z caller=dispatch.go:432 component=dispatcher aggrGroup="{}:{alertid=\"ALERTID\", alertname=\"This alert should not fire \", resourceid=\"DONOTFIRE\"}" msg=flushing alerts="[This alert should not fire [4c6d7e9][active]]"
level=debug ts=2020-04-01T14:15:37.735Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 14:15:37 [DEBUG] memberlist: Stream connection from=10.171.19.26:49564\n"

# Alert is sent to opsgenie
level=debug ts=2020-04-01T14:15:41.429Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"ALERTID\", alertname=\"This alert should not fire\", resourceid=\"DONOTFIRE\"}"

# The inhibition alert (this one has the tag alertname="quiethours")
level=debug ts=2020-04-01T14:15:42.324Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert="Scheduler for DONOTFIRE[2527f9a][active]"
level=debug ts=2020-04-01T14:15:42.423Z caller=dispatch.go:432 component=dispatcher aggrGroup="{}/{severity=\"blackhole\"}:{alertname=\"Scheduler for DONOTFIRE\"}" msg=flushing alerts="[Scheduler for DONOTFIRE[2527f9a][active]]"
level=debug ts=2020-04-01T14:15:56.204Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert="This alert should not fire[4c6d7e9][active]"

I removed other firing alerts that were not relevant. I also changed the alert names; what we have configured in Prometheus is:

# The inhibited alert
alertid: ALERTID
alertname: This alert should not fire
resourceid: DONOTFIRE
type: "quiethours"

# The scheduler alert
alertid: XXXX
alertname: Scheduler for DONOTFIRE

# The inhibiting alert
alertname: QuietHours
expression: < fire everyday between 16 and 18 >
@simonpasquier
Member

Looking through the log files, we see that the alertmanager doesn't "learn" the firing alerts from the other member of the cluster; instead, they are received from Prometheus some time later. Does gossip take priority?

That's correct: alerts aren't replicated between Alertmanager instances and this is by design. The replication of notification logs is what you're seeing in the logs.

Are you sure that your Prometheus is sending alerts to both Alertmanager pods?
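
One way to check this from the Prometheus side (a sketch, using Prometheus' own notifier metrics, which carry an alertmanager label per endpoint) is to compare the delivery counters for the two pods:

# Alerts successfully delivered, broken down by Alertmanager endpoint
rate(prometheus_notifications_sent_total[5m])
# Delivery errors, broken down by Alertmanager endpoint
rate(prometheus_notifications_errors_total[5m])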

@gammore
Author

gammore commented Apr 15, 2020

Hi @simonpasquier, sorry for the delay.

Are you sure that your Prometheus is sending alerts to both Alertmanager pods?

This is my prometheus configuration:

global:
  scrape_interval: 1m
  scrape_timeout: 10s
  evaluation_interval: 15s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - prometheus-alertmanager-0.prometheus-alertmanager-headless:9093
      - prometheus-alertmanager-1.prometheus-alertmanager-headless:9093
      - prometheus-alertmanager-headless:9093
    scheme: http
    timeout: 10s
    api_version: v1
rule_files:
- /etc/config/rules
- /etc/config/alerts

Being the prometheus-alertmanager-headless service:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
    component: alertmanager
  name: prometheus-alertmanager-headless
spec:
  clusterIP: None
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 9093
  - name: meshpeer
    port: 6783
    protocol: TCP
    targetPort: 6783
  selector:
    app: prometheus
    component: alertmanager
  sessionAffinity: None
  type: ClusterIP

That's correct: alerts aren't replicated between Alertmanager instances and this is by design. The replication of notification logs is what you're seeing in the logs.

I don't think I understand this. Are you referring to this line?

level=debug ts=2020-04-01T14:15:17.122Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{app=\\\"opsgenie\\\"}:{alertname=\\\"Prometheus OpsGenie HeartBeat\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"webhook\" > timestamp:<seconds:1585750516 nanos:935480882 > firing_alerts:11023415806717871155 > expires_at:<seconds:1586182516 nanos:935480882 > "

What puzzles me is that I see this line only for the HeartBeat alert, but not for the other one.

@simonpasquier
Member

I'm not sure why you've configured prometheus-alertmanager-headless in addition to prometheus-alertmanager-0.prometheus-alertmanager-headless and prometheus-alertmanager-1.prometheus-alertmanager-headless.
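
For illustration, this is the kind of alerting block that suggestion points to (a sketch derived from the configuration above, not a verified fix): keep only the two per-pod targets, since the headless service entry resolves to the same pods and is redundant.

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - prometheus-alertmanager-0.prometheus-alertmanager-headless:9093
      - prometheus-alertmanager-1.prometheus-alertmanager-headless:9093
    scheme: http
    timeout: 10s
    api_version: v1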

@gammore
Author

gammore commented Apr 16, 2020

I'm also not sure about that, since I did not build it. If you think that is causing the issue, we can try changing it.

I can assure you that the alerts are being received, since part of the problem is that they arrive before the inhibiting alert.

Regards.

@simonpasquier
Member

Somehow I read your initial report too quickly and missed the fact that the "issue" only happens when an Alertmanager instance is recreated...

Most probably, Prometheus evaluates the inhibited alert (a) before the inhibiting alert (b), so the new Alertmanager receives (a) before (b). As I said earlier, alerts aren't replicated across Alertmanagers, only notification logs, which play no part in inhibitions. Are alerts (a) and (b) declared in different rule groups?
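
For illustration only (the alert names and expressions below are hypothetical, not taken from this setup), "same rule group" means both alerting rules sit under a single group in the Prometheus rule file, so the inhibiting alert is evaluated together with, and normally sent no later than, the alerts it inhibits:

groups:
- name: quiet-hours                      # hypothetical group name
  rules:
  - alert: QuietHours                    # inhibiting alert (b)
    expr: hour() >= 16 and hour() < 18   # hour() is in UTC
    labels:
      severity: blackhole
  - alert: SomeNoisyAlert                # hypothetical inhibited alert (a)
    expr: up == 0
    labels:
      type: quiethours                   # matches the inhibit rule's target_match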

@gammore
Author

gammore commented Apr 17, 2020

Somehow I read your initial report too quickly and missed the fact that the "issue" only happens when an Alertmanager instance is recreated...

Most probably, Prometheus evaluates the inhibited alert (a) before the inhibiting alert (b), so the new Alertmanager receives (a) before (b). As I said earlier, alerts aren't replicated across Alertmanagers, only notification logs, which play no part in inhibitions. Are alerts (a) and (b) declared in different rule groups?

We have various groups configured:

route:
  receiver: default-receiver
  group_by:
  - cluster
  - service
  - deployment
  - replicaset
  - alertname
  - objectid
  - alertid
  - resourceid
  routes:
  - receiver: blackhole
    match:
      severity: blackhole
  - receiver: blackhole
(...)
  group_wait: 2m
  group_interval: 5m
  repeat_interval: 1w

And the alerts have different alertname, alertid and resourceid values. Is this what you meant by different rule groups?

Is there a way to "wait for all alerts to arrive before starting to fire"? Alerts (a) and (b) both arrive at some point; it is the order in which they arrive that is difficult to control.

Thanks for all the help!

@simonpasquier
Member

You can increase the group_wait interval to a value large enough that you're guaranteed to receive both alert (a) and alert (b).
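
A minimal sketch of that suggestion on the top-level route (the 5m value is only an example; pick something comfortably larger than the gap between the two alerts):

route:
  receiver: default-receiver
  group_wait: 5m       # wait this long before the first notification for a new group
  group_interval: 5m
  repeat_interval: 168h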

@gammore
Author

gammore commented Apr 23, 2020

Hi,

Thanks @simonpasquier let me try that and get back to you.

Thanks!

@gammore
Author

gammore commented Apr 23, 2020

Hi,

I tried what you said, but it didn't work. I think something I'm doing might be wrong, since the documentation states:

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

I set mine to 5m:

route:
  receiver: default-receiver
  group_by:
  - cluster
  - service
  - deployment
  - replicaset
  - alertname
  - objectid
  - alertid
  - resourceid
  routes:
  - receiver: blackhole
    match:
      severity: blackhole
  - receiver: blackhole
(...)
  group_wait: 5m
  group_interval: 5m
  repeat_interval: 1w

But the alertmanager doesn't wait 5 minutes to send alerts:

(... Starting up ...)
level=debug ts=2020-04-23T14:23:54.021Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=8.399673249s
level=info ts=2020-04-23T14:23:56.120Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.499163012s
level=debug ts=2020-04-23T14:23:59.735Z caller=cluster.go:306 component=cluster memberlist="2020/04/23 14:23:59 [DEBUG] memberlist: Stream connection from=10.171.21.7:60878\n"

# ~30s later the inhibited alert arrives (the alert that we don't want)
#   It gets flushed right away
level=debug ts=2020-04-23T14:24:26.420Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert="This alert should not fire[292595e][active]"
level=debug ts=2020-04-23T14:24:26.421Z caller=dispatch.go:432 component=dispatcher aggrGroup="{}:{alertid=\"ALERTID\", alertname=\"This alert should not fire\", resourceid=\"DONOTFIRE\"}" msg=flushing alerts="[This alert should not fire[035ccee][active]]"

# And 20 seconds later it is sent to opsgenie:
level=debug ts=2020-04-23T14:24:41.421Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"ALERTID\", alertname=\"This alert should not fire\", resourceid=\"DONOTFIRE\"}"

# About 10s later we receive the inhibition alert, but the notification has already been sent.

Correct me if I'm wrong, but I thought the alertmanager should wait for the group_wait time before sending alerts. Is that correct?

Thanks a lot.

@simonpasquier
Member

There's something off with your configuration. I'd check the configuration from the Status page in the UI and make sure that group_wait is 5m for the route.
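
If the UI is awkward to reach, the same check can be done from inside the pod (a sketch, assuming the default port 9093 and the config path used above; amtool ships in the official alertmanager image):

# Validate the file on disk
amtool check-config /etc/config/alertmanager.yml
# Dump the configuration the running instance actually loaded (what the Status page shows)
curl -s http://localhost:9093/api/v2/status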

@gammore
Author

gammore commented May 27, 2020

Hello,

Thanks for the feedback. I was not sure what you meant, so I ended up setting a group_wait for each of them (this is from the status page):

  routes:
  - receiver: blackhole
    match:
      severity: blackhole
  - receiver: blackhole
    match_re:
      support: disable
  - receiver: default-receiver-test
    match:
      severity: test
  - receiver: opsgenie
    match:
      app: opsgenie
    group_wait: 5m
    group_interval: 5m
    repeat_interval: 1m
  group_wait: 5m
  group_interval: 5m
  repeat_interval: 1w

The alert we receive (and shouldn't receive) goes through the opsgenie receiver. As you can see, both group_wait and group_interval are set to 5m.

These are the logs from the latest test. I think I noticed something else on this occasion.

We received two alerts for the same alertid "DONOTFIRE". Those alerts are for instances that become unreachable, and in this case two of them were down.
They appear in the logs with IDs 292595e and 035ccee.

# STARTUP
level=info ts=2020-05-27T14:17:40.077Z caller=main.go:217 msg="Starting Alertmanager" version="(version=0.19.0, branch=HEAD, revision=7aa5d19fea3f58e3d27dbdeb0f2883037168914a)"
level=info ts=2020-05-27T14:17:40.077Z caller=main.go:218 build_context="(go=go1.12.8, user=root@587d0268f963, date=20190903-15:01:40)"
level=debug ts=2020-05-27T14:17:40.277Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=10.171.17.4:18086,10.171.18.10:18086
level=info ts=2020-05-27T14:17:40.278Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=10.171.17.4 port=18086
level=debug ts=2020-05-27T14:17:40.476Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01E9B7XX8RM44DPV2WPAKX39Z8 addr=10.171.17.4:18086
level=debug ts=2020-05-27T14:17:40.477Z caller=cluster.go:306 component=cluster memberlist="2020/05/27 14:17:40 [DEBUG] memberlist: Stream connection from=10.171.17.4:60666\n"
level=debug ts=2020-05-27T14:17:40.576Z caller=cluster.go:306 component=cluster memberlist="2020/05/27 14:17:40 [DEBUG] memberlist: Initiating push/pull sync with: 10.171.17.4:18086\n"
level=debug ts=2020-05-27T14:17:40.676Z caller=cluster.go:306 component=cluster memberlist="2020/05/27 14:17:40 [DEBUG] memberlist: Initiating push/pull sync with: 10.171.18.10:18086\n"
level=debug ts=2020-05-27T14:17:40.677Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01E9B6S1D42C9SRJ2S9YCMZ1VW addr=10.171.18.10:18086
level=debug ts=2020-05-27T14:17:40.677Z caller=cluster.go:470 component=cluster msg="peer rejoined" peer=01E9B6S1D42C9SRJ2S9YCMZ1VW
level=debug ts=2020-05-27T14:17:40.677Z caller=cluster.go:233 component=cluster msg="joined cluster" peers=2
level=info ts=2020-05-27T14:17:40.976Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2020-05-27T14:17:42.976Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.000101095s
level=info ts=2020-05-27T14:17:45.378Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/config/alertmanager.yml
level=info ts=2020-05-27T14:17:45.477Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/config/alertmanager.yml
level=warn ts=2020-05-27T14:17:45.578Z caller=main.go:421 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route={}
level=warn ts=2020-05-27T14:17:45.676Z caller=main.go:421 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{severity=\"blackhole\"}"
level=warn ts=2020-05-27T14:17:45.676Z caller=main.go:421 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{support=~\"^(?:disable)$\"}"
level=warn ts=2020-05-27T14:17:45.676Z caller=main.go:421 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{severity=\"test\"}"
level=debug ts=2020-05-27T14:17:45.677Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=4.700707947s
level=debug ts=2020-05-27T14:17:46.585Z caller=cluster.go:306 component=cluster memberlist="2020/05/27 14:17:46 [DEBUG] memberlist: Stream connection from=10.171.18.10:41006\n"
level=debug ts=2020-05-27T14:17:47.677Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=6.700861376s
level=debug ts=2020-05-27T14:17:49.677Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=8.701000597s
level=info ts=2020-05-27T14:17:51.678Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.701161799s

# (.....) We receive two alerts for that (This is an alert that notifies unreachable instances, in this case two of them are down so we get 2)
level=debug ts=2020-05-27T14:18:11.478Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert="This alert should not fire 1[292595e][active]"
level=debug ts=2020-05-27T14:18:11.478Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert="This alert should not fire 2[035ccee][active]"

# (....) The first one gets flushed right away (292595e)
level=debug ts=2020-05-27T14:18:11.877Z caller=dispatch.go:432 component=dispatcher aggrGroup="{}:{alertid=\"ALERTID\", alertname=\"This alert should not fire\", resourceid=\"RESOURCEID1\"}" msg=flushing alerts="[AWS EC2 Standalone Instance is Unreachable for .*TEST.*[292595e][active]]"

# (...) Moments later we receive the inhibition alert that goes to the blackhole as expected
level=debug ts=2020-05-27T14:18:12.277Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert="Scheduler for DONOTFIRE [2527f9a][active]"

# (...) The other alert (035ccee) gets also flushed
level=debug ts=2020-05-27T14:18:12.477Z caller=dispatch.go:432 component=dispatcher aggrGroup="{}:{alertid=\"ALERTID\", alertname=\"This alert should not fire\", resourceid=\"RESOURCEID2\"}" msg=flushing alerts="[This alert should not fire[035ccee][active]]"

# (...) And then, the first alert is sent to opsgenie
level=debug ts=2020-05-27T14:18:26.977Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"ALERTID\", alertname=\"This alert should not fire\", resourceid=\"RESOURCEID1\"}"

I don't understand what could be happening; we can confirm that this works under normal circumstances. While both alertmanagers are running, alerts are inhibited correctly, but when one of them goes down, this happens.

It is also strange to see, since the inhibition alert now arrives before the alert that we don't want, yet it still doesn't work.

Could you take a look at this? Maybe with this info you can figure something out.
I'm happy to run any test you need.

Regards.

@gammore
Author

gammore commented May 27, 2020

Hi,

Sorry for the long post; here is an update. I messed up the routes, thinking this alert went to the opsgenie resource, and I was wrong: this alert was going to the "test" one. My apologies.

I changed the resource and it is now working with the 5-minute group_wait. Tomorrow I will apply this to the other alertmanagers we have.

I think this can be closed now.

@gammore gammore closed this as completed May 27, 2020
@simonpasquier
Member

@AugerC no problem! thanks for the heads-up :)

@gammore
Author

gammore commented Jun 23, 2020

Hi,

I'm sorry to reopen the issue; I think group_wait is not working as expected when both alertmanagers die.

This is my alertmanager.yaml.
I checked the YAML against the alertmanager "status" page and it is being applied.

receivers:
  - name: default-receiver
    <opsgenie configs>
  - name: default-receiver-test
    <opsgenie configs>
  - name: opsgenie
    webhook_configs:
    <opsgenie heartbeat>
  - name: blackhole
route:
  group_wait: 5m
  group_interval: 5m
  receiver: default-receiver
  repeat_interval: 168h
  group_by: ['cluster', 'service', 'deployment', 'replicaset', 'alertname', 'objectid', 'alertid', 'resourceid']
  routes:
    - match:
        severity: blackhole
      receiver: blackhole
      continue: false
    - match_re:
        support: disable
      receiver: blackhole
      continue: false
    - match:
        severity: test
      receiver: default-receiver-test
      group_interval: 5m
      group_wait: 5m
      continue: false
    - match:
        app: opsgenie
      receiver: opsgenie
      group_interval: 1m
      repeat_interval: 1m
      continue: false
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
- source_match:
    scheduler: DONOTFIRE
  target_match:
    schedulername: DONOTFIRE

The alertmanager flags:

      --config.file=/etc/config/alertmanager.yml
      --storage.path=/data
      --log.level=debug
      --web.external-url=<url>
      --cluster.listen-address=0.0.0.0:11746
      --cluster.settle-timeout=5m
      --cluster.reconnect-interval=3s
      --cluster.pushpull-interval=30s
      --cluster.peer=prometheus-alertmanager-0.prometheus-alertmanager-headless.<namespace>.svc.cluster.local:11746
      --cluster.peer=prometheus-alertmanager-1.prometheus-alertmanager-headless.<namespace>.svc.cluster.local:11746

The service prometheus-alertmanager-headless:

 > kubectl describe svc prometheus-alertmanager-headless
Name:              prometheus-alertmanager-headless
Namespace:         <namespace>
Labels:            app=prometheus
                   component=alertmanager
                   release=<release>
Annotations:       <none>
Selector:          app=prometheus,component=alertmanager,release=<release>
Type:              ClusterIP
IP:                None
Port:              http  80/TCP
TargetPort:        9093/TCP
Endpoints:         10.171.6.27:9093,10.171.8.32:9093
Port:              meshpeer  6783/TCP
TargetPort:        6783/TCP
Endpoints:         10.171.6.27:6783,10.171.8.32:6783
Session Affinity:  None
Events:            <none>

When I kill the alertmanagers, the alerts that are constantly firing on Prometheus do not respect the 5m period.

kubectl delete pod prometheus-alertmanager-0 prometheus-alertmanager-1

Startup log (alertmanager-0):

level=info ts=2020-06-23T07:43:09.223Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2020-06-23T07:43:09.223Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=debug ts=2020-06-23T07:43:09.425Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=10.171.8.30:11693,10.171.6.26:11693
level=info ts=2020-06-23T07:43:09.623Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=10.171.8.30 port=11693
level=debug ts=2020-06-23T07:43:09.624Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01EBG22Y9QVSWP31AZV9QKQ534 addr=10.171.8.30:11693
level=debug ts=2020-06-23T07:43:09.723Z caller=cluster.go:306 component=cluster memberlist="2020/06/23 07:43:09 [DEBUG] memberlist: Initiating push/pull sync with:  10.171.8.30:11693\n"
level=debug ts=2020-06-23T07:43:09.723Z caller=cluster.go:306 component=cluster memberlist="2020/06/23 07:43:09 [DEBUG] memberlist: Stream connection from=10.171.8.30:50760\n"
level=debug ts=2020-06-23T07:43:09.925Z caller=cluster.go:306 component=cluster memberlist="2020/06/23 07:43:09 [DEBUG] memberlist: Initiating push/pull sync with:  10.171.6.26:11693\n"
level=debug ts=2020-06-23T07:43:09.926Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01EBG1GFS629J5TQVFB9AMRZEY addr=10.171.6.26:11693
level=debug ts=2020-06-23T07:43:10.023Z caller=cluster.go:470 component=cluster msg="peer rejoined" peer=01EBG1GFS629J5TQVFB9AMRZEY
level=debug ts=2020-06-23T07:43:10.024Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{app=\\\"opsgenie\\\"}:{alertname=\\\"Prometheus OpsGenie HeartBeat\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"webhook\" > timestamp:<seconds:1592898161 nanos:735099400 > firing_alerts:11023415806717871155 > expires_at:<seconds:1593330161 nanos:735099400 > "
level=debug ts=2020-06-23T07:43:10.026Z caller=cluster.go:233 component=cluster msg="joined cluster" peers=2
level=info ts=2020-06-23T07:43:10.026Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2020-06-23T07:43:12.124Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.097914769s
level=debug ts=2020-06-23T07:43:14.323Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=4.29614162s
level=debug ts=2020-06-23T07:43:16.124Z caller=main.go:355 externalURL=<url>
level=info ts=2020-06-23T07:43:16.124Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/config/alertmanager.yml
level=info ts=2020-06-23T07:43:16.323Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/config/alertmanager.yml
level=debug ts=2020-06-23T07:43:16.326Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=6.299865568s
level=warn ts=2020-06-23T07:43:16.823Z caller=main.go:437 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route={}
level=warn ts=2020-06-23T07:43:16.824Z caller=main.go:437 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{severity=\"blackhole\"}"
level=warn ts=2020-06-23T07:43:16.825Z caller=main.go:437 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{support=~\"^(?:disable)$\"}"
level=info ts=2020-06-23T07:43:16.833Z caller=main.go:485 msg=Listening address=:9093
level=debug ts=2020-06-23T07:43:18.140Z caller=cluster.go:306 component=cluster memberlist="2020/06/23 07:43:18 [DEBUG] memberlist: Stream connection from=10.171.6.26:51542\n"
level=debug ts=2020-06-23T07:43:18.328Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=8.301724693s
level=info ts=2020-06-23T07:43:20.330Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.303722177s

# A ping against the other peer fails since it is also restarting
level=debug ts=2020-06-23T07:43:31.227Z caller=cluster.go:306 component=cluster memberlist="2020/06/23 07:43:31 [DEBUG] memberlist: Failed ping: 01EBG1GFS629J5TQVFB9AMRZEY (timeout reached)\n"
level=debug ts=2020-06-23T07:43:31.631Z caller=cluster.go:306 component=cluster memberlist="2020/06/23 07:43:31 [ERR] memberlist: Failed fallback ping: read tcp 10.171.8.30:45898->10.171.6.26:11693: i/o timeout\n"
level=debug ts=2020-06-23T07:43:41.349Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[e38ebb7][active]"
level=debug ts=2020-06-23T07:43:41.349Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[c170d2d][active]"
level=debug ts=2020-06-23T07:43:41.349Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[d9869b8][active]"
level=debug ts=2020-06-23T07:43:41.349Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[72245f3][active]"
level=debug ts=2020-06-23T07:43:41.349Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[7d13c4b][active]"
level=debug ts=2020-06-23T07:43:41.349Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Prometheus OpsGenie HeartBeat[f4b6014][active]"
level=debug ts=2020-06-23T07:43:41.349Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{app=\"opsgenie\"}:{alertname=\"Prometheus OpsGenie HeartBeat\"}" msg=flushing alerts="[Prometheus OpsGenie HeartBeat[f4b6014][active]]"

#Logged 5 times, once per alert (e38ebb7, c170d2d, d9869b8, 72245f3, 7d13c4b), each one in a different aggrGroup
level=debug ts=2020-06-23T07:43:41.349Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource1\"}" msg=flushing alerts="[Instance is Unreachable for TEST.*[e38ebb7][active]]"

# Inhibition alert
level=debug ts=2020-06-23T07:43:42.245Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Scheduler for TEST.*[6f27987][active]"
level=debug ts=2020-06-23T07:43:42.246Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"blackhole\"}:{alertname=\"Scheduler for TEST.*\"}" msg=flushing alerts="[Scheduler for TEST.*[6f27987][active]]"

level=debug ts=2020-06-23T07:43:42.335Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{app=\\\"opsgenie\\\"}:{alertname=\\\"Prometheus OpsGenie HeartBeat\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"webhook\" > timestamp:<seconds:1592898222 nanos:134704667 > firing_alerts:11023415806717871155 > expires_at:<seconds:1593330222 nanos:134704667 > "
level=debug ts=2020-06-23T07:43:48.144Z caller=cluster.go:306 component=cluster memberlist="2020/06/23 07:43:48 [DEBUG] memberlist: Stream connection from=10.171.6.26:52262\n"

# Alerts are sent to opsgenie less than 5 minutes after startup. The other alertmanager does not send the alert to opsgenie.
level=debug ts=2020-06-23T07:43:56.350Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource1\"}"
level=debug ts=2020-06-23T07:43:56.424Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource2\"}"
level=debug ts=2020-06-23T07:43:56.424Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource3\"}"
level=debug ts=2020-06-23T07:43:56.424Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource4\"}"
level=debug ts=2020-06-23T07:43:56.623Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource5\"}"

I don't understand what is happening in this log trace; I thought that if I set:

    - match:
        severity: test
      receiver: default-receiver-test
      group_interval: 5m
      group_wait: 5m
      continue: false

Every group that is created should wait 5m before sending its alerts to the receiver. Am I wrong?

This time I can confirm that the alerts we receive are going through the route with the 5m group_wait, since that receiver goes to a special team on Opsgenie.

Let me know if you need more information. Sorry again for the trouble.

Regards.

@gammore gammore reopened this Jun 23, 2020
@simonpasquier
Member

can you share the full logs?

@gammore
Author

gammore commented Jun 25, 2020

Hi,

What do you mean? Those are the full logs for the alertmanager-0 pod. I hardly omitted any lines; I just changed the names of things (like the alert names).

Do you need the logs for the other alertmanager? They were both starting up at the same time, but I think alertmanager-0 is the one sending the alerts to Opsgenie.

Thanks a lot.

@gammore
Author

gammore commented Jul 15, 2020

Hi,

Just to update you, we tested upgrading everything to the latest version:

  • alertmanager 0.21.0
  • prometheus 2.19.2

The problem persists. In this example I deleted just alertmanager-0. Below are the logs from the restarted alertmanager:
alertmanager-0 - logs

level=info ts=2020-07-15T11:10:48.973Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2020-07-15T11:10:48.974Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=debug ts=2020-07-15T11:10:49.173Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=10.171.4.13:18976,10.171.9.4:18976
level=info ts=2020-07-15T11:10:49.276Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=10.171.4.13 port=18976
level=debug ts=2020-07-15T11:10:49.473Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01ED92PZXW18SXNAP176VTJ8D6 addr=10.171.4.13:18976
level=debug ts=2020-07-15T11:10:49.573Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:10:49 [DEBUG] memberlist: Initiating push/pull sync with:  10.171.4.13:18976\n"
level=debug ts=2020-07-15T11:10:49.573Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:10:49 [DEBUG] memberlist: Stream connection from=10.171.4.13:38102\n"
level=debug ts=2020-07-15T11:10:49.774Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:10:49 [DEBUG] memberlist: Initiating push/pull sync with:  10.171.9.4:18976\n"
level=debug ts=2020-07-15T11:10:49.775Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01ED92CC2APDCJ94WGHRPSJFGF addr=10.171.9.4:18976
level=debug ts=2020-07-15T11:10:49.775Z caller=cluster.go:470 component=cluster msg="peer rejoined" peer=01ED92CC2APDCJ94WGHRPSJFGF
level=debug ts=2020-07-15T11:10:49.776Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{app=\\\"opsgenie\\\"}:{alertname=\\\"Prometheus OpsGenie HeartBeat\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"webhook\" > timestamp:<seconds:1594811445 nanos:214507143 > firing_alerts:11023415806717871155 > expires_at:<seconds:1595243445 nanos:214507143 > "
level=debug ts=2020-07-15T11:10:49.776Z caller=cluster.go:233 component=cluster msg="joined cluster" peers=2
level=info ts=2020-07-15T11:10:49.973Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2020-07-15T11:10:51.974Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.000906042s
level=debug ts=2020-07-15T11:10:54.075Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=4.101919725s
level=debug ts=2020-07-15T11:10:54.075Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:10:54 [DEBUG] memberlist: Stream connection from=10.171.9.4:45792\n"
level=debug ts=2020-07-15T11:10:56.274Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=6.300976912s
level=debug ts=2020-07-15T11:10:57.873Z caller=main.go:355 externalURL=<url>
level=info ts=2020-07-15T11:10:57.879Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/config/alertmanager.yml
level=debug ts=2020-07-15T11:10:58.283Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=8.309942198s
level=info ts=2020-07-15T11:10:58.285Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/config/alertmanager.yml
level=warn ts=2020-07-15T11:10:58.774Z caller=main.go:437 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route={}
level=warn ts=2020-07-15T11:10:58.774Z caller=main.go:437 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{severity=\"blackhole\"}"
level=warn ts=2020-07-15T11:10:58.873Z caller=main.go:437 component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=168h0m0s retention=120h0m0s route="{}/{support=~\"^(?:disable)$\"}"
level=info ts=2020-07-15T11:10:58.874Z caller=main.go:485 msg=Listening address=:9093
level=info ts=2020-07-15T11:11:00.284Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.310761363s



# Init looks ok. This is the moment the alerts are received by the alertmanager. On prometheus, these alerts are always firing. 
level=debug ts=2020-07-15T11:11:41.575Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[e38ebb7][active]"
level=debug ts=2020-07-15T11:11:41.575Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[7d13c4b][active]"
level=debug ts=2020-07-15T11:11:41.575Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[d9869b8][active]"
level=debug ts=2020-07-15T11:11:41.575Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[c170d2d][active]"
level=debug ts=2020-07-15T11:11:41.575Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[72245f3][active]"
level=debug ts=2020-07-15T11:11:41.575Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Prometheus OpsGenie HeartBeat[f4b6014][active]"
level=debug ts=2020-07-15T11:11:41.576Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{app=\"opsgenie\"}:{alertname=\"Prometheus OpsGenie HeartBeat\"}" msg=flushing alerts="[Prometheus OpsGenie HeartBeat[f4b6014][active]]"
level=debug ts=2020-07-15T11:11:41.576Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource1\"}" msg=flushing alerts="[Instance is Unreachable for TEST.*[e38ebb7][active]]"
level=debug ts=2020-07-15T11:11:41.773Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource2\"}" msg=flushing alerts="[Instance is Unreachable for TEST.*[7d13c4b][active]]"
level=debug ts=2020-07-15T11:11:41.773Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource3\"}" msg=flushing alerts="[Instance is Unreachable for TEST.*[d9869b8][active]]"
level=debug ts=2020-07-15T11:11:41.774Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource4\"}" msg=flushing alerts="[Instance is Unreachable for TEST.*[c170d2d][active]]"
level=debug ts=2020-07-15T11:11:41.774Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource5\"}" msg=flushing alerts="[Instance is Unreachable for TEST.*[72245f3][active]]"

# We receive the alert that inhibits the other 5 "Instance is Unreachable for TEST.*" alerts
## This works under normal circumstances; we can see the alerts inhibited in the alertmanager panel.
level=debug ts=2020-07-15T11:11:42.274Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Scheduler for TEST.*[6f27987][active]"
level=debug ts=2020-07-15T11:11:42.274Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"blackhole\"}:{alertname=\"Scheduler for TEST.*\"}" msg=flushing alerts="[Scheduler for TEST.*[6f27987][active]]"
level=debug ts=2020-07-15T11:11:45.993Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:11:45 [DEBUG] memberlist: Stream connection from=10.171.9.4:46386\n"


# Alerts are sent to opsgenie without waiting the 5m group_wait
level=debug ts=2020-07-15T11:11:56.874Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource1\"}"
level=debug ts=2020-07-15T11:11:56.875Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource2\"}"
level=debug ts=2020-07-15T11:11:56.876Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource3\"}"
level=debug ts=2020-07-15T11:11:56.973Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource4\"}"
level=debug ts=2020-07-15T11:11:56.974Z caller=opsgenie.go:119 integration=opsgenie incident="{}:{alertid=\"UNREACHABLE\", alertname=\"Instance is Unreachable for TEST.*\", resourceid=\"Resource5\"}"
level=debug ts=2020-07-15T11:12:02.420Z caller=notify.go:685 component=dispatcher receiver=default-receiver integration=opsgenie[0] msg="Notify success" attempts=1
level=debug ts=2020-07-15T11:12:02.473Z caller=notify.go:685 component=dispatcher receiver=default-receiver integration=opsgenie[0] msg="Notify success" attempts=1
level=debug ts=2020-07-15T11:12:02.473Z caller=notify.go:685 component=dispatcher receiver=default-receiver integration=opsgenie[0] msg="Notify success" attempts=1
level=debug ts=2020-07-15T11:12:02.473Z caller=notify.go:685 component=dispatcher receiver=default-receiver integration=opsgenie[0] msg="Notify success" attempts=1
level=debug ts=2020-07-15T11:12:02.474Z caller=notify.go:685 component=dispatcher receiver=opsgenie integration=webhook[0] msg="Notify success" attempts=1
level=debug ts=2020-07-15T11:12:02.687Z caller=notify.go:685 component=dispatcher receiver=default-receiver integration=opsgenie[0] msg="Notify success" attempts=1
level=debug ts=2020-07-15T11:12:11.474Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[72245f3][active]"
level=debug ts=2020-07-15T11:12:11.474Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[e38ebb7][active]"
level=debug ts=2020-07-15T11:12:11.474Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[7d13c4b][active]"
level=debug ts=2020-07-15T11:12:11.474Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[d9869b8][active]"
level=debug ts=2020-07-15T11:12:11.474Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Instance is Unreachable for TEST.*[c170d2d][active]"
level=debug ts=2020-07-15T11:12:11.474Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Prometheus OpsGenie HeartBeat[f4b6014][active]"
level=debug ts=2020-07-15T11:12:12.453Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Scheduler for TEST.*[6f27987][active]"

I can confirm that inhibition was in fact working (I saw the inhibited alerts on the alertmanager web panel), but the 5-minute group_wait does not work (it seems that the alertmanager does not wait 5m before sending alerts).

What I noticed this time is that the other alertmanager (alertmanager-1) doesn't seem to rejoin the cluster correctly; it shows this message over and over (the alerts are sent to Opsgenie at 11:11:56):

level=debug ts=2020-07-15T11:11:45.996Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:11:45 [DEBUG] memberlist: Initiating push/pull sync with: 01ED92PZXW18SXNAP176VTJ8D6 10.171.4.13:18976\n"
level=debug ts=2020-07-15T11:11:51.758Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:11:51 [DEBUG] memberlist: Failed to join 10.171.3.21: dial tcp 10.171.3.21:18976: connect: no route to host\n"
level=debug ts=2020-07-15T11:11:51.759Z caller=cluster.go:408 component=cluster msg=reconnect result=failure peer= addr=10.171.3.21:18976 err="1 error occurred:\n\t* Failed to join 10.171.3.21: dial tcp 10.171.3.21:18976: connect: no route to host\n\n"
level=debug ts=2020-07-15T11:11:51.762Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:11:51 [DEBUG] memberlist: Failed to join 10.171.4.12: dial tcp 10.171.4.12:18976: connect: no route to host\n"
level=debug ts=2020-07-15T11:11:51.762Z caller=cluster.go:408 component=cluster msg=reconnect result=failure peer=01ED92E5KEG7YQR76HAXNB7FX3 addr=10.171.4.12:18976 err="1 error occurred:\n\t* Failed to join 10.171.4.12: dial tcp 10.171.4.12:18976: connect: no route to host\n\n"
level=debug ts=2020-07-15T11:12:01.760Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:12:01 [DEBUG] memberlist: Failed to join 10.171.3.21: dial tcp 10.171.3.21:18976: connect: no route to host\n"
level=debug ts=2020-07-15T11:12:01.760Z caller=cluster.go:408 component=cluster msg=reconnect result=failure peer= addr=10.171.3.21:18976 err="1 error occurred:\n\t* Failed to join 10.171.3.21: dial tcp 10.171.3.21:18976: connect: no route to host\n\n"
level=debug ts=2020-07-15T11:12:01.763Z caller=cluster.go:306 component=cluster memberlist="2020/07/15 11:12:01 [DEBUG] memberlist: Failed to join 10.171.4.12: dial tcp 10.171.4.12:18976: connect: no route to host\n"
level=debug ts=2020-07-15T11:12:01.763Z caller=cluster.go:408 component=cluster msg=reconnect result=failure peer=01ED92E5KEG7YQR76HAXNB7FX3 addr=10.171.4.12:18976 err="1 error occurred:\n\t* Failed to join 10.171.4.12: dial tcp 10.171.4.12:18976: connect: no route to host\n\n"

# This error is repeated multiple times.

It seems that alertmanager-1, the one I did not restart, has trouble rejoining the cluster. I don't know if this is related, but I wanted to let you know in case it helps.

Thanks a lot.

Regards.

@karolrec

Has this issue been forgotten? I've encountered it in 0.22.0. Same story: the order of the alerts received from Prometheus after a restart decides whether an alert is inhibited in time (before the notification is sent). It sounds logical, and group_wait should technically prevent this kind of situation. But, as mentioned, group_wait is not respected :(

@gammore
Author

gammore commented Jun 26, 2022

I never actually solved it. The problem persisted, and we changed the way we dismiss alerts.

I hope this can be resolved someday somehow.

@alxric

alxric commented Jan 20, 2023

I made this PR to address the issue: #3167

We are currently using this and it has resolved the problem for us

@grobinson-grafana
Contributor

Most probably, Prometheus evaluates the inhibited alert (a) before the inhibiting alert (b), so the new Alertmanager receives (a) before (b). As I said earlier, alerts aren't replicated across Alertmanagers, only notification logs, which play no part in inhibitions. Are alerts (a) and (b) declared in different rule groups?

I think @simonpasquier is 💯 .

When using Prometheus with Alertmanager, the inhibiting rule must be evaluated before the rule it inhibits. You should be able to achieve this in Prometheus by adhering to the following rules:

  1. The inhibiting rule must be in the same group as the rule(s) it inhibits
  2. If an inhibiting rule needs to inhibit rules across multiple groups then the rule must be duplicated for each intended group. This is OK because Alertmanager deduplicates incoming alerts, even if the alerts come from different groups in Prometheus

This should mean that alerts from the inhibiting rule are sent to the Alertmanager either before or in the same request as alerts from any other rules it inhibits. There can be occasions where this doesn't happen, for example when the outbound queue of alerts waiting to be sent to Alertmanager is full and so the oldest alerts are dropped.
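
For reference, the size of that outbound queue is controlled by a Prometheus flag, shown here with its default value (only relevant if alerts are being dropped under load):

--alertmanager.notification-queue-capacity=10000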

I know that inhibiting is actually related to alerts rather than rules, but I find it helps to think about them as rules.

Example 1

This example shows an inhibiting rule that needs to inhibit rules across a single group.

groups:
- name: Inhibit example 1
  rules:
  - alert: Inhibiting rule
    expr: 1
    for: 0s
    labels:
      inhibit: "true"
    annotations:
      summary: "This is an inhibiting rule"
  - alert: Inhibited rule
    expr: 1
    for: 5m
    labels:
      inhibited: "true"
    annotations:
      summary: "This is an inhibited rule"
inhibit_rules:
  - target_matchers:
      - inhibited="true"
    source_matchers:
      - inhibit="true"
  1. The alerts are received:
ts=2023-06-22T23:16:10.883Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-22T23:16:10.885Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibited rule[f33ff29][active]"
  2. Alertmanager crashes and restarts, but the inhibiting rule is still received first:
zsh: killed     ./alertmanager --config.file=config.yml --log.level=debug
ts=2023-06-22T23:17:54.788Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002235958s
ts=2023-06-22T23:18:40.878Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-22T23:18:40.878Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}:{alertname=\"Inhibiting rule\", inhibit=\"true\"}" msg=flushing alerts="[Inhibiting rule[93cf445][active]]"
ts=2023-06-22T23:18:40.879Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibited rule[f33ff29][active]"
ts=2023-06-22T23:18:40.879Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}:{alertname=\"Inhibited rule\", inhibited=\"true\"}" msg=flushing alerts="[Inhibited rule[f33ff29][active]]"

Example 2

This example shows an inhibiting rule that needs to inhibit rules across multiple groups. Here the inhibiting rule is duplicated for each intended group. This is OK because Alertmanager deduplicates incoming alerts, even if the alerts come from different groups in Prometheus.

groups:
- name: Inhibit example 1
  rules:
  - alert: Inhibiting rule
    expr: 1
    for: 0s
    labels:
      inhibit: "true"
    annotations:
      summary: "This is an inhibiting rule"
  - alert: Inhibited rule
    expr: 1
    for: 5m
    labels:
      inhibited: "true"
    annotations:
      summary: "This is an inhibited rule"
- name: Inhibit example 2
  rules:
  - alert: Inhibiting rule
    expr: 1
    for: 0s
    labels:
      inhibit: "true"
    annotations:
      summary: "This is an inhibiting rule. It is a duplicate of the original from Inhibit example 1"
  - alert: Inhibited rule 2
    expr: 1
    for: 5m
    labels:
      inhibited: "true"
    annotations:
      summary: "This is another inhibited rule"
  1. You'll see that Alertmanager receives the same inhibiting alert twice: once for each group. This is OK, Alertmanager will deduplicate them as their fingerprints are the same 93cf445:
ts=2023-06-22T23:20:20.763Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-22T23:20:25.896Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
  2. Alertmanager crashes and restarts, but again the inhibiting rules are still received first:
zsh: killed     ./alertmanager --config.file=config.yml --log.level=debug
ts=2023-06-22T23:28:00.469Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002461667s
ts=2023-06-22T23:28:05.744Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-22T23:28:05.744Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}:{alertname=\"Inhibiting rule\", inhibit=\"true\"}" msg=flushing alerts="[Inhibiting rule[93cf445][active]]"
ts=2023-06-22T23:28:05.744Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibited rule 2[c807ba2][active]"
ts=2023-06-22T23:28:05.744Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}:{alertname=\"Inhibited rule 2\", inhibited=\"true\"}" msg=flushing alerts="[Inhibited rule 2[c807ba2][active]]"
ts=2023-06-22T23:28:05.885Z caller=notify.go:752 level=debug component=dispatcher receiver=test integration=email[0] msg="Notify success" attempts=1
ts=2023-06-22T23:28:10.876Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-22T23:28:10.878Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibited rule[f33ff29][active]"
