
Alertmanager not respecting repeat_interval #2934

Closed
shadow9911 opened this issue May 27, 2022 · 1 comment

Comments

@shadow9911

I'm running Alertmanager 0.21. I have a repeat interval set for different routes and matchers; the repeat_interval is set to 365d. However, once a few alerts are in a FIRING state, they start firing again at seemingly random times. It looks like Alertmanager is not taking my repeat_interval into account. The alertname is always the same, and the FIRING time does not change when a repeated notification is sent.

What did you expect to see?

Alerts not repeated before 365 days

What did you see instead? Under which circumstances?

Alerts are repeated at seemingly random times, roughly once every few days.

Environment

production

  • Alertmanager version: 0.21.0

  • Prometheus version: 2.25.0

  • Alertmanager configuration file:
global:
    resolve_timeout: 5m

route:
  group_by: ['email_to', 'email_to2', 'email_to3', 'email_to4', 'volume', 'instance']
  receiver: email_router
  group_wait: 0s
  group_interval: 5m
  repeat_interval: 365d
  routes:

    - match_re:
        alertname: (^Server.*)$
      repeat_interval: 365d
      receiver: cloud_router
      continue: true
    - match_re:
        alertname: (^Server.*)$
      repeat_interval: 365d
      receiver: email_router
      continue: true

    - match_re:
        alertname: (^Custom.*)$
      repeat_interval: 365d
      receiver: custom_router
      continue: true


templates:
- '/etc/alertmanager/cloud.tpml'


receivers:
- name: cloud_router
  email_configs:
  - to: "{{ .GroupLabels.email_to }}"
    from: "[email protected]"
    headers:
      subject: "{{ .CommonAnnotations.summary }}"
    html: '{{ template "email.cloud.html" . }}'
    send_resolved: true

- name: email_router
  email_configs:
  - to: "{{ .GroupLabels.email_to2 }}"
    from: "[email protected]"
    headers:
      subject: "{{ .CommonAnnotations.summary }}"
    html: '{{ template "email.cloud.html" . }}'
    send_resolved: true


- name: custom_router
  email_configs:
  - to: "{{ .GroupLabels.email_to }}"
    from: "[email protected]"
    headers:
      subject: "{{ .CommonAnnotations.summary }}"
    html: '{{ template "email.cloud.html" . }}'
    send_resolved: true

  • Logs:
May 27 14:25:44 hostname alertmanager[3950142]: level=debug ts=2022-05-27T14:25:44.084Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:(^Server.*)$)$\"}:{email_to2=\"\\\"\\\"\", email_to=\"[email protected]\", instance=\"myserver.com:9182\"}" msg=flushing alerts="[Server DOWN[645d1c2][active]]"

May 27 14:25:44 hostname alertmanager[3950142]: level=error ts=2022-05-27T14:25:44.564Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="email_router/email[0]: notify retry canceled due to unrecoverable error after 1 attempts: parse 'to' addresses: mail: missing '@' or angle-addr; email_router/email[1]: notify retry canceled due to unrecoverable error after 1 attempts: parse 'to' addresses: mail: no address"

May 27 14:25:44 hostname alertmanager[3950142]: level=debug ts=2022-05-27T14:25:44.698Z caller=notify.go:685 component=dispatcher receiver=cloud_router integration=email[0] msg="Notify success" attempts=1

The last log line should not occur: the server has been down for ~10 days, so the 365d repeat_interval should still be suppressing notifications.

@shadow9911
Author

Alright, I figured it out. For anyone struggling with a repeat_interval longer than a few days: this breaks because of the --data.retention parameter of Alertmanager. The default data.retention is 120h, i.e. 5 days. In my case, Alertmanager "forgets" the repeat_interval because an alert that is already FIRING gets notified again once the retention period has elapsed. The root of the confusion is that data.retention is poorly documented. I know there is a warning, but it says nothing about what to expect and is easy to miss. I highly recommend adding information about data.retention wherever repeat_interval is discussed in the Prometheus/Alertmanager documentation. It will save people a lot of trouble.

Ref: #1806
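For anyone who wants to apply this, a minimal sketch of the workaround: start Alertmanager with a data retention at least as long as the longest repeat_interval. The --config.file path below is an assumption for illustration; only the --data.retention flag and its 120h default come from the issue above.

alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --data.retention=8760h

Here 8760h corresponds to 365 days. Depending on the Alertmanager version, the flag may only accept plain Go durations (no "d" suffix), so expressing the value in hours is the safe choice.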
