
Alertmanager not respecting repeat_interval #2934

Closed
shadow9911 opened this issue May 27, 2022 · 1 comment

Comments

@shadow9911

I'm running Alertmanager 0.21. I have a repeat interval set for different routes and matchers; the repeat_interval is set to 365d. However, once a few alerts are in a FIRING state, they start firing again at seemingly random times. It looks like Alertmanager is not taking my repeat_interval into account. The alertname is always the same, and the FIRING time does not change when a repeated notification is sent.

What did you expect to see?

Alerts not repeated before 365 days

What did you see instead? Under which circumstances?

Alerts are repeated at seemingly random times, roughly once every few days.

Environment

production

  • Alertmanager version: 0.21.0

  • Prometheus version: 2.25.0

  • Alertmanager configuration file:
global:
    resolve_timeout: 5m

route:
  group_by: ['email_to', 'email_to2', 'email_to3', 'email_to4', 'volume', 'instance']
  receiver: email_router
  group_wait: 0s
  group_interval: 5m
  repeat_interval: 365d
  routes:

    - match_re:
        alertname: (^Server.*)$
      repeat_interval: 365d
      receiver: cloud_router
      continue: true
    - match_re:
        alertname: (^Server.*)$
      repeat_interval: 365d
      receiver: email_router
      continue: true

    - match_re:
        alertname: (^Custom.*)$
      repeat_interval: 365d
      receiver: custom_router
      continue: true


templates:
- '/etc/alertmanager/cloud.tpml'


receivers:
- name: cloud_router
  email_configs:
  - to: "{{ .GroupLabels.email_to }}"
    from: "[email protected]"
    headers:
      subject: "{{ .CommonAnnotations.summary }}"
    html: '{{ template "email.cloud.html" . }}'
    send_resolved: true

- name: email_router
  email_configs:
  - to: "{{ .GroupLabels.email_to2 }}"
    from: "[email protected]"
    headers:
      subject: "{{ .CommonAnnotations.summary }}"
    html: '{{ template "email.cloud.html" . }}'
    send_resolved: true


- name: custom_router
  email_configs:
  - to: "{{ .GroupLabels.email_to }}"
    from: "[email protected]"
    headers:
      subject: "{{ .CommonAnnotations.summary }}"
    html: '{{ template "email.cloud.html" . }}'
    send_resolved: true

  • Logs:
May 27 14:25:44 hostname alertmanager[3950142]: level=debug ts=2022-05-27T14:25:44.084Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:(^Server.*)$)$\"}:{email_to2=\"\\\"\\\"\", email_to=\"[email protected]\", instance=\"myserver.com:9182\"}" msg=flushing alerts="[Server DOWN[645d1c2][active]]"

May 27 14:25:44 hostname alertmanager[3950142]: level=error ts=2022-05-27T14:25:44.564Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="email_router/email[0]: notify retry canceled due to unrecoverable error after 1 attempts: parse 'to' addresses: mail: missing '@' or angle-addr; email_router/email[1]: notify retry canceled due to unrecoverable error after 1 attempts: parse 'to' addresses: mail: no address"

May 27 14:25:44 hostname alertmanager[3950142]: level=debug ts=2022-05-27T14:25:44.698Z caller=notify.go:685 component=dispatcher receiver=cloud_router integration=email[0] msg="Notify success" attempts=1

The last log line should not occur: the server has been down for ~10 days, so the 365d repeat_interval should still be suppressing notifications.

@shadow9911
Author

Alright, I figured it out. For anyone struggling with a repeat_interval longer than a few days: this breaks because of the --data.retention parameter of Alertmanager. The default data.retention is 120h, i.e. 5 days. In my case, Alertmanager "forgets" the repeat_interval because an alert that is already FIRING gets notified again once the retention period has elapsed. The root of the confusion is that data.retention is poorly documented. I know there is a warning, but it says nothing about what to expect and is easy to miss. I highly recommend adding information about data.retention wherever repeat_interval is discussed in the Prometheus/Alertmanager documentation. It will save people a lot of trouble.

Ref: #1806
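For anyone who wants to apply this, a minimal sketch of the workaround: start Alertmanager with a data retention at least as long as the longest repeat_interval. The --config.file path below is an assumption for illustration; only the --data.retention flag and its 120h default come from the issue above.

alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --data.retention=8760h

Here 8760h corresponds to 365 days. Depending on the Alertmanager version, the flag may only accept plain Go durations (no "d" suffix), so expressing the value in hours is the safe choice.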
