ROX-17083: Make PagerDuty the default receiver #1039
Conversation
This is part of a larger effort to stop treating stage alerts as critical alerts in PagerDuty. The specific changes being made in this PR:

* Pass the `severity` parameter in the PagerDuty receiver configuration. This will be used in production to capture non-critical production events in PagerDuty without triggering a critical notification to on-call engineers. (These events are currently dropped in Alertmanager.)
* Switch to the PagerDuty v2 Events API. This is required for PD to recognize the `severity` parameter. This is done by switching from `service_key` to `routing_key`.
* Update the key in `terraform_cluster.sh` (and thus AWS Secrets Manager) to reflect the routing key change.
* Make PagerDuty the default receiver in the Alertmanager config. This will send non-critical alerts to PagerDuty, treating it as a hub for all events across all data plane clusters (see the config sketch below).

It's implied that the PagerDuty routing key for the stage environment will be different from the one in production, so that the stage service can be configured to force all incidents to be considered low priority and avoid paging on-call engineers.
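For orientation, here is a minimal sketch of the Helm-templated Alertmanager config the description amounts to. The `route` block and its placement are assumptions for illustration; only the receiver lines mirror the actual diff in this PR.

```yaml
route:
  # PagerDuty becomes the default receiver: alerts that match no more
  # specific route still produce a PagerDuty event.
  receiver: managed-rhacs-pagerduty

receivers:
  - name: managed-rhacs-pagerduty
    pagerduty_configs:
      # routing_key selects the PagerDuty v2 Events API (service_key was the v1 integration key).
      - routing_key: {{ .Values.pagerduty.key | quote }}
        # Severity taken from the alert group's label, defaulting to "info".
        severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"
```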
/retest
I went ahead and added the routing key to the Secrets Manager for stage (but not for prod).
```diff
 - name: managed-rhacs-pagerduty
   pagerduty_configs:
-  - service_key: {{ .Values.pagerduty.key | quote }}
+  - routing_key: {{ .Values.pagerduty.key | quote }}
+    severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"
```
Is this even correct YAML? Looks like some `"` are not correctly escaped here? Can you maybe explain what this is supposed to do? How are `info` alerts handled? I guess this is what you get when you merge YAML, Helm and PagerDuty templating... 😅
> I guess this is what you get when you merge YAML, Helm and PagerDuty templating

Exactly :)
Honestly I thought the quotes inside the quotes wouldn't work, but it worked when I tested it. Let me try to explain this line in plain English.

I want the severity to be based on the severity label coming from the alert itself. If there is no severity label common to the group of alerts, then default to `info`. That looks like `or .GroupLabels.severity "info"` in Go templating. To properly escape for Helm templating, I needed to have the Helm templating engine output the literal string `{{`, since Alertmanager templating syntax is the same as Helm. To do that, I use the expression `"{{"` inside the double bracket syntax for evaluating Go template expressions. Thus `{{ "{{" }}`.
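Spelled out stage by stage (my reading of the explanation above, not output copied from the chart):

```yaml
# 1. What is committed in the Helm template (see the diff above):
severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"

# 2. What Helm renders into alertmanager.yaml, because {{ "{{" }} emits a literal "{{":
severity: "{{ or .GroupLabels.severity \"info\" }}"

# 3. After YAML parsing removes the escaping, Alertmanager evaluates this template
#    per notification, resolving to the group's `severity` label, or "info" when
#    that label is absent:
#    {{ or .GroupLabels.severity "info" }}
```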
Alright thanks for the explanation. Can you please put this explanation as a comment above this line?
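The requested comment could read roughly as follows; this wording is only a suggestion and is not taken from the committed change.

```yaml
# Severity is taken from the alerts' common `severity` label, falling back to
# "info" when the group has none. Alertmanager uses the same Go template
# delimiters as Helm, so the opening braces are escaped to make Helm emit them
# literally instead of evaluating them itself.
severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"
```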
I think we have to notify on-call engineers + SRE to set up their high/low urgency notification settings correctly. Otherwise it might have the opposite effect of more pages over the weekend, when someone has low urgency linked to their phone and we now send alerts for low severities as well.
That's true for the production environment. For stage, the schedule is just me, so no one will get paged for those. For this weekend, I can contact the on-call engineer to have them set up the notifications correctly before the weekend starts.
I guess it'll only be rolled out for stage anyway, as we would need a new release to roll out the prod changes.
LGTM modulo putting the severity template explanation as a comment.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kylape, stehessel

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
New changes are detected. LGTM label has been removed.
Checklist (Definition of Done)

* Test manual
* ROX-12345: ...

Test manual

* `routing_key`
* `{{ or .GroupLabels.severity 'info' }}`
* `alertmanager.yaml` (after updating notification config in stage PD service)
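For reference, a sketch of the rendered receiver one would expect to find in the stage `alertmanager.yaml` during this manual test; the routing key value here is a placeholder, not the real secret.

```yaml
receivers:
  - name: managed-rhacs-pagerduty
    pagerduty_configs:
      # Placeholder value; the real key comes from AWS Secrets Manager.
      - routing_key: "stage-routing-key-placeholder"
        severity: "{{ or .GroupLabels.severity \"info\" }}"
```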