ROX-17083: Make PagerDuty the default receiver #1039

Merged: 2 commits merged into main from rox-17083-pd on May 19, 2023
Conversation

kylape (Contributor) commented May 17, 2023

Description

This is part of a larger effort to stop treating stage alerts as critical alerts in PagerDuty. The specific changes being made in this PR:

  • Pass the severity parameter in the PagerDuty receiver configuration. This will be used in production to capture non-critical production events in PagerDuty without triggering a critical notification to on-call engineers. (These events are currently dropped in Alertmanager.)
  • Switch to the PagerDuty Events API v2, which is required for PagerDuty to recognize the severity parameter. This is done by switching from service_key to routing_key.
  • Update the key in terraform_cluster.sh (and thus AWS Secrets Manager) to reflect the routing key change.
  • Make PagerDuty the default receiver in the Alertmanager config. This will send non-critical alerts to PagerDuty, treating it as a hub for all events across all data plane clusters (a sketch of the resulting config follows below).

The PagerDuty routing key for the stage environment is expected to differ from the one used in production, so that the stage PagerDuty service can be configured to force all incidents to low priority and avoid paging on-call engineers.
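For illustration, the rendered alertmanager.yaml would look roughly like the sketch below; the route block and the placeholder routing key are illustrative, only the receiver name and the pagerduty_configs fields reflect this PR:

```yaml
# Sketch of the rendered alertmanager.yaml; the route block and the placeholder routing
# key are illustrative, only the receiver name and pagerduty_configs fields reflect this PR.
route:
  receiver: managed-rhacs-pagerduty          # PagerDuty becomes the default receiver
receivers:
  - name: managed-rhacs-pagerduty
    pagerduty_configs:
      # Events API v2: routing_key replaces the v1 service_key, and severity is taken
      # from the alert group's severity label, falling back to "info".
      - routing_key: "<routing key stored in AWS Secrets Manager>"
        severity: '{{ or .GroupLabels.severity "info" }}'
```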

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Documentation added if necessary (e.g. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security- and business-related topics privately. Will move any such topics that arise to a private communication channel.
  • Add secret to app-interface Vault or Secrets Manager if necessary
  • RDS changes were e2e tested manually
  • Check AWS limits are reasonable for changes provisioning new resources

Test manual

  • Get the routing_key.
  • Set the PD severity template string: {{ or .GroupLabels.severity "info" }} (the rendered form of the Helm expression).
  • Render the Helm chart and extract alertmanager.yaml.
  • Fire up a test Flask app (Prometheus exporter).
  • Configure the alert with a non-critical severity (see the example rule after this list).
  • Fire up Alertmanager and Prometheus locally.
  • Trigger the alert with a curl command.
  • This should create an incident in PD and send to Slack, but NOT notify me personally.
  • Let it resolve.
  • Configure the alert with a critical severity.
  • Trigger the alert with a curl command.
  • This should page me as well as create a Slack message (this won't normally happen once the notification config in the stage PD service is updated).
  • Let it resolve.
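For reference, a hypothetical Prometheus alerting rule along the lines of the one used in this test; the metric, expression, and names are placeholders, only the severity label value matters for the PagerDuty severity template:

```yaml
# Hypothetical test rule; the metric and names are placeholders for the local test setup.
groups:
  - name: pagerduty-severity-test
    rules:
      - alert: TestNonCriticalAlert
        expr: test_app_requests_total > 0        # metric exported by the local test Flask app
        for: 1m
        labels:
          severity: info                         # picked up by '{{ or .GroupLabels.severity "info" }}'
        annotations:
          summary: "Test alert that should open a low-urgency PagerDuty incident"
```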

kylape temporarily deployed to development May 17, 2023 18:33 with GitHub Actions (inactive)
kylape (Contributor, Author) commented May 17, 2023

/retest

kylape (Contributor, Author) commented May 17, 2023

I went ahead and added the routing key to the Secrets Manager for stage (but not for prod).

The change under review (PagerDuty receiver in the Helm-templated Alertmanager config):

```diff
 - name: managed-rhacs-pagerduty
   pagerduty_configs:
-  - service_key: {{ .Values.pagerduty.key | quote }}
+  - routing_key: {{ .Values.pagerduty.key | quote }}
+    severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"
```
stehessel (Contributor) commented May 19, 2023

Is this even correct YAML? It looks like some " are not correctly escaped here. Can you maybe explain what this is supposed to do? How are info alerts handled? I guess this is what you get when you have to merge YAML, Helm and PagerDuty templating... 😅

kylape (Contributor, Author) replied

> I guess this is what you get when you have to merge YAML, Helm and PagerDuty templating

Exactly :)

Honestly I thought the quotes inside the quotes wouldn't work, but it worked when I tested it. Let me try to explain this line in plain English.

I want the severity to be based on the severity label coming from the alert itself. If there is no severity label common to the group of alerts, then default to info. That looks like `or .GroupLabels.severity "info"` in Go templating. To properly escape for Helm templating, I needed to have the Helm templating engine output the literal string `{{`, since Alertmanager's templating syntax is the same as Helm's. To do that, I use the expression `"{{"` inside the double-bracket syntax for evaluating Go template expressions. Thus `{{ "{{" }}`.
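To illustrate the rendering chain described above (a sketch; the field name comes from the diff earlier in this thread, the commentary is mine):

```yaml
# Line in the Helm chart template:
#   severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"
#
# Helm evaluates {{ "{{" }} to the literal characters {{ and passes the rest through,
# so the rendered alertmanager.yaml contains:
severity: "{{ or .GroupLabels.severity \"info\" }}"
# YAML unescaping turns the value into the string {{ or .GroupLabels.severity "info" }},
# which Alertmanager evaluates at notification time: the group's severity label if present,
# otherwise the literal "info".
```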

stehessel (Contributor) replied

Alright, thanks for the explanation. Can you please put this explanation as a comment above this line?

stehessel (Contributor) commented

I think we have to notify on-call engineers + SRE to set up their high/low urgency notification settings correctly. Otherwise it might have the opposite effect of more pages over the weekend, when someone has low urgency linked to their phone and we now send alerts for warning/info.

kylape (Contributor, Author) commented May 19, 2023

That's true for the production environment. For stage, the schedule is just me, so no one will get paged for those. For this weekend, I can contact the on-call engineer to have them set up the notifications correctly before the weekend starts.

stehessel (Contributor) commented

I guess it'll only be rolled out for stage anyway, as we would need a new release to roll out the prod changes.

stehessel (Contributor) left a review comment May 19, 2023

LGTM modulo putting the severity template explanation as a comment.

openshift-ci bot commented May 19, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kylape, stehessel
Once this PR has been reviewed and has the lgtm label, please assign kurlov for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot removed the lgtm label May 19, 2023

openshift-ci bot commented May 19, 2023
New changes are detected. LGTM label has been removed.

kylape temporarily deployed to development May 19, 2023 16:00 with GitHub Actions (inactive)
kylape temporarily deployed to development May 19, 2023 17:04 with GitHub Actions (inactive)
kylape temporarily deployed to development May 19, 2023 17:27 with GitHub Actions (inactive)
kylape temporarily deployed to development May 19, 2023 17:59 with GitHub Actions (inactive)
kylape merged commit 20d5a19 into main May 19, 2023
kylape deleted the rox-17083-pd branch May 19, 2023 19:32