ROX-17083: Make PagerDuty the default receiver #1039

Merged: 2 commits merged into main from rox-17083-pd on May 19, 2023
Conversation

kylape (Contributor) commented May 17, 2023

Description

This is part of a larger effort to stop treating stage alerts as critical alerts in PagerDuty. The specific changes being made in this PR:

  • Pass the severity parameter in the PagerDuty receiver configuration. This will be used in production to capture non-critical production events in PagerDuty without triggering a critical notification to on-call engineers. (These events are currently dropped in Alertmanager.)
  • Switch to the PagerDuty Events API v2, which is required for PagerDuty to recognize the severity parameter. This is done by switching from service_key to routing_key.
  • Update the key in terraform_cluster.sh (and thus AWS Secrets Manager) to reflect the routing key change.
  • Make PagerDuty the default receiver in the Alertmanager config. This will send non-critical alerts to PagerDuty, treating it as a hub for all events across all data plane clusters (a sketch of the resulting config follows below).

The PagerDuty routing key for the stage environment is expected to differ from the one used in production, so that the stage PagerDuty service can be configured to force all incidents to low priority and avoid paging on-call engineers.
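For illustration, the rendered alertmanager.yaml would look roughly like the sketch below; the route block and the placeholder routing key are illustrative, only the receiver name and the pagerduty_configs fields reflect this PR:

```yaml
# Sketch of the rendered alertmanager.yaml; the route block and the placeholder routing
# key are illustrative, only the receiver name and pagerduty_configs fields reflect this PR.
route:
  receiver: managed-rhacs-pagerduty          # PagerDuty becomes the default receiver
receivers:
  - name: managed-rhacs-pagerduty
    pagerduty_configs:
      # Events API v2: routing_key replaces the v1 service_key, and severity is taken
      # from the alert group's severity label, falling back to "info".
      - routing_key: "<routing key stored in AWS Secrets Manager>"
        severity: '{{ or .GroupLabels.severity "info" }}'
```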

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Documentation added if necessary (e.g. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security- and business-related topics privately. Will move any such topics that arise to a private communication channel.
  • Add secret to app-interface Vault or Secrets Manager if necessary
  • RDS changes were e2e tested manually
  • Check AWS limits are reasonable for changes provisioning new resources

Test manual

  • Get the routing_key.
  • Set the PD severity template string: {{ or .GroupLabels.severity "info" }} (the rendered form of the Helm expression).
  • Render the Helm chart and extract alertmanager.yaml.
  • Fire up a test Flask app (Prometheus exporter).
  • Configure the alert with a non-critical severity (see the example rule after this list).
  • Fire up Alertmanager and Prometheus locally.
  • Trigger the alert with a curl command.
  • This should create an incident in PD and send to Slack, but NOT notify me personally.
  • Let it resolve.
  • Configure the alert with a critical severity.
  • Trigger the alert with a curl command.
  • This should page me as well as create a Slack message (this won't normally happen once the notification config in the stage PD service is updated).
  • Let it resolve.
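For reference, a hypothetical Prometheus alerting rule along the lines of the one used in this test; the metric, expression, and names are placeholders, only the severity label value matters for the PagerDuty severity template:

```yaml
# Hypothetical test rule; the metric and names are placeholders for the local test setup.
groups:
  - name: pagerduty-severity-test
    rules:
      - alert: TestNonCriticalAlert
        expr: test_app_requests_total > 0        # metric exported by the local test Flask app
        for: 1m
        labels:
          severity: info                         # picked up by '{{ or .GroupLabels.severity "info" }}'
        annotations:
          summary: "Test alert that should open a low-urgency PagerDuty incident"
```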

kylape temporarily deployed to development May 17, 2023 18:33 with GitHub Actions (inactive)
kylape (Contributor, Author) commented May 17, 2023

/retest

kylape (Contributor, Author) commented May 17, 2023

I went ahead and added the routing key to the Secrets Manager for stage (but not for prod).

The change under review (PagerDuty receiver in the Helm-templated Alertmanager config):

```diff
 - name: managed-rhacs-pagerduty
   pagerduty_configs:
-  - service_key: {{ .Values.pagerduty.key | quote }}
+  - routing_key: {{ .Values.pagerduty.key | quote }}
+    severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"
```
stehessel (Contributor) commented May 19, 2023

Is this even correct YAML? It looks like some " are not correctly escaped here. Can you maybe explain what this is supposed to do? How are info alerts handled? I guess this is what you get when you have to merge YAML, Helm and PagerDuty templating... 😅

kylape (Contributor, Author) replied

> I guess this is what you get when you have to merge YAML, Helm and PagerDuty templating

Exactly :)

Honestly I thought the quotes inside the quotes wouldn't work, but it worked when I tested it. Let me try to explain this line in plain English.

I want the severity to be based on the severity label coming from the alert itself. If there is no severity label common to the group of alerts, then default to info. That looks like `or .GroupLabels.severity "info"` in Go templating. To properly escape for Helm templating, I needed to have the Helm templating engine output the literal string `{{`, since Alertmanager's templating syntax is the same as Helm's. To do that, I use the expression `"{{"` inside the double-bracket syntax for evaluating Go template expressions. Thus `{{ "{{" }}`.
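To illustrate the rendering chain described above (a sketch; the field name comes from the diff earlier in this thread, the commentary is mine):

```yaml
# Line in the Helm chart template:
#   severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"
#
# Helm evaluates {{ "{{" }} to the literal characters {{ and passes the rest through,
# so the rendered alertmanager.yaml contains:
severity: "{{ or .GroupLabels.severity \"info\" }}"
# YAML unescaping turns the value into the string {{ or .GroupLabels.severity "info" }},
# which Alertmanager evaluates at notification time: the group's severity label if present,
# otherwise the literal "info".
```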

stehessel (Contributor) replied

Alright, thanks for the explanation. Can you please put this explanation as a comment above this line?

stehessel (Contributor) commented

I think we have to notify on-call engineers + SRE to set up their high/low urgency notification settings correctly. Otherwise it might have the opposite effect of more pages over the weekend, when someone has low urgency linked to their phone and we now send alerts for warning/info.

kylape (Contributor, Author) commented May 19, 2023

That's true for the production environment. For stage, the schedule is just me, so no one will get paged for those. For this weekend, I can contact the on-call engineer to have them set up the notifications correctly before the weekend starts.

stehessel (Contributor) commented

I guess it'll only be rolled out for stage anyway, as we would need a new release to roll out the prod changes.

stehessel (Contributor) left a review comment May 19, 2023

LGTM modulo putting the severity template explanation as a comment.

openshift-ci bot commented May 19, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kylape, stehessel
Once this PR has been reviewed and has the lgtm label, please assign kurlov for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot removed the lgtm label May 19, 2023

openshift-ci bot commented May 19, 2023
New changes are detected. LGTM label has been removed.

kylape temporarily deployed to development May 19, 2023 16:00 with GitHub Actions (inactive)
kylape temporarily deployed to development May 19, 2023 17:04 with GitHub Actions (inactive)
kylape temporarily deployed to development May 19, 2023 17:27 with GitHub Actions (inactive)
kylape temporarily deployed to development May 19, 2023 17:59 with GitHub Actions (inactive)
kylape merged commit 20d5a19 into main May 19, 2023
kylape deleted the rox-17083-pd branch May 19, 2023 19:32