ROX-17083: Make PagerDuty the default receiver (#1039)
* ROX-17083: Make PagerDuty the default receiver

This is part of a larger effort to stop treating stage alerts as
critical alerts in PagerDuty.  The specific changes being made in this
PR:

* Pass the `severity` parameter in the PagerDuty receiver configuration.
  This will be used in production to capture non-critical production
  events in PagerDuty without triggering a critical notification to
  on-call engineers.  (These events are currently dropped in
  Alertmanager.)
* Switch to the PagerDuty Events API v2 by changing `service_key` to
  `routing_key`.  This is required for PD to recognize the `severity`
  parameter.
* Update the key in `terraform_cluster.sh` (and thus AWS Secrets
  Manager) to reflect the routing key change.
* Make PagerDuty the default receiver in the Alertmanager config.  This
  will send non-critical alerts to PagerDuty, treating it as a hub for
  all events across all data plane clusters.
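
To make the `severity` and `routing_key` changes concrete, here is a
minimal shell sketch of the kind of Events API v2 request body that
carries a severity.  The helper names (`pd_severity`,
`pd_event_payload`) and the example key, summary, and source are
hypothetical; Alertmanager builds and sends the real request itself.
The `info` fallback mirrors the default used in the template in this
commit.  The sketch only prints the JSON body and sends nothing.

```shell
#!/bin/sh
# Hypothetical helper: fall back to "info" when an alert group carries no
# severity label.
pd_severity() {
  printf '%s' "${1:-info}"
}

# Hypothetical helper: build an Events API v2 "trigger" body.
# Arguments: routing key, summary, source, severity label (may be empty).
# Values are inserted verbatim; this sketch does no JSON escaping.
pd_event_payload() {
  printf '{"routing_key":"%s","event_action":"trigger","payload":{"summary":"%s","source":"%s","severity":"%s"}}' \
    "$1" "$2" "$3" "$(pd_severity "$4")"
}

# Print the body only.  Sending it would be a POST to
# https://events.pagerduty.com/v2/enqueue.
pd_event_payload "EXAMPLE_ROUTING_KEY" "example alert" "example-cluster" ""
echo
```

With an empty severity label the printed body ends in
`"severity":"info"`, which is the case that lets a non-critical event be
recorded without being treated as critical.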

This change assumes that the PagerDuty routing key for the stage
environment differs from the one in production.  That way the stage
service can be configured to treat all incidents as low priority and
avoid paging on-call engineers.

* Add comment to explain confusing yaml
kylape authored May 19, 2023
1 parent c3b8f5c commit 20d5a19
Showing 2 changed files with 19 additions and 8 deletions.
@@ -8,23 +8,34 @@ stringData:
global:
resolve_timeout: 5m
route:
receiver: default-receiver
receiver: managed-rhacs-pagerduty
repeat_interval: 12h
routes:
- receiver: managed-rhacs-pagerduty
match:
observability: managed-rhacs
severity: critical
- receiver: managed-rhacs-deadmanssnitch
repeat_interval: 5m
match:
alertname: DeadMansSwitch
observability: managed-rhacs
receivers:
- name: default-receiver
- name: managed-rhacs-pagerduty
pagerduty_configs:
- service_key: {{ .Values.pagerduty.key | quote }}
- routing_key: {{ .Values.pagerduty.key | quote }}
{{- /*
We want the severity to be based on the severity label coming from the
alert itself. If there is no severity label common to the group of
alerts, then default to info. That looks like:
`or .GroupLabels.severity "info"`
in Go templating. To properly escape for Helm templating, the Helm
templating engine needs to output the literal string `{{`, since
Alertmanager templating syntax is the same as Helm. To do that,
the expression `{{` is used inside the double bracket syntax for
evaluating Go template expressions. Thus: `{{ "{{" }}`.
The inner double quotes work because Helm evaluates the expression
that includes the inner double quotes before the document is parsed
as yaml.
*/}}
severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"
- name: managed-rhacs-deadmanssnitch
webhook_configs:
- url: {{ .Values.deadMansSwitch.url | quote }}
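
The escaping described in the template comment above can be traced step
by step.  The following is a shell simulation only: `sed` stands in for
the Helm template engine and for the YAML parser, and the function names
`helm_stage` and `yaml_stage` are hypothetical.  It shows what the
`severity` line looks like after each stage, not how either tool is
implemented.

```shell
#!/bin/sh
# The line as written in the Helm chart (single quotes keep it literal).
chart_line='severity: "{{ "{{" }} or .GroupLabels.severity \"info\" }}"'

# Stage 1 (stand-in for Helm): the action {{ "{{" }} prints a literal "{{".
helm_stage() {
  printf '%s\n' "$1" | sed 's/{{ "{{" }}/{{/'
}

# Stage 2 (stand-in for the YAML parser): drop the key and the outer double
# quotes, then turn each \" into a plain ".
yaml_stage() {
  printf '%s\n' "$1" | sed -e 's/^severity: "\(.*\)"$/\1/' -e 's/\\"/"/g'
}

rendered=$(helm_stage "$chart_line")
template=$(yaml_stage "$rendered")

printf 'after Helm: %s\n' "$rendered"
printf 'after YAML: %s\n' "$template"
```

The string left after the second stage is the action that Alertmanager
evaluates per alert group.  Go's `or` returns its first non-empty
argument, so a group without a common `severity` label yields `info`.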
2 changes: 1 addition & 1 deletion dp-terraform/helm/rhacs-terraform/terraform_cluster.sh
@@ -156,7 +156,7 @@ invoke_helm "${SCRIPT_DIR}" rhacs-terraform \
--set observability.observatorium.gateway="${OBSERVABILITY_OBSERVATORIUM_GATEWAY}" \
--set observability.observatorium.metricsClientId="${OBSERVABILITY_OBSERVATORIUM_METRICS_CLIENT_ID}" \
--set observability.observatorium.metricsSecret="${OBSERVABILITY_OBSERVATORIUM_METRICS_SECRET}" \
- --set observability.pagerduty.key="${OBSERVABILITY_PAGERDUTY_SERVICE_KEY}" \
+ --set observability.pagerduty.key="${OBSERVABILITY_PAGERDUTY_ROUTING_KEY}" \
--set observability.deadMansSwitch.url="${OBSERVABILITY_DEAD_MANS_SWITCH_URL}"

# To uninstall an existing release:
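
Because the variable was renamed from
`OBSERVABILITY_PAGERDUTY_SERVICE_KEY` to
`OBSERVABILITY_PAGERDUTY_ROUTING_KEY`, an environment that still sets
only the old name would pass an empty key to Helm.  A minimal guard
against that, assuming a hypothetical function `require_routing_key`
that is not part of `terraform_cluster.sh`, could look like this:

```shell
#!/bin/sh
# Hypothetical guard, not part of terraform_cluster.sh: report a missing or
# empty routing key before any Helm invocation.
require_routing_key() {
  if [ -z "${OBSERVABILITY_PAGERDUTY_ROUTING_KEY:-}" ]; then
    echo "OBSERVABILITY_PAGERDUTY_ROUTING_KEY is not set" >&2
    return 1
  fi
}
```

A script would call it as `require_routing_key || exit 1` before
running Helm.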