Add runbooks for API response times
krysal committed Sep 8, 2023
1 parent db9a1b0 commit 327e443
Showing 3 changed files with 78 additions and 0 deletions.
@@ -0,0 +1,38 @@
# Run Book: API Production Average Response Time above threshold

```{admonition} Metadata
Status: **Unstable**
Maintainer: @krysaldb
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Production+Average+Response+Time+under+threshold?>
```

## Severity Guide

To identify the source of the slowdown, first check whether a recent deployment
may have introduced the problem. If so, roll back to the previous version.
Otherwise, check the following, in order:

1. Check request count and general network activity. If abnormally high, refer
to the [traffic analysis run book][traffic_runbook] to identify whether there
is malicious traffic. If not, move on.
2. Check if dependencies like Elasticsearch or the database are constrained. If
stable, move on.
3. Parse query parameters from Nginx logs and check pagination and parameter
count activity for abnormal or unexpected behaviour. If any exist, decide
whether it is malicious or expected.

[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
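
Step 3 of the list above can be approximated with a short script. The following is a minimal sketch, not the maintainers' tooling: it assumes access logs in Nginx's default `combined` format and a pagination parameter named `page`, and the function name `summarize_query_params` is hypothetical.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the request portion of a log line, e.g.
# "GET /v1/images/?q=cat&page=3 HTTP/1.1".
# Assumption: logs use Nginx's default "combined" format.
REQUEST_RE = re.compile(
    r'"(?:GET|POST|HEAD|PUT|DELETE|OPTIONS|PATCH) (\S+) HTTP/[\d.]+"'
)


def summarize_query_params(lines, page_param="page"):
    """Return (parameter-count histogram, page-value histogram).

    `page_param` is an assumed name for the pagination parameter.
    """
    param_counts = Counter()
    page_values = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like requests
        query = urlsplit(match.group(1)).query
        params = parse_qs(query, keep_blank_values=True)
        # Number of distinct query parameters on this request.
        param_counts[len(params)] += 1
        for value in params.get(page_param, []):
            page_values[value] += 1
    return param_counts, page_values
```

Passing an open log file (or any iterable of lines) returns two counters. Unusually deep page values or requests with many parameters stand out in their `most_common()` output, which can inform the decision on whether the traffic is malicious or expected.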

## Historical false positives

Nothing registered to date.

## Related incident reports

- 2023-09-01 at 18:10 UTC: Increased API response times over filtered image
index recreation
- 2023-08-20 at 20:00 UTC: Increased API response times (reason unknown)
@@ -0,0 +1,38 @@
# Run Book: API Production Average Response Time above threshold

```{admonition} Metadata
Status: **Unstable**
Maintainer: @krysaldb
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Production+Average+Response+Time+under+threshold?>
```

## Severity Guide

To identify the source of the slowdown, first check whether a recent deployment
may have introduced the problem. If so, roll back to the previous version.
Otherwise, check the following, in order:

1. Check request count and general network activity. If abnormally high, refer
to the [traffic analysis run book][traffic_runbook] to identify whether there
is malicious traffic. If not, move on.
2. Check if dependencies like Elasticsearch or the database are constrained. If
stable, move on.
3. Parse query parameters from Nginx logs and check pagination and parameter
count activity for abnormal or unexpected behaviour. If any exist, decide
whether it is malicious or expected.

[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
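
For step 1 of the list above, request volume can be estimated directly from access logs when dashboards are unavailable. This is a rough sketch under stated assumptions: log lines carry Nginx's default `[day/month/year:HH:MM:SS zone]` timestamp, and "abnormally high" is approximated as a multiple of the median per-minute count. The function names and the `factor` default are illustrative, not an established threshold.

```python
import re
import statistics
from collections import Counter

# Captures the timestamp down to the minute from
# "[08/Sep/2023:12:00:01 +0000]".
# Assumption: Nginx's default $time_local format.
TIMESTAMP_RE = re.compile(
    r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]"
)


def requests_per_minute(lines):
    """Count log lines per minute-resolution timestamp."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def abnormal_minutes(counts, factor=3.0):
    """Return minutes whose count exceeds `factor` times the median.

    `factor` is an arbitrary illustrative threshold.
    """
    if not counts:
        return []
    median = statistics.median(counts.values())
    return sorted(m for m, c in counts.items() if c > factor * median)
```

A median baseline is used instead of the mean so that a single burst does not raise the baseline it is compared against.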

## Historical false positives

Nothing registered to date.

## Related incident reports

- 2023-09-01 at 18:10 UTC: Increased API response times over filtered image
index recreation
- 2023-08-20 at 20:00 UTC: Increased API response times (reason unknown)
2 changes: 2 additions & 0 deletions documentation/meta/monitoring/runbooks/index.md
@@ -13,4 +13,6 @@ that can be a good resource when writing a new one.
:titlesonly:
unhealthy_ecs_hosts
api_avg_response_time_above_threshold
api_p99_response_time_above_threshold
```
