Initialise runbooks for Nuxt 2XX/5XX alarms #2974

Merged
merged 7 commits into from
Sep 14, 2023
3 changes: 3 additions & 0 deletions documentation/meta/monitoring/runbooks/index.md
@@ -13,4 +13,7 @@ that can be a good resource when writing a new one.
:titlesonly:

unhealthy_ecs_hosts
nuxt_2xx_under_threshold
nuxt_5xx_above_threshold
nuxt_request_count
```
@@ -0,0 +1,33 @@
# Run Book: Nuxt 2XX request count under threshold

```{admonition} Metadata
Status: **Unstable**

Maintainer: @dhruvkb

Alarm link:
- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+HTTP+2XX+responses+count+under+threshold)
```

## Severity guide

Confirm there is not an outage.

Check if the overall request count has decreased as well (this can be confirmed
via the
[CloudWatch dashboard](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards/dashboard/ECS-Production-Dashboard)
or in Cloudflare).

- If the overall requests have also decreased, the severity is low, but you
should still investigate why usage has dropped below the usual level.
- If the overall requests have not decreased, a large number of those requests
must be returning non-2XX responses, which is high severity. Further
investigation is warranted to determine the cause for the non-2XX responses.
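
The decision above can be sketched in code. This is a minimal illustration, not
part of the alarm configuration: the function name, its parameters, and the 20%
`drop_tolerance` are all assumptions chosen for the example.

```python
def classify_2xx_alarm(baseline_total: int, current_total: int,
                       drop_tolerance: float = 0.2) -> str:
    """Return "low" or "high" following the severity guide.

    baseline_total and current_total are overall request counts for a
    comparable period (for example, read off the CloudWatch dashboard).
    drop_tolerance is an assumed value: overall traffic counts as
    "decreased" if it fell by more than this fraction of the baseline.
    """
    if baseline_total <= 0:
        raise ValueError("baseline_total must be positive")
    drop = (baseline_total - current_total) / baseline_total
    if drop > drop_tolerance:
        # Overall traffic fell too, so fewer 2XX responses are expected.
        return "low"
    # Traffic is steady but 2XX responses are down: requests must be
    # returning non-2XX responses.
    return "high"
```

For example, `classify_2xx_alarm(1000, 400)` treats a 60% traffic drop as low
severity, while `classify_2xx_alarm(1000, 950)` flags steady traffic with
missing 2XX responses as high severity.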

## Historical false positives

Nothing registered to date.

## Related incident reports

Nothing registered to date.
@@ -0,0 +1,41 @@
# Run Book: Nuxt 5XX request count above threshold

```{admonition} Metadata
Status: **Unstable**

Maintainer: @dhruvkb

Alarm link:
- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+HTTP+5XX+responses+count+over+threshold)
```

## Severity guide

Confirm there is not an outage.

Check if the connection to the API from Nuxt has been broken, which can result
in Nuxt returning 5XX errors.

If the connection is present and working, try to determine the source of the 5XX
errors (this can be checked by observing paths in the Cloudflare logs).

- If the API requests are returning 2XX responses, the severity is low, but you
should still investigate the source of the 5XX errors, which could be an
external service like Plausible.
- If the API requests are returning 5XX responses, the severity is high. Further
investigation into the API side is warranted to determine the cause for the
5XX responses. Also refer to the
[API 5XX runbook](/meta/monitoring/runbooks/index.md).

<!-- TODO: Update link to /meta/monitoring/runbooks/api_5xx_above_threshold.md -->
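
As a rough sketch, the triage order above could be expressed as follows. The
function name, the `"check-api-connection"` label, and the rule that any single
5XX from the API counts as high severity are assumptions for illustration; the
runbook itself does not define them.

```python
from collections.abc import Iterable


def classify_5xx_alarm(api_reachable: bool,
                       api_status_codes: Iterable[int]) -> str:
    """Map the severity guide to a label (hypothetical helper).

    api_reachable: whether Nuxt can currently connect to the API.
    api_status_codes: a sample of status codes returned by the API.
    """
    if not api_reachable:
        # A broken connection can itself cause Nuxt to return 5XX errors,
        # so it is checked first.
        return "check-api-connection"
    if any(500 <= code <= 599 for code in api_status_codes):
        # The API is failing: investigate the API side.
        return "high"
    # The API is healthy: the 5XX errors come from elsewhere, possibly an
    # external service.
    return "low"
```

With this sketch, `classify_5xx_alarm(True, [200, 503])` returns `"high"` and
`classify_5xx_alarm(True, [200, 200])` returns `"low"`.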

## Historical false positives

Nothing registered to date.

## Related incident reports

- _2023-08-28, 12:06 to 12:24 UTC_:

  5XX responses spiked to ~591 due to Plausible degradation. This was not
  detrimental to UX.
38 changes: 38 additions & 0 deletions documentation/meta/monitoring/runbooks/nuxt_request_count.md
@@ -0,0 +1,38 @@
# Run Book: Nuxt request count above threshold

```{admonition} Metadata
Status: **Unstable**

Maintainer: @dhruvkb

Alarm link:
- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+request+count+above+threshold)
```

## Severity guide

[Identify traffic anomalies](/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md)
in Cloudflare to determine if the increase is organic or due to a botnet.

- If the increase is organic, we must update our baseline expectation of our
services' usage. The alarm thresholds should be updated if our services see
higher usage frequently and consistently.
- If the increase is a botnet attack, we need to block these agents to restore
usage to the usual level.

We also need to verify that the requests are being handled properly and that our
services are capable of meeting this demand (this can be observed from the CPU
and memory metrics in the ECS dashboards in CloudWatch).

- If our infra can handle the load, no action is needed beyond continuing to
monitor that resource usage stays within reasonable limits.
- If our infra cannot handle the load, we must scale our services by increasing
capacity or adding more instances.
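
The two checks in this guide are independent, which a short sketch can make
explicit. Everything named here is hypothetical: the function, its parameters,
and the 80% `limit_percent` are assumptions for illustration, and deciding
whether traffic is a botnet is left to the linked traffic-anomaly runbook.

```python
def request_count_actions(is_botnet: bool, cpu_percent: float,
                          memory_percent: float,
                          limit_percent: float = 80.0) -> list[str]:
    """List follow-up actions for a request count alarm (hypothetical).

    cpu_percent and memory_percent are utilisation figures read from the
    ECS dashboards; limit_percent is an assumed "reasonable limit".
    """
    actions = []
    # Check 1: where does the extra traffic come from?
    if is_botnet:
        actions.append("block offending agents")
    else:
        actions.append("update alarm thresholds if higher usage persists")
    # Check 2: can the services absorb the load?
    if max(cpu_percent, memory_percent) >= limit_percent:
        actions.append("scale services")
    else:
        actions.append("keep monitoring resources")
    return actions
```

For instance, `request_count_actions(True, 92.0, 60.0)` yields both a blocking
action and a scaling action, since a botnet can still overload the services
while it is being blocked.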

## Historical false positives

Nothing registered to date.

## Related incident reports

Nothing registered to date.