WordPress · dhruvkb · Sep 14, 2023 · Sep 4, 2023 · Sep 4, 2023 · Sep 7, 2023
@@ -13,4 +13,7 @@ that can be a good resource when writing a new one.
 :titlesonly:
 
 unhealthy_ecs_hosts
+nuxt_2xx_under_threshold
+nuxt_5xx_above_threshold
+nuxt_request_count
 ```
@@ -0,0 +1,33 @@
+# Run Book: Nuxt 2XX request count under threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+
+Maintainer: @dhruvkb
+
+Alarm link:
+- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+HTTP+2XX+responses+count+under+threshold)
+```
+
+## Severity guide
+
+Confirm there is not an outage.
+
+Check if the overall request count has decreased as well (this can be confirmed
+via the
+[CloudWatch dashboard](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards/dashboard/ECS-Production-Dashboard)
+or in Cloudflare).
+
+- If the overall requests have decreased, the severity is low, but you should
+  still investigate why usage has dropped below the usual level.
+- If the overall requests have not decreased, a large number of those requests
+  must be returning non-2XX responses, which is high severity. Further
+  investigation is warranted to determine the cause for the non-2XX responses.
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+Nothing registered to date.
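
Review note: the severity guide in this runbook boils down to one comparison, which could be sketched as below. This is only an illustration, not part of the runbook or of any alarm tooling. The function name, the `tolerance` parameter and its default of 20% are hypothetical; the runbook does not define how large a drop counts as "decreased".

```python
def classify_2xx_drop(total_now: int, total_baseline: int, tolerance: float = 0.2) -> str:
    """Return "low" or "high" severity for a 2XX-under-threshold alarm.

    total_now:      overall request count in the alarm window (all statuses).
    total_baseline: usual overall request count for a comparable window.
    tolerance:      hypothetical fraction below baseline that counts as a
                    real decrease in traffic (assumption, not from the runbook).
    """
    if total_baseline <= 0:
        raise ValueError("total_baseline must be positive")

    # Overall traffic fell too: fewer 2XX responses simply reflect lower usage.
    overall_decreased = total_now < total_baseline * (1 - tolerance)
    if overall_decreased:
        return "low"

    # Traffic is steady, so many requests must be returning non-2XX responses.
    return "high"
```

For example, `classify_2xx_drop(500, 1000)` falls in the low-severity branch because overall traffic halved, while `classify_2xx_drop(950, 1000)` is high severity because traffic is roughly unchanged.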
@@ -0,0 +1,41 @@
+# Run Book: Nuxt 5XX request count above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+
+Maintainer: @dhruvkb
+
+Alarm link:
+- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+HTTP+5XX+responses+count+over+threshold)
+```
+
+## Severity guide
+
+Confirm there is not an outage.
+
+Check if the connection to the API from Nuxt has been broken, which can result
+in Nuxt returning 5XX errors.
+
+If the connection is present and working, try to determine the source of the 5XX
+errors (this can be checked by observing paths in the Cloudflare logs).
+
+- If the API requests are returning 2XX responses, the severity is low, but you
+  should still investigate the source of the 5XX errors, which could be an
+  external service like Plausible.
+- If the API requests are returning 5XX responses, the severity is high. Further
+  investigation into the API side is warranted to determine the cause for the
+  5XX responses. Also refer to the
+  [API 5XX runbook](/meta/monitoring/runbooks/index.md).
+
+<!-- TODO: Update link to /meta/monitoring/runbooks/api_5xx_above_threshold.md -->
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- _2023-08-28, 12:06 to 12:24 UTC_:
+
+  5XX responses spiked to ~591 due to Plausible degradation. This was not
+  detrimental to UX.
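
Review note: the two severity branches in this runbook could be sketched as below, assuming you have already sampled the status codes of recent API requests (for example from the Cloudflare logs). The function name and its input are hypothetical. The sketch deliberately does not model the "connection to the API is broken" case, because the runbook does not assign a severity to it.

```python
def classify_5xx_alarm(api_statuses: list[int]) -> str:
    """Return "low" or "high" severity for a Nuxt 5XX-above-threshold alarm.

    api_statuses: HTTP status codes sampled from recent API requests
                  (hypothetical input; how they are collected is up to you).
    """
    if not api_statuses:
        raise ValueError("need at least one sampled API status code")

    # Any 5XX from the API points at the API side: high severity.
    if any(500 <= status <= 599 for status in api_statuses):
        return "high"

    # The API looks healthy, so the 5XX errors come from elsewhere,
    # possibly an external service such as Plausible: low severity.
    return "low"
```

Treating a single API 5XX as high severity is a strict reading of the guide; a real check would probably use a proportion instead, which is left out here because the runbook gives no figure.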
@@ -0,0 +1,38 @@
+# Run Book: Nuxt request count above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+
+Maintainer: @dhruvkb
+
+Alarm link:
+- [production-nuxt](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/Nuxt+Production+request+count+above+threshold)
+```
+
+## Severity guide
+
+[Identify traffic anomalies](/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md)
+in Cloudflare to determine if the increase is organic or due to a botnet.
+
+- If the increase is organic, we must update our baseline expectation of our
+  services' usage. The alarm thresholds should be updated if our services see
+  higher usage frequently and consistently.
+- If the increase is a botnet attack, we need to block these agents to restore
+  usage to the usual level.
+
+We also need to verify that the requests are being handled properly and that our
+services are capable of meeting this demand (this can be observed from the CPU
+and memory metrics in the ECS dashboards in CloudWatch).
+
+- If our infra can handle the load, the only action needed is to keep
+  monitoring that resource usage stays within reasonable limits.
+- If our infra cannot handle the load, we must scale our services by increasing
+  capacity or adding more instances.
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+Nothing registered to date.
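
Review note: this runbook asks two independent questions (is the traffic a botnet, and can the infrastructure cope), each mapping to one action. A minimal sketch of that mapping follows. The function name, the boolean inputs and the action strings are hypothetical and only paraphrase the severity guide.

```python
def plan_response(is_botnet: bool, infra_can_handle_load: bool) -> list[str]:
    """Map the two runbook questions to the actions the guide describes."""
    actions = []

    # Question 1: is the increase organic or caused by a botnet?
    if is_botnet:
        actions.append("block offending agents")
    else:
        actions.append("update baseline and review alarm thresholds")

    # Question 2: do CPU and memory metrics show the services coping?
    if infra_can_handle_load:
        actions.append("keep monitoring CPU and memory")
    else:
        actions.append("scale services")

    return actions
```

The two checks are kept separate because the guide treats them separately: a botnet attack on healthy infrastructure still needs blocking, and organic growth on saturated infrastructure still needs scaling.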