From 327e443c7a754c647359f81bfde130040888560f Mon Sep 17 00:00:00 2001
From: Krystle Salazar
Date: Fri, 8 Sep 2023 16:04:03 -0400
Subject: [PATCH] Add runbooks for API response times

---
 .../api_avg_response_time_above_threshold.md  | 38 +++++++++++++++++++
 .../api_p99_response_time_above_threshold.md  | 38 +++++++++++++++++++
 .../meta/monitoring/runbooks/index.md         |  2 +
 3 files changed, 78 insertions(+)
 create mode 100644 documentation/meta/monitoring/runbooks/api_avg_response_time_above_threshold.md
 create mode 100644 documentation/meta/monitoring/runbooks/api_p99_response_time_above_threshold.md

diff --git a/documentation/meta/monitoring/runbooks/api_avg_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_avg_response_time_above_threshold.md
new file mode 100644
index 00000000000..0dc34e8e2bf
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_avg_response_time_above_threshold.md
@@ -0,0 +1,38 @@
+# Run Book: API Production Average Response Time above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+
+Maintainer: @krysaldb
+
+Alarm link:
+-
+```
+
+## Severity Guide
+
+To identify the source of the slowdown, first check whether a recent
+deployment may have introduced the problem; if so, roll back to the previous
+version. Otherwise, check the following, in order:
+
+1. Check request count and general network activity. If abnormally high, refer
+   to the [traffic analysis run book][traffic_runbook] to identify whether
+   there is malicious traffic. If not, move on.
+2. Check if dependencies like Elasticsearch or the database are constrained. If
+   stable, move on.
+3. Parse query parameters from Nginx logs and check pagination and parameter
+   count activity for abnormal or unexpected behaviour. If you find any, decide
+   whether it is malicious or expected.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-01 at 18:10 UTC: Increased API response times over filtered image
+  index recreation
+- 2023-08-20 at 20:00 UTC: Increased API response times (reason unknown)
diff --git a/documentation/meta/monitoring/runbooks/api_p99_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_p99_response_time_above_threshold.md
new file mode 100644
index 00000000000..0dc34e8e2bf
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_p99_response_time_above_threshold.md
@@ -0,0 +1,38 @@
+# Run Book: API Production P99 Response Time above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+
+Maintainer: @krysaldb
+
+Alarm link:
+-
+```
+
+## Severity Guide
+
+To identify the source of the slowdown, first check whether a recent
+deployment may have introduced the problem; if so, roll back to the previous
+version. Otherwise, check the following, in order:
+
+1. Check request count and general network activity. If abnormally high, refer
+   to the [traffic analysis run book][traffic_runbook] to identify whether
+   there is malicious traffic. If not, move on.
+2. Check if dependencies like Elasticsearch or the database are constrained. If
+   stable, move on.
+3. Parse query parameters from Nginx logs and check pagination and parameter
+   count activity for abnormal or unexpected behaviour. If you find any, decide
+   whether it is malicious or expected.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-01 at 18:10 UTC: Increased API response times over filtered image
+  index recreation
+- 2023-08-20 at 20:00 UTC: Increased API response times (reason unknown)
diff --git a/documentation/meta/monitoring/runbooks/index.md b/documentation/meta/monitoring/runbooks/index.md
index 0fa0d689139..82a612b9568 100644
--- a/documentation/meta/monitoring/runbooks/index.md
+++ b/documentation/meta/monitoring/runbooks/index.md
@@ -13,4 +13,6 @@ that can be a good resource when writing a new one.
 :titlesonly:
 
 unhealthy_ecs_hosts
+api_avg_response_time_above_threshold
+api_p99_response_time_above_threshold
 ```
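
Note on step 3 of both runbooks: the patch does not say how to parse query parameters from the Nginx logs, so here is one possible sketch rather than the project's actual tooling. It assumes the logs use Nginx's default `combined` format and that pagination is expressed through a `page` query parameter. Both are assumptions, and `summarize_query_params` is a hypothetical helper name.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Assumption: Nginx "combined" log format, where the request line is a
# double-quoted field such as "GET /v1/images/?q=cat&page=3 HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"')


def summarize_query_params(log_lines):
    """Tally requests by `page` value and by number of query parameters.

    Returns a pair of Counters: (page value -> requests,
    parameter count -> requests).
    """
    pages = Counter()
    param_counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # Not a recognisable request line; skip it.
        query = urlsplit(match.group("target")).query
        params = parse_qsl(query, keep_blank_values=True)
        param_counts[len(params)] += 1
        for name, value in params:
            # Assumption: the pagination parameter is called "page".
            if name == "page":
                pages[value] += 1
    return pages, param_counts
```

Passing an open access log file (or any iterable of lines) to the function returns the two tallies. Unusually deep `page` values or requests carrying many parameters would be the kind of activity step 3 asks to inspect.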