From 2966a2636aba8982fa6b3a2c3f351638e84721ff Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Thu, 21 Sep 2023 15:03:59 -0700
Subject: [PATCH 1/3] Add thumb response time runbooks

---
 ...nails_avg_response_time_above_threshold.md | 28 +++++++++++++++++++
 ...pi_thumbnails_avg_response_time_anomaly.md | 28 +++++++++++++++++++
 ...nails_p99_response_time_above_threshold.md | 28 +++++++++++++++++++
 ...pi_thumbnails_p99_response_time_anomaly.md | 28 +++++++++++++++++++
 4 files changed, 112 insertions(+)
 create mode 100644 documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
 create mode 100644 documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md
 create mode 100644 documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
 create mode 100644 documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md

diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
new file mode 100644
index 00000000000..825cef056aa
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production Average Response Time above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and rollback to the previous version. If not, check the request count and
+general network activity. If abnormally high, refer to the [traffic analysis run
+book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md
new file mode 100644
index 00000000000..d556118fb5a
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production Average Response Time anomalously high
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and rollback to the previous version. If not, check the request count and
+general network activity. If abnormally high, refer to the [traffic analysis run
+book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
new file mode 100644
index 00000000000..870f9598285
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production P99 Response Time above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and rollback to the previous version. If not, check the request count and
+general network activity. If abnormally high, refer to the [traffic analysis run
+book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md
new file mode 100644
index 00000000000..a3379025039
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production P99 Response Time anomalously high
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and rollback to the previous version. If not, check the request count and
+general network activity. If abnormally high, refer to the [traffic analysis run
+book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts

From 4af521f8fb7b5843680ee684847d885b162a98eb Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Thu, 21 Sep 2023 15:42:37 -0700
Subject: [PATCH 2/3] Add files to index

---
 documentation/meta/monitoring/runbooks/index.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/documentation/meta/monitoring/runbooks/index.md b/documentation/meta/monitoring/runbooks/index.md
index 3c4745f0a52..78f335d9ee4 100644
--- a/documentation/meta/monitoring/runbooks/index.md
+++ b/documentation/meta/monitoring/runbooks/index.md
@@ -19,6 +19,10 @@ api_avg_response_time_above_threshold
 api_avg_response_time_anomaly
 api_p99_response_time_above_threshold
 api_p99_response_time_anomaly
+api_thumbnails_avg_response_time_above_threshold
+api_thumbnails_avg_response_time_anomaly
+api_thumbnails_p99_response_time_above_threshold
+api_thumbnails_p99_response_time_anomaly
 nuxt_request_count
 nuxt_2xx_under_threshold
 nuxt_5xx_above_threshold

From febaf896d91cff229f1d9ac7c5f60eeab944ad27 Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Fri, 22 Sep 2023 10:58:59 -0700
Subject: [PATCH 3/3] Threshold alarms are low severity if not anomalous

---
 ...thumbnails_avg_response_time_above_threshold.md | 14 ++++++++------
 ...thumbnails_p99_response_time_above_threshold.md | 14 ++++++++------
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
index 825cef056aa..b48e6d9fafe 100644
--- a/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
@@ -9,12 +9,14 @@ Alarm link:
 
 ## Severity Guide
 
-Confirm that there is not a total outage of the service. If not, the severity is
-likely low. Check for a recent deployment that may have introduced the problem,
-and rollback to the previous version. If not, check the request count and
-general network activity. If abnormally high, refer to the [traffic analysis run
-book][traffic_runbook] to identify and block any malicious traffic.
-
+If the avg response time is not [anomalously high][anomaly_alarm], the severity
+is likely low. Check for a recent deployment that may have introduced the
+problem, and rollback to the previous version. If not, check the request count
+and general network activity. If abnormally high, refer to the [traffic analysis
+run book][traffic_runbook] to identify and block any malicious traffic.
+
+[anomaly_alarm]:
+  https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Average+Response+Time+anomalously+high
 [traffic_runbook]:
   /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
 
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
index 870f9598285..557140e524d 100644
--- a/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
@@ -9,12 +9,14 @@ Alarm link:
 
 ## Severity Guide
 
-Confirm that there is not a total outage of the service. If not, the severity is
-likely low. Check for a recent deployment that may have introduced the problem,
-and rollback to the previous version. If not, check the request count and
-general network activity. If abnormally high, refer to the [traffic analysis run
-book][traffic_runbook] to identify and block any malicious traffic.
-
+If the P99 response time is not [anomalously high][anomaly_alarm], the severity
+is likely low. Check for a recent deployment that may have introduced the
+problem, and rollback to the previous version. If not, check the request count
+and general network activity. If abnormally high, refer to the [traffic analysis
+run book][traffic_runbook] to identify and block any malicious traffic.
+
+[anomaly_alarm]:
+  https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+P99+Response+Time+anomalously+high
 [traffic_runbook]:
   /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
 