diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
new file mode 100644
index 00000000000..b48e6d9fafe
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
@@ -0,0 +1,30 @@
+# Run Book: API Thumbnails Production Average Response Time above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+If the avg response time is not [anomalously high][anomaly_alarm], the severity
+is likely low. Check for a recent deployment that may have introduced the
+problem, and rollback to the previous version. If not, check the request count
+and general network activity. If abnormally high, refer to the [traffic analysis
+run book][traffic_runbook] to identify and block any malicious traffic.
+
+[anomaly_alarm]:
+  https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Average+Response+Time+anomalously+high
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md
new file mode 100644
index 00000000000..d556118fb5a
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production Average Response Time anomalously high
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and rollback to the previous version. If not, check the request count and
+general network activity. If abnormally high, refer to the [traffic analysis run
+book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
new file mode 100644
index 00000000000..557140e524d
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
@@ -0,0 +1,30 @@
+# Run Book: API Thumbnails Production P99 Response Time above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+If the P99 response time is not [anomalously high][anomaly_alarm], the severity
+is likely low. Check for a recent deployment that may have introduced the
+problem, and rollback to the previous version. If not, check the request count
+and general network activity. If abnormally high, refer to the [traffic analysis
+run book][traffic_runbook] to identify and block any malicious traffic.
+
+[anomaly_alarm]:
+  https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+P99+Response+Time+anomalously+high
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md
new file mode 100644
index 00000000000..a3379025039
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production P99 Response Time anomalously high
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and rollback to the previous version. If not, check the request count and
+general network activity. If abnormally high, refer to the [traffic analysis run
+book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/index.md b/documentation/meta/monitoring/runbooks/index.md
index 3c4745f0a52..78f335d9ee4 100644
--- a/documentation/meta/monitoring/runbooks/index.md
+++ b/documentation/meta/monitoring/runbooks/index.md
@@ -19,6 +19,10 @@ api_avg_response_time_above_threshold
 api_avg_response_time_anomaly
 api_p99_response_time_above_threshold
 api_p99_response_time_anomaly
+api_thumbnails_avg_response_time_above_threshold
+api_thumbnails_avg_response_time_anomaly
+api_thumbnails_p99_response_time_above_threshold
+api_thumbnails_p99_response_time_anomaly
 nuxt_request_count
 nuxt_2xx_under_threshold
 nuxt_5xx_above_threshold