From 00214f7c0cf624c690d5d4794746ba88972740f2 Mon Sep 17 00:00:00 2001
From: Staci Cooper
Date: Thu, 21 Sep 2023 15:03:59 -0700
Subject: [PATCH] Add thumb response time runbooks

---
 ...nails_avg_response_time_above_threshold.md | 28 +++++++++++++++++++
 ...pi_thumbnails_avg_response_time_anomaly.md | 28 +++++++++++++++++++
 ...nails_p99_response_time_above_threshold.md | 28 +++++++++++++++++++
 ...pi_thumbnails_p99_response_time_anomaly.md | 28 +++++++++++++++++++
 4 files changed, 112 insertions(+)
 create mode 100644 documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
 create mode 100644 documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md
 create mode 100644 documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
 create mode 100644 documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md

diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
new file mode 100644
index 00000000000..825cef056aa
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_above_threshold.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production Average Response Time above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and roll back to the previous version if so. Otherwise, check the request count
+and general network activity. If abnormally high, refer to the [traffic analysis
+run book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md
new file mode 100644
index 00000000000..d556118fb5a
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_avg_response_time_anomaly.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production Average Response Time anomalously high
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and roll back to the previous version if so. Otherwise, check the request count
+and general network activity. If abnormally high, refer to the [traffic analysis
+run book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
new file mode 100644
index 00000000000..870f9598285
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_above_threshold.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production P99 Response Time above threshold
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and roll back to the previous version if so. Otherwise, check the request count
+and general network activity. If abnormally high, refer to the [traffic analysis
+run book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
diff --git a/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md
new file mode 100644
index 00000000000..a3379025039
--- /dev/null
+++ b/documentation/meta/monitoring/runbooks/api_thumbnails_p99_response_time_anomaly.md
@@ -0,0 +1,28 @@
+# Run Book: API Thumbnails Production P99 Response Time anomalously high
+
+```{admonition} Metadata
+Status: **Unstable**
+Maintainer: @stacimc
+Alarm link:
+-
+```
+
+## Severity Guide
+
+Confirm that there is not a total outage of the service. If not, the severity is
+likely low. Check for a recent deployment that may have introduced the problem,
+and roll back to the previous version if so. Otherwise, check the request count
+and general network activity. If abnormally high, refer to the [traffic analysis
+run book][traffic_runbook] to identify and block any malicious traffic.
+
+[traffic_runbook]:
+  /meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
+
+## Historical false positives
+
+Nothing registered to date.
+
+## Related incident reports
+
+- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
+- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
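
The Severity Guide is identical in all four run books, and its triage order could be sketched as a small decision function. This is an illustration only and not part of the patch: `Observation`, `triage`, and the `high_traffic_ratio` cut-off are hypothetical names and values, and the run books do not define a number for "abnormally high" or say what to do when no check matches.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical snapshot of what the responder has checked so far."""

    total_outage: bool  # is the thumbnails service completely down?
    recent_deployment: bool  # did a deployment precede the alarm?
    request_count: float  # requests seen in the alarm window
    baseline_request_count: float  # typical requests for a comparable window


def triage(obs: Observation, high_traffic_ratio: float = 2.0) -> str:
    """Return the next step, following the order used in the Severity Guide.

    high_traffic_ratio is an assumed threshold, not a value from the run books.
    """
    if obs.total_outage:
        # The guide only says severity is likely low when there is no outage;
        # treating an outage as higher severity is an inference.
        return "total outage: treat as higher severity"
    if obs.recent_deployment:
        return "roll back to the previous version"
    if obs.request_count > high_traffic_ratio * obs.baseline_request_count:
        return "follow the traffic analysis run book"
    # Assumed fallback: the run books give no explicit step for this case.
    return "severity likely low: keep investigating"
```

For example, `triage(Observation(False, False, 500, 100))` falls through the outage and deployment checks and points the responder at the traffic analysis run book, because 500 exceeds twice the assumed baseline of 100.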