Add thumbnail response time runbooks #3053

Merged: 3 commits, Sep 26, 2023
@@ -0,0 +1,30 @@
# Run Book: API Thumbnails Production Average Response Time above threshold

```{admonition} Metadata
Status: **Unstable**
Maintainer: @stacimc
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Average+Response+Time+above+threshold>
```

## Severity Guide

If the average response time is not [anomalously high][anomaly_alarm], the
severity is likely low. Check for a recent deployment that may have introduced
the problem and, if there is one, roll back to the previous version. If there is
no such deployment, check the request count and general network activity. If
either is abnormally high, refer to the [traffic analysis run
book][traffic_runbook] to identify and block any malicious traffic.

[anomaly_alarm]:
https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Average+Response+Time+anomalously+high
[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
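
The first check in the severity guide (whether the companion anomaly alarm is
also firing) could be scripted rather than read off the console. The following
is a minimal sketch, not project tooling: the helper name `is_alarm_firing` is
hypothetical, the alarm name is assumed to be the URL-decoded name from the
anomaly alarm link, and the client is assumed to behave like a boto3 CloudWatch
client.

```python
"""Sketch: check whether the companion anomaly alarm is currently firing."""

# Assumption: the alarm name is the URL-decoded name from the anomaly alarm link.
ANOMALY_ALARM = "API Thumbnails Production Average Response Time anomalously high"


def is_alarm_firing(cloudwatch, alarm_name):
    """Return True if the named CloudWatch alarm is in the ALARM state.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch"):
    describe_alarms(AlarmNames=[...]) returns a dict with a "MetricAlarms"
    list whose entries carry a "StateValue" of "OK", "ALARM" or
    "INSUFFICIENT_DATA".
    """
    response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
    alarms = response.get("MetricAlarms", [])
    if not alarms:
        raise LookupError(f"No metric alarm named {alarm_name!r} was found")
    return alarms[0]["StateValue"] == "ALARM"


# Usage (assumes boto3 is installed and AWS credentials are configured):
#
#   import boto3
#   client = boto3.client("cloudwatch", region_name="us-east-1")
#   print(is_alarm_firing(client, ANOMALY_ALARM))
```

Passing the client in as an argument keeps the helper easy to exercise with a
stub, without touching AWS.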

## Historical false positives

Nothing registered to date.

## Related incident reports

- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
@@ -0,0 +1,28 @@
# Run Book: API Thumbnails Production Average Response Time anomalously high

```{admonition} Metadata
Status: **Unstable**
Maintainer: @stacimc
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+Average+Response+Time+anomalously+high>
```

## Severity Guide

Confirm that the service is not experiencing a total outage. If it is not, the
severity is likely low. Check for a recent deployment that may have introduced
the problem and, if there is one, roll back to the previous version. If there is
no such deployment, check the request count and general network activity. If
either is abnormally high, refer to the [traffic analysis run
book][traffic_runbook] to identify and block any malicious traffic.

[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md
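
The order of checks in the severity guide can be written down as a small
decision function. This is only an illustrative sketch: the `Observations`
fields and the `triage` function are hypothetical names, and the outcomes for
cases the guide does not cover (a total outage, or no cause found) are
placeholders rather than documented procedure.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """What the responder has established so far (hypothetical fields)."""

    total_outage: bool
    recent_deployment: bool
    traffic_abnormally_high: bool


def triage(obs):
    """Return (severity, next_step), following the guide's order of checks."""
    if obs.total_outage:
        # The guide only states the severity when there is no outage, so this
        # branch is a placeholder, not documented procedure.
        return ("unspecified", "total outage: outside the scope of this guide")

    severity = "likely low"
    if obs.recent_deployment:
        return (severity, "roll back to the previous version")
    if obs.traffic_abnormally_high:
        return (severity, "follow the traffic analysis run book")
    # Placeholder: the guide gives no further step for this case.
    return (severity, "no further step given; continue investigating")
```

For example, `triage(Observations(False, True, False))` points at a rollback,
because a recent deployment is checked before traffic.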

## Historical false positives

Nothing registered to date.

## Related incident reports

- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
@@ -0,0 +1,30 @@
# Run Book: API Thumbnails Production P99 Response Time above threshold

```{admonition} Metadata
Status: **Unstable**
Maintainer: @stacimc
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+P99+Response+Time+above+threshold>
```

## Severity Guide

If the P99 response time is not [anomalously high][anomaly_alarm], the severity
is likely low. Check for a recent deployment that may have introduced the
problem and, if there is one, roll back to the previous version. If there is no
such deployment, check the request count and general network activity. If either
is abnormally high, refer to the [traffic analysis run book][traffic_runbook] to
identify and block any malicious traffic.

[anomaly_alarm]:
https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+P99+Response+Time+anomalously+high
[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md

## Historical false positives

Nothing registered to date.

## Related incident reports

- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
@@ -0,0 +1,28 @@
# Run Book: API Thumbnails Production P99 Response Time anomalously high

```{admonition} Metadata
Status: **Unstable**
Maintainer: @stacimc
Alarm link:
- <https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/API+Thumbnails+Production+P99+Response+Time+anomalously+high>
```

## Severity Guide

Confirm that the service is not experiencing a total outage. If it is not, the
severity is likely low. Check for a recent deployment that may have introduced
the problem and, if there is one, roll back to the previous version. If there is
no such deployment, check the request count and general network activity. If
either is abnormally high, refer to the [traffic analysis run
book][traffic_runbook] to identify and block any malicious traffic.

[traffic_runbook]:
/meta/monitoring/traffic/runbooks/identifying-and-blocking-traffic-anomalies.md

## Historical false positives

Nothing registered to date.

## Related incident reports

- 2023-09-05 at 22:15 UTC: Unhealthy thumbnail tasks restarted
- 2023-07-27 at 19:14 UTC: API Thumbnails unhealthy hosts
4 changes: 4 additions & 0 deletions documentation/meta/monitoring/runbooks/index.md
@@ -19,6 +19,10 @@ api_avg_response_time_above_threshold
api_avg_response_time_anomaly
api_p99_response_time_above_threshold
api_p99_response_time_anomaly
api_thumbnails_avg_response_time_above_threshold
api_thumbnails_avg_response_time_anomaly
api_thumbnails_p99_response_time_above_threshold
api_thumbnails_p99_response_time_anomaly
nuxt_request_count
nuxt_2xx_under_threshold
nuxt_5xx_above_threshold