From 0210e19f0c8b0cf34aa3f0346e3c5849ca4b1a58 Mon Sep 17 00:00:00 2001
From: Mike Bell
Date: Tue, 30 Jul 2024 08:25:46 +0100
Subject: [PATCH 1/4] feat: add incident log for 25-07-24

---
 runbooks/source/incident-log.html.md.erb | 52 ++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/runbooks/source/incident-log.html.md.erb b/runbooks/source/incident-log.html.md.erb
index ed37fb7a..8468ceb3 100644
--- a/runbooks/source/incident-log.html.md.erb
+++ b/runbooks/source/incident-log.html.md.erb
@@ -7,6 +7,58 @@ weight: 45
 
 > Use the [mean-time-to-repair.rb] script to view performance metrics
 
+## Q3 2024 (July-September)
+
+- **Mean Time to Repair**: 3h 8m
+
+- **Mean Time to Resolve**: 4h 9m
+
+### Incident on 2024-07-25
+
+- **Key events**
+  - First detected: 2024-07-25 12:10
+  - Incident declared: 2024-07-25 14:54
+  - Repaired declared: 2024-07-25 15:18
+  - Resolved: 2024-07-25 16:19
+
+- **Time to repair**: 3h 8m
+
+- **Time to resolve**: 4h 9m
+
+- **Identified**: User reported that Elasticsearch was no longer receiving logs
+
+- **Impact**: Elasticsearch and Opensearch did not receive logs
+
+- **Context**:
+  - 2024-07-25 12:10: cp-live-app-logs - ClusterIndexWritesBlocked starts
+  - 2024-07-25 12:30: cp-live-app-logs - ClusterIndexWritesBlocked recovers
+  - 2024-07-25 12:50: cp-live-app-logs - ClusterIndexWritesBlocked recovers
+  - 2024-07-25 12:35: cp-live-app-logs - ClusterIndexWritesBlocked starts
+  - 2024-07-25 12:55: cp-live-app-logs - ClusterIndexWritesBlocked starts
+  - 2024-07-25 13:15: cp-live-app-logs - ClusterIndexWritesBlocked recovers and starts
+  - 2024-07-25 13:40: cp-live-app-logs - ClusterIndexWritesBlocked recovers and starts
+  - 2024-07-25 13:45: Kibana no longer receiving any logs
+  - 2024-07-25 14:27: User notifies team via #ask-cloud-platform that Kibana has not been receiving logs since 13:45.
+  - 2024-07-25 14:32: Initial investigation shows no problems in live monitoring namespace
+  - 2024-07-25 14:42: Google meet call started to triage
+  - 2024-07-25 14:54: Incident declared
+  - 2024-07-25 14:55: Logs from fluent-bit containers show “could not enqueue into the ring buffer”
+  - 2024-07-25 14:59: rollout restart of all fluent-bit containers, logs partially start flowing but after a few minutes show the same error message
+  - 2024-07-25 15:18: It is noted that Opensearch is out of disk space, this is increased from 8000 to 12000
+  - 2024-07-25 15:58: Disk space increase is complete and we start seeing fluent-bit processing logs
+  - 2024-07-25 16:15: Remediation tasks are defined and started to action
+  - 2024-07-25 16:19: Incident declared resolved
+
+- **Resolution**:
+  - Opensearch disk space is increased from 8000 to 12000
+  - Fluenbit is configured to not log to Opensearch
+
+- **Review actions**:
+  - [Opensearch and Elasticsearch index dating issues](https://github.com/ministryofjustice/cloud-platform/issues/5931)
+  - [Investigate fluent-bit "failed to flush chunk"](https://github.com/ministryofjustice/cloud-platform/issues/5930)
+  - [Document and fix live-2 logging](https://github.com/ministryofjustice/cloud-platform/issues/5929)
+  - [High priority alerts for Elasticsearch and Opensearch](https://github.com/ministryofjustice/cloud-platform/issues/5928)
+
 ## Q1 2024 (January-April)
 
 - **Mean Time to Repair**: 3h 21m

From 4af970234dbd474df59e73007c673e8de148b433 Mon Sep 17 00:00:00 2001
From: Mike Bell
Date: Tue, 30 Jul 2024 09:09:42 +0100
Subject: [PATCH 2/4] Update runbooks/source/incident-log.html.md.erb

Co-authored-by: Steve Williams <105657964+sj-williams@users.noreply.github.com>
---
 runbooks/source/incident-log.html.md.erb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/runbooks/source/incident-log.html.md.erb b/runbooks/source/incident-log.html.md.erb
index 8468ceb3..4fd104f2 100644
--- a/runbooks/source/incident-log.html.md.erb
+++ b/runbooks/source/incident-log.html.md.erb
@@ -51,7 +51,7 @@ weight: 45
 
 - **Resolution**:
   - Opensearch disk space is increased from 8000 to 12000
-  - Fluenbit is configured to not log to Opensearch
+  - Fluentbit is configured to not log to Opensearch as a temporary measure whilst follow-up investigation work into root cause is carried out.
 
 - **Review actions**:
   - [Opensearch and Elasticsearch index dating issues](https://github.com/ministryofjustice/cloud-platform/issues/5931)

From 457cddddd39453b4151128f7f5972a22b8b1ca23 Mon Sep 17 00:00:00 2001
From: Mike Bell
Date: Tue, 30 Jul 2024 09:28:54 +0100
Subject: [PATCH 3/4] docs: update priorities and rewrite live2 ticket

---
 runbooks/source/incident-log.html.md.erb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/runbooks/source/incident-log.html.md.erb b/runbooks/source/incident-log.html.md.erb
index 4fd104f2..c53ebee0 100644
--- a/runbooks/source/incident-log.html.md.erb
+++ b/runbooks/source/incident-log.html.md.erb
@@ -55,9 +55,9 @@ weight: 45
 
 - **Review actions**:
   - [Opensearch and Elasticsearch index dating issues](https://github.com/ministryofjustice/cloud-platform/issues/5931)
-  - [Investigate fluent-bit "failed to flush chunk"](https://github.com/ministryofjustice/cloud-platform/issues/5930)
-  - [Document and fix live-2 logging](https://github.com/ministryofjustice/cloud-platform/issues/5929)
   - [High priority alerts for Elasticsearch and Opensearch](https://github.com/ministryofjustice/cloud-platform/issues/5928)
+  - [Re-introduce Opensearch in to Live logging](https://github.com/ministryofjustice/cloud-platform/issues/5929)
+  - [Investigate fluent-bit "failed to flush chunk"](https://github.com/ministryofjustice/cloud-platform/issues/5930)
 
 ## Q1 2024 (January-April)
 

From 96f70bdf2b206cf778ac0859c920cbf150974bdc Mon Sep 17 00:00:00 2001
From: Mike Bell
Date: Tue, 30 Jul 2024 09:32:31 +0100
Subject: [PATCH 4/4] docs: update impact

---
 runbooks/source/incident-log.html.md.erb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/runbooks/source/incident-log.html.md.erb b/runbooks/source/incident-log.html.md.erb
index c53ebee0..f572fa75 100644
--- a/runbooks/source/incident-log.html.md.erb
+++ b/runbooks/source/incident-log.html.md.erb
@@ -27,7 +27,7 @@ weight: 45
 
 - **Identified**: User reported that Elasticsearch was no longer receiving logs
 
-- **Impact**: Elasticsearch and Opensearch did not receive logs
+- **Impact**: Elasticsearch and Opensearch did not receive logs, which meant that we lost users' logs for the period of the incident. These logs have not been recovered.
 
 - **Context**:
   - 2024-07-25 12:10: cp-live-app-logs - ClusterIndexWritesBlocked starts