From 4fff8d8eb30e2462fce9b3babcb9bebe1fe1d7eb Mon Sep 17 00:00:00 2001
From: Poornima Krishnasamy <poornima.krishnasamy@digital.justice.gov.uk>
Date: Thu, 1 Feb 2024 09:46:02 +0000
Subject: [PATCH 1/2] Add incident log for prometheus restart with readiness
 probe failure

---
 runbooks/source/incident-log.html.md.erb | 45 ++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/runbooks/source/incident-log.html.md.erb b/runbooks/source/incident-log.html.md.erb
index f8c6fc3c..1eda77b6 100644
--- a/runbooks/source/incident-log.html.md.erb
+++ b/runbooks/source/incident-log.html.md.erb
@@ -7,6 +7,51 @@ weight: 45
 
 > Use the [mean-time-to-repair.rb] script to view performance metrics
 
+## Q4 2023 (October-December)
+
+- **Mean Time to Repair**: 35h 36m
+
+- **Mean Time to Resolve**: 35h 36m
+
+### Incident on 2023-11-01 10:41 - Prometheus restarted several times which resulted in missing metrics
+
+- **Key events**
+  - First detected: 2023-11-01 10:15
+  - Incident declared: 2023-11-01 10:41
+  - Repaired: 2023-11-03 14:38 
+  - Resolved: 2023-11-03 14:38
+
+- **Time to repair**: 35h 36m
+
+- **Time to resolve**: 35h 36m
+
+- **Identified**: [PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN](https://mojdt.slack.com/archives/C8PF51AT0/p1698833753414539)
+
+- **Impact**: Prometheus was unavailable. The Cloud Platform lost monitoring for a period of time.
+
+
+- **Context**:
+  - 2023-11-01 10:15: PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN. Team acknowledged and checked the state of the Prometheus server.
+  - 2023-11-01 10:41: PagerDuty alerted for Prometheus for the 3rd time in a row within a few minutes. Incident declared 
+  - 2023-11-01 10:41: Prometheus pod has restarted and the Prometheus container is starting
+  - 2023-11-01 10:41: Prometheus logs show numerous rule evaluation failures
+  - 2023-11-01 10:41: Events in the monitoring namespace recorded Readiness Probe failures for Prometheus
+  - 2023-11-01 12:35: Team enabled debug log level for Prometheus to understand the issue
+  - 2023-11-03 16:01: After investigating the logs, the team found that one possible root cause was the readiness probe failing before Prometheus restarted, so the team increased the readiness probe timeout
+  - 2023-11-03 16:01: Incident repaired and resolved.
+
+- **Resolution**:
+  - Team identified that the readiness probe was failing and Prometheus was being restarted. 
+  - Increased the readiness probe timeout from 3 to 6 seconds to avoid Prometheus restarts
+
+- **Review actions**:
+  - Team discussed inspecting more closely to identify these kinds of failures earlier
+  - Investigate whether the ingestion of data into the database is too large or takes too long
+  - Investigate whether executing some queries makes Prometheus work harder and stop responding to the readiness probe
+  - Investigate whether any other services probing Prometheus trigger the restart
+  - Investigate whether taking regular Velero backups disturbs EBS read/write and causes the restart
+
+
 ## Q3 2023 (July-September)
 
 - **Mean Time to Repair**: 10h 55m

From 27035f2b7d4ac8d21a4ee3332c0b45a4d073824c Mon Sep 17 00:00:00 2001
From: "github-actions[bot]" <github-actions[bot]@users.noreply.github.com>
Date: Thu, 1 Feb 2024 09:46:48 +0000
Subject: [PATCH 2/2] Commit changes made by code formatters

---
 runbooks/source/incident-log.html.md.erb | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/runbooks/source/incident-log.html.md.erb b/runbooks/source/incident-log.html.md.erb
index 1eda77b6..8cffcab3 100644
--- a/runbooks/source/incident-log.html.md.erb
+++ b/runbooks/source/incident-log.html.md.erb
@@ -18,7 +18,7 @@ weight: 45
 - **Key events**
   - First detected: 2023-11-01 10:15
   - Incident declared: 2023-11-01 10:41
-  - Repaired: 2023-11-03 14:38 
+  - Repaired: 2023-11-03 14:38
   - Resolved: 2023-11-03 14:38
 
 - **Time to repair**: 35h 36m
@@ -29,10 +29,9 @@ weight: 45
 
 - **Impact**: Prometheus was unavailable. The Cloud Platform lost monitoring for a period of time.
 
-
 - **Context**:
   - 2023-11-01 10:15: PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN. Team acknowledged and checked the state of the Prometheus server.
-  - 2023-11-01 10:41: PagerDuty alerted for Prometheus for the 3rd time in a row within a few minutes. Incident declared 
+  - 2023-11-01 10:41: PagerDuty alerted for Prometheus for the 3rd time in a row within a few minutes. Incident declared
   - 2023-11-01 10:41: Prometheus pod has restarted and the Prometheus container is starting
   - 2023-11-01 10:41: Prometheus logs show numerous rule evaluation failures
   - 2023-11-01 10:41: Events in the monitoring namespace recorded Readiness Probe failures for Prometheus
@@ -41,7 +40,7 @@ weight: 45
   - 2023-11-03 16:01: Incident repaired and resolved.
 
 - **Resolution**:
-  - Team identified that the readiness probe was failing and Prometheus was being restarted. 
+  - Team identified that the readiness probe was failing and Prometheus was being restarted.
   - Increased the readiness probe timeout from 3 to 6 seconds to avoid Prometheus restarts
 
 - **Review actions**:
@@ -51,7 +50,6 @@ weight: 45
   - Investigate whether any other services probing Prometheus trigger the restart
   - Investigate whether taking regular Velero backups disturbs EBS read/write and causes the restart
 
-
 ## Q3 2023 (July-September)
 
 - **Mean Time to Repair**: 10h 55m