Add incident log for few previous incidents #4871

Merged 5 commits on Oct 11, 2023

2 changes: 1 addition & 1 deletion runbooks/makefile
@@ -1,4 +1,4 @@
-IMAGE := ministryofjustice/tech-docs-github-pages-publisher:v2
+IMAGE := ministryofjustice/tech-docs-github-pages-publisher:v3

# Use this to run a local instance of the documentation site, while editing
.PHONY: preview
195 changes: 191 additions & 4 deletions runbooks/source/incident-log.html.md.erb
@@ -9,15 +9,202 @@ weight: 45

## Q3 2023 (July-September)

- **Mean Time to Repair**: 10h 55m

- **Mean Time to Resolve**: 19h 21m

### Incident on 2023-09-18 15:12 - Lack of Disk space on nodes

- **Key events**
- First detected: 2023-09-18 13:42
- Incident declared: 2023-09-18 15:12
- Repaired: 2023-09-18 17:54
- Resolved: 2023-09-20 19:18

- **Time to repair**: 4h 12m

- **Time to resolve**: 35h 36m

- **Identified**: A user reported that they were seeing [ImagePull errors](https://mojdt.slack.com/archives/C57UPMZLY/p1695042194935169) with a "no space left on device" error

- **Impact**: Several nodes in the cluster were running out of disk space, so deployments might not be scheduled consistently and could fail.

- **Context**:
- 2023-09-18 13:42 Team noticed [RootVolUtilisation-Critical](https://moj-digital-tools.pagerduty.com/incidents/Q0RP1GPOECB97R?utm_campaign=channel&utm_source=slack) in the High-priority-alert channel
- 2023-09-18 14:03 A user reported that they were seeing [ImagePull errors](https://mojdt.slack.com/archives/C57UPMZLY/p1695042194935169) with a "no space left on device" error
- 2023-09-18 14:27 Team were performing the EKS module upgrade to version 18 and draining the nodes, and saw numerous pods in Evicted and ContainerStateUnknown states
- 2023-09-18 15:12 Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1695046332665969
- 2023-09-18 15:26 Compared the disk size allocated on an old node and a new node and identified that the new node had only 20GB of disk space (see the sketch after this timeline)
- 2023-09-18 15:34 Old default node group uncordoned
- 2023-09-18 15:35 Started draining the new nodes to shift workload back to the old node group
- 2023-09-18 17:54 Incident repaired
- 2023-09-19 10:30 Team started validating the fix and understanding the launch_template changes
- 2023-09-20 10:00 Team applied the fix on the manager cluster and later on the live cluster
- 2023-09-20 12:30 Started draining the old node group
- 2023-09-20 15:04 An increased number of pods were stuck in a "ContainerCreating" state
- 2023-09-20 15:25 An increased number of `"failed to assign an IP address to container" eni error` messages appeared. The CNI logs showed `Unable to get IP address from CIDR: no free IP available in the prefix`, suggesting IP prefix starvation, with prefixes only freed as the old nodes were drained
- 2023-09-20 19:18 All nodes drained and no pods left in an errored state. The initial disk space issue was resolved

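A quick way to confirm the disk mismatch described in the timeline above is to compare the ephemeral-storage capacity reported by an old and a new node. A minimal sketch; the node names are placeholders:

```bash
# Compare the disk capacity reported by an old and a new node
# (node names are placeholders).
kubectl get node <old-node-name> -o jsonpath='{.status.capacity.ephemeral-storage}'
kubectl get node <new-node-name> -o jsonpath='{.status.capacity.ephemeral-storage}'

# Or list the capacity for every node at once.
kubectl get nodes -o custom-columns='NAME:.metadata.name,DISK:.status.capacity.ephemeral-storage'
```
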
- **Resolution**:
- Team identified that the node root volume size had been reduced from 100GB to 20GB as part of the EKS module version 18 change
- Identified the required changes to the launch template and applied the fix (a sketch of how to compare launch template versions is below)

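When reviewing a change like this, the launch template versions can be compared directly with the AWS CLI. A sketch only; the launch template ID and version numbers are placeholders:

```bash
# Compare the block device mappings (and hence root volume size) between
# two launch template versions (ID and version numbers are placeholders).
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0123456789abcdef0 \
  --versions 1 2 \
  --query 'LaunchTemplateVersions[].{Version:VersionNumber,Disks:LaunchTemplateData.BlockDeviceMappings}'
```
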
- **Review actions**:
- Update the runbook to compare launch template changes during an EKS module upgrade
- Create a test setup that pulls images of different sizes, similar to live
- Update the RootVolUtilisation alert runbook to include checking the disk space configuration
- Scale CoreDNS dynamically based on the number of nodes
- Investigate whether we can use IPv6 to solve the IP prefix starvation problem
- Add drift testing to identify when a Terraform plan shows a change to the launch template
- Set up logging to view the CNI and ipamd logs, and set up alerts for errors related to IP prefix starvation (a sketch of how to pull these logs is below)
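
Until that logging is in place, the CNI and ipamd logs can be pulled ad hoc. A minimal sketch, using the AWS VPC CNI's default labels and on-node log path; the node name is a placeholder:

```bash
# Tail the aws-node (VPC CNI) container logs across the cluster.
kubectl -n kube-system logs -l k8s-app=aws-node -c aws-node --tail=100

# ipamd logs live on the node itself; one way in is a node debug pod
# (node name is a placeholder).
kubectl debug node/<node-name> -it --image=busybox -- \
  tail -n 100 /host/var/log/aws-routed-eni/ipamd.log
```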

### Incident on 2023-08-04 10:09 - Dropped logging in kibana

- **Key events**
- First detected: 2023-08-04 09:14
- Incident declared: 2023-08-04 10:09
- Repaired: 2023-08-10 12:28
- Resolved: 2023-08-10 14:47

- **Time to repair**: 33h 14m

- **Time to resolve**: 35h 33m

- **Identified**: Users reported in #ask-cloud-platform that they were seeing long periods of missing logs in Kibana.

- **Impact**: The Cloud Platform lost application logs for a period of time.

- **Context**:
- 2023-08-04 09:14: Users reported in #ask-cloud-platform that they were seeing long periods of missing logs in Kibana.
- 2023-08-04 10:03: Cloud Platform team started investigating the issue and restarted the fluent-bit pods
- 2023-08-04 10:09: Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1691140153374179
- 2023-08-04 12:03: Identified that the newer version of fluent-bit has changes to its chunk drop strategy
- 2023-08-04 16:00: Team bumped the fluent-bit version to see whether it brought any improvement
- 2023-08-07 10:30: Team regrouped and discussed troubleshooting steps
- 2023-08-07 12:05: Increased the fluent-bit memory buffer
- 2023-08-08 16:10: Implemented a fix to handle memory buffer overflow
- 2023-08-09 09:00: Merged the fix and deployed it to live
- 2023-08-10 11:42: Implemented a change to flush logs in smaller chunks
- 2023-08-10 12:28: Incident repaired
- 2023-08-10 14:47: Incident resolved

- **Resolution**:
- Team identified that the latest version of fluent-bit has changes to its chunk drop strategy
- Implemented a fix to handle memory buffer overflow by buffering to the filesystem and flushing logs in smaller chunks (a sketch of how to check fluent-bit for buffer pressure is below)

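A quick way to spot this kind of buffer pressure is to watch the fluent-bit pods for restarts and scan their logs for chunk, retry and buffer warnings. A sketch only; the namespace and label selector are assumptions about how fluent-bit is deployed:

```bash
# Look for fluent-bit pods that are restarting or not Ready
# (namespace and label selector are assumptions).
kubectl -n logging get pods -l app.kubernetes.io/name=fluent-bit

# Scan recent fluent-bit logs for chunk drops, retries and buffer warnings.
kubectl -n logging logs -l app.kubernetes.io/name=fluent-bit --tail=200 \
  | grep -iE 'chunk|retry|buf'
```
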
- **Review actions**:
- Push notifications from logging clusters to #lower-priority-alerts [#4704](https://github.com/ministryofjustice/cloud-platform/issues/4704)
- Add integration test to check that logs are being sent to the logging cluster

### Incident on 2023-07-25 15:21 - Prometheus on live cluster DOWN

- **Key events**
- First detected: 2023-07-25 14:05
- Incident declared: 2023-07-25 15:21
- Repaired: 2023-07-25 15:55
- Resolved: 2023-07-25 15:55

- **Time to repair**: 1h 50m

- **Time to resolve**: 1h 50m

- **Identified**: [PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN](https://mojdt.slack.com/archives/C8PF51AT0/p1690290348206639)

- **Impact**: Prometheus was not available. The Cloud Platform lost monitoring for a period of time.

- **Context**:
- 2023-07-25 14:05 - PagerDuty High Priority alert from Pingdom that the Prometheus - live healthcheck is DOWN. Team acknowledged and checked the state of the Prometheus server, which was erroring on rule evaluation and exiting with code 137
- 2023-07-25 14:09: The Prometheus pod was in a Terminating state
- 2023-07-25 14:17: The node where Prometheus was running went into a NotReady state
- 2023-07-25 14:22: Drained the monitoring node, which moved Prometheus to another monitoring node
- 2023-07-25 14:56: After moving to the new node, Prometheus restarted again just after coming back up, and the node returned to a Ready state
- 2023-07-25 15:11: Comms went out to cloud-platform-update that Prometheus was DOWN
- 2023-07-25 15:20: Team found that node memory was spiking to 89% and decided to move to a bigger instance size
- 2023-07-25 15:21: Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1690294889724869
- 2023-07-25 15:31: Changed the instance size to `r6i.4xlarge`
- 2023-07-25 15:50: Prometheus still restarted after running; the team found the most recent Prometheus pod had been terminated with OOMKilled (see the sketch after this timeline) and increased the memory limit to 100Gi
- 2023-07-25 16:18: Updated the Prometheus container limits to 12 CPU cores and 110Gi memory to accommodate Prometheus's resource needs
- 2023-07-25 16:18: Incident repaired
- 2023-07-25 16:18: Incident resolved

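A minimal sketch of the checks used in a situation like this, assuming Prometheus runs in the `monitoring` namespace (the namespace and pod name are placeholders):

```bash
# Check whether the Prometheus container was OOMKilled on its last restart
# (namespace and pod name are placeholders).
kubectl -n monitoring describe pod <prometheus-pod-name> | grep -A 5 'Last State'

# Check memory pressure on the monitoring nodes (requires metrics-server).
kubectl top nodes
```
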
- **Resolution**:
- Due to the increased number of namespaces and Prometheus rules, the Prometheus server needed more memory. The existing instance size was not enough to keep Prometheus running.
- Updating the node type to double the CPU and memory, and increasing the Prometheus server's container resource limits, resolved the issue

- **Review actions**:
- Add an alert to monitor node memory usage and to flag when a single pod is using up most of the node memory [#4538](https://github.com/ministryofjustice/cloud-platform/issues/4538)

### Incident on 2023-07-21 09:31 - VPC CNI not allocating IP addresses

- **Key events**
- First detected: 2023-07-21 08:15
- Incident declared: 2023-07-21 09:31
- Repaired: 2023-07-21 12:42
- Resolved: 2023-07-21 12:42

- **Time to repair**: 4h 27m

- **Time to resolve**: 4h 27m

- **Identified**: A user reported seeing issues with new deployments in #ask-cloud-platform

- **Impact**: The service availability for CP applications may be degraded/at increased risk of failure.

- **Context**:
- 2023-07-21 08:15 - A user reported seeing issues with new deployments (pods stuck in ContainerCreating)
- 2023-07-21 09:00 - Team started to put together a list of all affected namespaces
- 2023-07-21 09:31 - Incident declared
- 2023-07-21 09:45 - Team identified that the issue affected 6 nodes, added new nodes, and began to cordon/drain the affected nodes
- 2023-07-21 12:35 - Compared the CNI settings on a 1.23 test cluster with live and found that a setting was different
- 2023-07-21 12:42 - Ran the command to enable Prefix Delegation on the live cluster (see the sketch after this timeline)
- 2023-07-21 12:42 - Incident repaired
- 2023-07-21 12:42 - Incident resolved

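For reference, prefix delegation on the AWS VPC CNI is normally enabled with the standard commands below; this is a sketch, not necessarily the exact commands run during the incident:

```bash
# Enable prefix delegation on the VPC CNI (aws-node) daemonset.
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

# Confirm the setting has been applied.
kubectl describe daemonset aws-node -n kube-system | grep PREFIX_DELEGATION
```
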
- **Resolution**:
- The issue was caused by the Prefix Delegation setting missing on the live cluster. The team added the setting to the live cluster and the issue was resolved

- **Review actions**:
- Add a test/check to ensure the IP address allocation is working as expected [#4669](https://github.com/ministryofjustice/cloud-platform/issues/4669)

## Q2 2023 (April-June)

- **Mean Time to Repair**: 0h 55m

- **Mean Time to Resolve**: 0h 55m

### Incident on 2023-06-06 11:00 - User services down

- **Key events**
- First detected: 2023-06-06 10:26
- Incident declared: 2023-06-06 11:00
- Repaired: 2023-06-06 11:21
- Resolved: 2023-06-06 11:21

- **Time to repair**: 0h 55m

- **Time to resolve**: 0h 55m

- **Identified**: Several users reported that their production pods had been deleted all at once, and that they were receiving Pingdom alerts that their applications were down for a few minutes

- **Impact**: User services were down for a few minutes

- **Context**:
- 2023-06-06 10:23 - A user reported that their production pods had been deleted all at once
- 2023-06-06 10:30 - Users reported that their services were back up and running.
- 2023-06-06 10:30 - Team found that the nodes were being recycled all at once during the node instance type change
- 2023-06-06 10:50 - A user reported that the DPS service was down because users could not authenticate to the service
- 2023-06-06 11:00 - Incident declared
- 2023-06-06 11:21 - A user reported that the DPS service was back up and running
- 2023-06-06 11:21 - Incident repaired
- 2023-06-06 13:11 - Incident resolved

- **Resolution**:
- When the node instance type was changed, all of the nodes were recycled at the same time, which caused the pods to be deleted all at once.
- Raised a ticket with AWS asking for the steps to update the node instance type without causing an outage to services.
- The instance type update is performed through Terraform, so the team will have to come up with a plan and update the runbook to perform these changes without downtime (a rough sketch of a one-node-at-a-time drain is below).

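A rough sketch of the kind of one-node-at-a-time recycle such a runbook might describe; the label selector and wait interval are placeholders, not an agreed process:

```bash
# Recycle the old nodes one at a time instead of all at once
# (label selector and sleep interval are placeholders).
for node in $(kubectl get nodes -l node-group=old-group -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Give rescheduled pods time to become Ready before moving to the next node.
  sleep 300
done
```
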
- **Review actions**:
- Add a runbook for the steps to perform when changing the node instance type

## Q1 2023 (January-March)
