Kubernetes Failure Stories

A compiled list of links to public failure stories related to Kubernetes. Most recent publications on top.

Total DNS outage in Kubernetes cluster - Zalando - postmortem 2019
- involved: AWS, DNS, CoreDNS, OOMKill, ndots:5, HTTP retries
- impact: production outage
Maximize learnings from a Kubernetes cluster failure - NU.nl - blog post 2019
- involved: AWS, NotReady nodes, SystemOOM, Helm, ElastAlert, no resource limits set
- impact: user experience affected for internally used tools and dashboards
Kubernetes Load Balancer Configuration - Beware when draining nodes - DevOps Hof - blog post 2019
- involved: GCP Load Balancer, externalTrafficPolicy, ingress-nginx
- impact: total ingress traffic outage
On Infrastructure at Scale: A Cascading Failure of Distributed Systems - Target - Medium post January 2019
- involved: on-premise, Kafka, large cluster, Consul, Docker daemon, high CPU usage
- impact: development environment outage
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Zalando - DevOpsCon Munich 2018
- involved: AWS, Ingress, CronJob, etcd, flannel, Docker, CPU throttling
- impact: production outages
Outages? Downtime? - Veracode - blog post 2018
- involved: AWS, AWS IAM, region migration, kubespray, Terraform, pod CIDR
- impact: QA/dev cluster outage
NRE Labs Outage Post-Mortem - NRE Labs - blog post 2018
- involved: GCP, kubeadm, etcd, Terraform, livenessProbe
- impact: production outage
A Perfect DNS Storm - Toyota Connected - blog post 2018
- involved: Azure, DNS, ndots:5, Alpine musl libc
- impact: DNS resolution failures
Kubernetes and the Menace ELB, the tale of an outage - Turnitin - blog post 2018
- involved: AWS, kube-aws, ELB dynamic IPs, API server, kubelet, NotReady nodes
- impact: 15-minute cluster outage
Moving the Entire Stack to K8s Within a Year – Lessons Learned - ThredUP - DevOpsStage 2018
- involved: AWS, kops, HAProxy, livenessProbe, DNS, too many open files
- impact: unknown outages, DNS errors
AirMap Platform Service Outage - AirMap - incident report 2018
- involved: Azure, NotReady nodes, kubelet PLEG, CNI
- impact: production AirMap platform outage
Anatomy of a Production Kubernetes Outage - Monzo - KubeCon Europe 2018
- involved: AWS, etcd, Linkerd, NullPointerException, gRPC client, services without endpoints, incompatible Kubernetes API change
- impact: production ledger/platform outage
101 Ways to "Break and Recover" Kubernetes Cluster - Oath/Yahoo - KubeCon Europe 2018
- involved: on-premise, namespace deletion, domain name collision, NotReady nodes, etcd empty dir, TLS certificate refresh, DNS issues, OOM
- impact: unknown cluster outages
101 Ways to Crash Your Cluster - Nordstrom - KubeCon North America 2017
- involved: AWS, NotReady nodes, OOM, eviction thresholds, ELB dynamic IPs, kubelet, cluster autoscaler, etcd split
- impact: full production cluster outage, other outages
Major Outage: Current account payments may fail - Monzo - Monzo Community post 2017
- involved: AWS, etcd, Linkerd, NullPointerException, services without endpoints
- impact: major production outage, full platform outage, current account payments fail
Fallacies of Distributed Computing with Kubernetes on AWS - Zalando - AWS User Group Hamburg October 2017
- involved: AWS, unhealthy nodes, Ingress, CronJob
- impact: production outage
Search and Reporting Outage - Universe - incident report 2017
- involved: Job, RestartPolicy, node resource consumption
- impact: production Universe search and reporting outage
Our First Kubernetes Outage - Saltside - blog post 2017
- involved: AWS, kops, Helm, NotReady nodes, resource exhaustion
- impact: nonproduction cluster outage
Our Failure Migrating to Kubernetes - Saltside - blog post 2017
- involved: AWS, kops, ELB, BackendConnectionErrors, LoadBalancer service
- impact: aborted application migration
SaleMove US System Issue - SaleMove - incident report 2017
- involved: AWS, ELB dynamic IPs, DNS A record for master, API server
- impact: production issues with SaleMove US System

Why

Kubernetes is a fairly complex system with many moving parts. Its ecosystem is constantly evolving and adding even more layers (such as service meshes) to the mix. Given this complexity, we don't hear enough real-world horror stories to learn from each other! This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, Ops, platform/infrastructure teams) to learn from others and reduce the unknown unknowns of running Kubernetes in production. For more information, see the blog post.

Contributing

Please help the community and share a link to your failure story by opening a Pull Request! A failure story can be a blog post, a conference or meetup talk, an incident postmortem, a tweetstorm, or anything similar.

I would also be glad to hear about your failure stories on Twitter: my handle is @try_except_
