During an Alcide workshop, an admission controller was applied to the cluster with the understanding that it would not be disruptive. Unfortunately it was: create and update operations, mainly on Deployments but also on other resources in the apiGroups ["", "apps", "batch", "extensions"], timed out. We suspect the admission controller was slow to respond to admission requests.
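For context, a validating admission webhook intercepts matching API requests before they are persisted; when the webhook responds slowly and its failure policy is strict, those requests block until the webhook timeout expires, which surfaces as deploy timeouts. The sketch below is a hypothetical ValidatingWebhookConfiguration illustrating that shape; the names, namespace, service, and timeout are assumptions, not the actual Alcide manifest.

```yaml
# Hypothetical example; the real Alcide webhook name, service, and settings
# were not captured in this report.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: alcide-admission-controller    # assumed name
webhooks:
  - name: validate.alcide.example.com  # assumed name
    clientConfig:
      service:
        name: alcide-webhook           # assumed service
        namespace: alcide              # assumed namespace
        path: /validate
    rules:
      - apiGroups: ["", "apps", "batch", "extensions"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
    # With failurePolicy: Fail, a slow or unreachable webhook blocks the
    # matching create/update requests until timeoutSeconds expires, which is
    # what users experienced as deploy timeouts.
    failurePolicy: Fail
    timeoutSeconds: 30
    admissionReviewVersions: ["v1"]
    sideEffects: None
```

Setting `failurePolicy: Ignore` or a lower `timeoutSeconds` limits the blast radius of a misbehaving webhook, at the cost of requests passing through unvalidated.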
All times in CET.
| Time | Event |
|---|---|
| 2020-11-04 13:00 | Alcide workshop started |
| 2020-11-04 13:13 | Applied disruptive Alcide admission controller |
| 2020-11-04 14:41 | Users reported timeouts when deploying to Hellman (Kubernetes cluster) |
| 2020-11-04 15:04 | Verified the timeout issue |
| 2020-11-04 15:15 | Toggled the disruptive Alcide admission controller to verify the fix |
| 2020-11-04 15:39 | Permanently deleted the disruptive Alcide admission controller |
- Used the production cluster as a sandbox for a workshop
- No smoke tests in place to catch broken cluster capabilities
- Knowledge of Alcide was not spread across the team because the component is pending deprecation
- A false sense of safety from "re-applying" something that had already been in place
- Apply new components to sandbox clusters first
- Acquire the Alcide knowledge required to maintain the component until it is fully deprecated
- Monitor Kubernetes deployments for timeouts
- Monitor 5xx status codes from the Kubernetes API server (see the alerting sketch after this list)
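A minimal sketch of the last two action items, assuming Prometheus scrapes the kube-apiserver and that it exposes the standard `apiserver_request_total` and `apiserver_admission_webhook_admission_duration_seconds` metrics; the rule names, thresholds, and durations are placeholders to be tuned, not settled values.

```yaml
# Hypothetical Prometheus alerting rules; metric names assume a recent
# kube-apiserver version.
groups:
  - name: kube-apiserver-availability
    rules:
      - alert: KubeAPIServer5xxRate
        # Fire when more than 5% of API requests return 5xx for 10 minutes.
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            /
          sum(rate(apiserver_request_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "kube-apiserver is returning 5xx responses"
      - alert: SlowAdmissionWebhook
        # Fire when an admission webhook's p99 latency exceeds 5 seconds,
        # the failure mode behind this incident's deploy timeouts.
        expr: |
          histogram_quantile(0.99,
            sum by (name, le) (rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m]))
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Admission webhook {{ $labels.name }} is slow to respond"
```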
- Category: Kubernetes
- Time to detection: 88 minutes
- Time to recovery: 122 minutes