Fail-closed webhooks blocking resource creates/writes can cause k3s to be unstartable #6815
brandond changed the title from "Fail-closed webhooks blocking Namespace creates/writes can cause k3s to be unstartable" to "Fail-closed webhooks blocking resource creates/writes can cause k3s to be unstartable" on Jan 25, 2023
## Environment Details
Infrastructure
Node(s) CPU architecture, OS, and version:
Cluster Configuration:
Config.yaml:
Reproduced failure using latest v1.26.1 release
Results:
Validated that the patched behavior allows k3s to start up normally using COMMIT for v1.26.2.
Validation Steps
Additional context / logs:
$ cat webhook.yaml
Applications (such as Rancher) that deploy validating webhooks with a fail-closed configuration can prevent the cluster from coming up after a cold restart. On startup, ServiceLB tries to create its Namespace; the create is rejected because the webhook is unavailable, and K3s exits. The webhook cannot become available until K3s has finished starting the pod that hosts it, so the cluster is effectively unstartable unless ServiceLB is temporarily disabled.
Jan 25 11:18:28 rocky9-again k3s[28690]: time="2023-01-25T11:18:28-05:00" level=fatal msg="Failed to register service-controller handlers: Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable"
The above failure was caused by the inability to create the ServiceLB Namespace, but a similar issue probably exists for the ServiceLB ServiceAccount and other critical resources. It might be good to just have the cloud controller start up in a goroutine and retry, instead of failing immediately?
The current behavior of failing on startup if resource creation fails was added in #6181.
It can be worked around by starting k3s with `--disable-servicelb` and then re-enabling it once the webhook pod is up, but that's not ideal.