
Fail-closed webhooks blocking resource creates/writes can cause k3s to be unstartable #6815

Closed
brandond opened this issue Jan 25, 2023 · 1 comment


brandond commented Jan 25, 2023

Applications (such as Rancher) that deploy validating webhooks with a fail-closed configuration can prevent the cluster from coming up after a cold restart. ServiceLB causes K3s to fail to start: the webhook is unavailable, which blocks the Namespace create, and the webhook won't become available until K3s finishes starting the pod that hosts it. The cluster is effectively unstartable unless ServiceLB is temporarily disabled.

Jan 25 11:18:28 rocky9-again k3s[28690]: time="2023-01-25T11:18:28-05:00" level=fatal msg="Failed to register service-controller handlers: Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable"

The above failure was due to an inability to create the ServiceLB Namespace, but a similar issue likely exists for the ServiceLB ServiceAccount and other critical resources. It might be better to start the cloud controller in a goroutine and retry, rather than failing immediately.

The current behavior of failing at startup when resource creation fails was introduced in #6181.

It can be worked around by starting k3s with --disable-servicelb and then re-enabling it once the webhook pod is up, but that's not ideal.

@brandond brandond self-assigned this Jan 25, 2023
@brandond brandond added this to the v1.26.2+k3s1 milestone Jan 25, 2023
@brandond brandond moved this to Peer Review in K3s Development Jan 25, 2023
@brandond brandond moved this from Peer Review to Working in K3s Development Jan 25, 2023
@brandond brandond changed the title Fail-closed webhooks blocking Namespace creates/writes can cause k3s to be unstartable Fail-closed webhooks blocking resource creates/writes can cause k3s to be unstartable Jan 25, 2023
@brandond brandond moved this from Working to To Test in K3s Development Feb 7, 2023
@VestigeJ

## Environment Details
Reproduced using VERSION=v1.26.1+k3s1
Validated fix using COMMIT=9efa0797b7fe5df846639bd57c0e50054c035cb4

Infrastructure

  • Cloud

Node(s) CPU architecture, OS, and version:

Linux 5.14.21-150400.24.11-default x86_64 GNU/Linux 
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"

Cluster Configuration:

NAME               STATUS   ROLES                       AGE     VERSION
ip-12-12-8-8       Ready    control-plane,etcd,master   8m36s   v1.26.1+k3s1 

Config.yaml:

write-kubeconfig-mode: 644
debug: true
token: mangocabbages
selinux: true
protect-kernel-defaults: true
cluster-init: true

Reproduced failure using latest v1.26.1 release

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0\nvm.overcommit_memory=1\nkernel.panic=10\nkernel.panic_on_oops=1\n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ VERSION=v1.26.1+k3s1
$ sudo INSTALL_K3S_VERSION=$VERSION INSTALL_K3S_EXEC=server ./install-k3s.sh 
$ set_kubefig //KUBECONFIG=/etc/rancher/k3s/k3s.yaml
$ kgp -A //kubectl get pods -A
$ vim webhook.yaml //paste included yaml deployment from below
$ k apply -f webhook.yaml //kubectl apply the file creating the webhook
$ kg mutatingwebhookconfigurations -A //kubectl get
$ sudo k3s-killall.sh //kill the running pods
$ sudo systemctl restart k3s //attempt to restart k3s
$ sudo systemctl status k3s //notice the service is stuck activating but fails
$ sudo journalctl -xeu k3s | grep webhook  //catch the error from journalctl as to why it's failing

Results:
$ sudo journalctl -xeu k3s

Feb 14 23:44:44 ip-12-12-8-8 k3s[8760]: W0214 23:44:44.648689    8760 dispatcher.go:196] Failed calling webhook, failing closed example.mutating.webhook.com: failed calling webhook "example.mutating.webhook.com": failed to call webhook: Post "https://example-mutating-webhook-svc.webhook.svc:443/mutate?timeout=10s": service "example-mutating-webhook-svc" not found
Feb 14 23:44:44 ip-12-12-8-8 k3s[8760]: time="2023-02-14T23:44:44Z" level=fatal msg="Failed to register service-controller handlers: Internal error occurred: failed calling webhook \"example.mutating.webhook.com\": failed to call webhook: Post \"https://example-mutating-webhook-svc.webhook.svc:443/mutate?timeout=10s\": service \"example-mutating-webhook-svc\" not found"

Validated that the patched behavior allows k3s to start up normally using the COMMIT for v1.26.2

Validation Steps

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0\nvm.overcommit_memory=1\nkernel.panic=10\nkernel.panic_on_oops=1\n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ COMMIT=9efa0797b7fe5df846639bd57c0e50054c035cb4
$ sudo INSTALL_K3S_COMMIT=$COMMIT INSTALL_K3S_EXEC=server ./install-k3s.sh 
$ set_kubefig //KUBECONFIG=/etc/rancher/k3s/k3s.yaml
$ kgp -A //kubectl get pods -A
$ vim webhook.yaml //paste included yaml deployment from below
$ k apply -f webhook.yaml 
$ kg mutatingwebhookconfigurations -A
$ sudo systemctl restart k3s //k3s should start normally
$ sudo systemctl status k3s //observe k3s has a running status and the api is responsive

Additional context / logs:

$ cat webhook.yaml

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-mutating-webhook-config
webhooks:
  - name: example.mutating.webhook.com
    clientConfig:
      service:
        namespace: webhook
        name: example-mutating-webhook-svc
        path: /mutate
    admissionReviewVersions:
      - v1
      - v1beta1
    rules:
      - operations: [ "CREATE", "UPDATE" ]
        apiGroups: [ "" ]
        apiVersions: [ "v1" ]
        resources: [ "pods", "namespaces", "serviceaccounts" ]
    failurePolicy: Fail
    sideEffects: None
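For contrast, the same webhook would not block startup if it were fail-open: only the `failurePolicy` field changes (a fragment of the configuration above, not an additional manifest).

```yaml
    # Fail-open variant: if the webhook is unreachable, the API request is
    # admitted anyway, so cold restarts are not blocked.
    failurePolicy: Ignore
```

`Fail` is what makes this reproduction bite: every CREATE/UPDATE on pods, namespaces, and serviceaccounts is rejected while the webhook service is down.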
