
Fail-closed webhooks blocking resource creates/writes can cause k3s to be unstartable #6815

Closed
brandond opened this issue Jan 25, 2023 · 1 comment


brandond commented Jan 25, 2023

Applications (such as Rancher) that deploy validating webhooks with a fail-closed configuration can prevent the cluster from coming up after a cold restart. ServiceLB causes K3s to fail to start: the webhook is unavailable, which blocks the Namespace create, and the webhook won't become available until K3s finishes starting the pod that hosts it. The cluster is effectively unstartable unless ServiceLB is temporarily disabled.

Jan 25 11:18:28 rocky9-again k3s[28690]: time="2023-01-25T11:18:28-05:00" level=fatal msg="Failed to register service-controller handlers: Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable"

The above failure was due to an inability to create the ServiceLB Namespace, but a similar issue likely exists for the ServiceLB ServiceAccount and other critical resources. It might be better to start the cloud controller in a goroutine and retry, rather than failing immediately.

The current behavior of failing at startup when resource creation fails was introduced in #6181.

It can be worked around by starting k3s with --disable-servicelb and then re-enabling it once the webhook pod is up, but that's not ideal.

@brandond brandond self-assigned this Jan 25, 2023
@brandond brandond added this to the v1.26.2+k3s1 milestone Jan 25, 2023
@brandond brandond moved this to Peer Review in K3s Development Jan 25, 2023
@brandond brandond moved this from Peer Review to Working in K3s Development Jan 25, 2023
@brandond brandond changed the title Fail-closed webhooks blocking Namespace creates/writes can cause k3s to be unstartable Fail-closed webhooks blocking resource creates/writes can cause k3s to be unstartable Jan 25, 2023
@brandond brandond moved this from Working to To Test in K3s Development Feb 7, 2023
@VestigeJ

## Environment Details
Reproduced using VERSION=v1.26.1+k3s1
Validated fix using COMMIT=9efa0797b7fe5df846639bd57c0e50054c035cb4

Infrastructure

  • Cloud

Node(s) CPU architecture, OS, and version:

Linux 5.14.21-150400.24.11-default x86_64 GNU/Linux 
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"

Cluster Configuration:

NAME               STATUS   ROLES                       AGE     VERSION
ip-12-12-8-8       Ready    control-plane,etcd,master   8m36s   v1.26.1+k3s1 

Config.yaml:

write-kubeconfig-mode: 644
debug: true
token: mangocabbages
selinux: true
protect-kernel-defaults: true
cluster-init: true

Reproduced failure using latest v1.26.1 release

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0\nvm.overcommit_memory=1\nkernel.panic=10\nkernel.panic_on_oops=1\n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ VERSION=v1.26.1+k3s1
$ sudo INSTALL_K3S_VERSION=$VERSION INSTALL_K3S_EXEC=server ./install-k3s.sh 
$ set_kubefig //KUBECONFIG=/etc/rancher/k3s/k3s.yaml
$ kgp -A //kubectl get pods -A
$ vim webhook.yaml //paste included yaml deployment from below
$ k apply -f webhook.yaml //kubectl apply the file creating the webhook
$ kg mutatingwebhookconfigurations -A //kubectl get
$ sudo k3s-killall.sh //kill the running pods
$ sudo systemctl restart k3s //attempt to restart k3s
$ sudo systemctl status k3s //notice the service is stuck activating but fails
$ sudo journalctl -xeu k3s | grep webhook  //catch the error from journalctl as to why it's failing

Results:
$ sudo journalctl -xeu k3s

Feb 14 23:44:44 ip-12-12-8-8 k3s[8760]: W0214 23:44:44.648689    8760 dispatcher.go:196] Failed calling webhook, failing closed example.mutating.webhook.com: failed calling webhook "example.mutating.webhook.com": failed to call webhook: Post "https://example-mutating-webhook-svc.webhook.svc:443/mutate?timeout=10s": service "example-mutating-webhook-svc" not found
Feb 14 23:44:44 ip-12-12-8-8 k3s[8760]: time="2023-02-14T23:44:44Z" level=fatal msg="Failed to register service-controller handlers: Internal error occurred: failed calling webhook \"example.mutating.webhook.com\": failed to call webhook: Post \"https://example-mutating-webhook-svc.webhook.svc:443/mutate?timeout=10s\": service \"example-mutating-webhook-svc\" not found"

Validated that the patched behavior allows k3s to start up normally using the COMMIT for v1.26.2

Validation Steps

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0\nvm.overcommit_memory=1\nkernel.panic=10\nkernel.panic_on_oops=1\n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ COMMIT=9efa0797b7fe5df846639bd57c0e50054c035cb4
$ sudo INSTALL_K3S_COMMIT=$COMMIT INSTALL_K3S_EXEC=server ./install-k3s.sh 
$ set_kubefig //KUBECONFIG=/etc/rancher/k3s/k3s.yaml
$ kgp -A //kubectl get pods -A
$ vim webhook.yaml //paste included yaml deployment from below
$ k apply -f webhook.yaml 
$ kg mutatingwebhookconfigurations -A
$ sudo systemctl restart k3s //k3s should start normally
$ sudo systemctl status k3s //observe k3s has a running status and the api is responsive

Additional context / logs:

$ cat webhook.yaml

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-mutating-webhook-config
webhooks:
  - name: example.mutating.webhook.com
    clientConfig:
      service:
        namespace: webhook
        name: example-mutating-webhook-svc
        path: /mutate
    admissionReviewVersions:
      - v1
      - v1beta1
    rules:
      - operations: [ "CREATE", "UPDATE" ]
        apiGroups: [ "" ]
        apiVersions: [ "v1" ]
        resources: [ "pods", "namespaces", "serviceaccounts" ]
    failurePolicy: Fail
    sideEffects: None
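For contrast, the same webhook would not block startup if it were fail-open: only the `failurePolicy` field changes (a fragment of the configuration above, not an additional manifest).

```yaml
    # Fail-open variant: if the webhook is unreachable, the API request is
    # admitted anyway, so cold restarts are not blocked.
    failurePolicy: Ignore
```

`Fail` is what makes this reproduction bite: every CREATE/UPDATE on pods, namespaces, and serviceaccounts is rejected while the webhook service is down.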
