Node isn't deleted from kube after deleted in panel or through cli #142
I tried your reproduction script 3 times (without the poweroff/poweron steps) and each time the node was removed after a few minutes. The process of detecting a removed node usually works like this:
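Roughly: the controller periodically asks the Hetzner Cloud API whether the server backing each node still exists, and deletes the Kubernetes Node object once the API reports it as gone. As a minimal sketch in Go (not the actual hcloud-cloud-controller-manager code; client-go, hcloud-go, the one-minute polling interval, and the assumption that node names match server names are all illustrative simplifications):

```go
// Minimal sketch of this kind of node-lifecycle check. NOT the actual
// hcloud-cloud-controller-manager code; polling interval, client setup, and
// the node-name == server-name assumption are simplifications.
package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/hetznercloud/hcloud-go/hcloud"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	kube, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	hc := hcloud.NewClient(hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")))
	ctx := context.Background()

	for {
		nodes, err := kube.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			log.Print(err)
		} else {
			for _, node := range nodes.Items {
				// Ask the Hetzner Cloud API whether the backing server still exists.
				srv, _, err := hc.Server.GetByName(ctx, node.Name)
				if err != nil {
					log.Print(err)
					continue
				}
				if srv == nil {
					// Server is gone in the cloud: remove the Node object from the cluster.
					log.Printf("server for node %s no longer exists, deleting node", node.Name)
					if err := kube.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil {
						log.Print(err)
					}
				}
			}
		}
		time.Sleep(time.Minute)
	}
}
```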
I don't see a lot that could go wrong on the cloud controller manager side - how long did you wait for the node to disappear? If you nevertheless think this is an issue, can you find out which value k3s uses for
@mfrister I tried it again today, and on the first attempt the last output was:
Waiting a little bit more doesn't change the situation:
But after more time the node is removed:
The second time everything was fine, but the third time I gave up:
I'm not sure about that, but with
How long did you wait when you gave up? A cloud controller manager and the kube-controller-manager are two different things. The latter runs most of the important Kubernetes control loops and is still running on your cluster, even if the Kubernetes-built-in cloud controller managers are disabled. So
Oh, sorry, I thought it was clear from the output. It was at least 65 minutes.
I can't find anything about that:
You've linked the ccm. Here is the cm: https://github.com/k3s-io/k3s/blob/v1.19.5+k3s2/pkg/daemons/control/server.go#L130. I can't see anything special. But the log output is:

Running kube-controller-manager --address=127.0.0.1 --allocate-node-cidrs=true --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-cert-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --cluster-signing-key-file=/var/lib/rancher/k3s/server/tls/client-ca.key --kubeconfig=/var/lib/rancher/k3s/server/cred/controller.kubeconfig --port=10252 --profiling=false --root-ca-file=/var/lib/rancher/k3s/server/tls/server-ca.crt --secure-port=0 --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --use-service-account-credentials=true

Be aware that 1.20 is now marked as stable.
Good point, while k3s sets

After 65 minutes, the node should be gone. I've tried to reproduce the issue again and at some point I ended up in the situation you described - the node didn't get removed. When digging into k3s logs, I found this on one of the other servers:
hcloud-cloud-controller-manager gets an event from apiserver:
Still, the node appears in later requests to list nodes, indicating that it hasn't actually been deleted from etcd (or state in etcd is not consistent between the two remaining nodes).

I then realized that you're running 3 Kubernetes masters with k3s' embedded etcd. My current guess is that node removal fails due to apiserver being unable to write to etcd, which doesn't seem to be able to deal with removed nodes in the configuration you're using (maybe this is a problem with k3s' embedded etcd in general).

There are additional indications that k3s' embedded etcd may have problems with node removal:
With all these problems seemingly related to k3s' use of etcd, I believe this is a k3s issue and unrelated to our cloud controller manager, which correctly tells Kubernetes that the node doesn't exist.
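If it helps to confirm the inconsistency described above, one option is to list nodes through each server's apiserver and compare the results. A rough diagnostic sketch (not part of the CCM; the kubeconfig file names are placeholders for the three k3s servers):

```go
// Diagnostic sketch: list nodes via each k3s server's apiserver and print the
// node names, to see whether a deleted node still shows up on some servers.
// The kubeconfig paths below are placeholders.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// One kubeconfig per k3s server, each pointing at that server's apiserver.
	kubeconfigs := []string{"server-1.yaml", "server-2.yaml", "server-3.yaml"}

	for _, kc := range kubeconfigs {
		cfg, err := clientcmd.BuildConfigFromFlags("", kc)
		if err != nil {
			log.Fatal(err)
		}
		client, err := kubernetes.NewForConfig(cfg)
		if err != nil {
			log.Fatal(err)
		}
		nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s:", kc)
		for _, n := range nodes.Items {
			fmt.Printf(" %s", n.Name)
		}
		fmt.Println()
	}
}
```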
@mfrister thank you very much for your help. Seems like a dupe of k3s-io/k3s#2669.
Repro:
Sometimes the last output is:
instead of:
Note: poweroff/poweron aren't necessary, but they make the repro more stable.