
Node isn't deleted from kube after deleted in panel or through cli #142

Closed
Bessonov opened this issue Dec 31, 2020 · 7 comments
Labels
question Further information is requested

Comments

@Bessonov

Repro:

#!/usr/bin/env bash
set -euo pipefail
set -x

export HCLOUD_TOKEN=...
SERVER_KEY=...

CHANNEL=stable # ATM v1.19.5+k3s2
SERVER_TYPE=cx11

SSH_OPS='-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectionAttempts=10'

HCLOUD=hcloud

# setup cluster
for node in k3s-1 k3s-2 k3s-3; do
  $HCLOUD server create --name "$node" --image ubuntu-20.04 --datacenter fsn1-dc14 --type "${SERVER_TYPE}" --ssh-key "$SERVER_KEY"
  SERVER_ID=$($HCLOUD server describe "$node" | grep '^ID:' | awk '{print $2}')
  KUBELET_CONF="--disable-cloud-controller --disable servicelb --kubelet-arg cloud-provider=external --kubelet-arg provider-id=hcloud://$SERVER_ID"
  if [ "$node" == "k3s-1" ]; then
    $HCLOUD server ssh $node $SSH_OPS "curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=$CHANNEL K3S_TOKEN=SERVER_SECRET INSTALL_K3S_EXEC='server --cluster-init $KUBELET_CONF' sh -"
    $HCLOUD server ssh $node $SSH_OPS "kubectl -n kube-system create secret generic hcloud --from-literal=token=$HCLOUD_TOKEN"
    $HCLOUD server ssh $node $SSH_OPS "kubectl apply -f https://raw.githubusercontent.com/hetznercloud/hcloud-cloud-controller-manager/master/deploy/ccm.yaml"
  elif [ "$node" == "k3s-2" ] || [ "$node" == "k3s-3" ]; then
    $HCLOUD server ssh $node $SSH_OPS "curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=$CHANNEL K3S_TOKEN=SERVER_SECRET INSTALL_K3S_EXEC='server --server https://'$($HCLOUD server ip k3s-1)':6443 $KUBELET_CONF' sh -"
  fi
done

sleep 180

# just ensure everything is OK
echo -e "\e[32mExpect Ready:\e[0m"
$HCLOUD server ssh k3s-1 $SSH_OPS "kubectl get nodes"

# $HCLOUD server poweroff k3s-3
# sleep 120

# echo -e "\e[32mExpect NotReady:\e[0m"
# $HCLOUD server poweron k3s-3
# $HCLOUD server ssh k3s-1 $SSH_OPS "kubectl get nodes"
# sleep 120

# after power on and some time after that, cluster is stable again
# echo -e "\e[32mExpect Ready:\e[0m"
$HCLOUD server delete k3s-3
# $HCLOUD server ssh k3s-1 $SSH_OPS "kubectl get nodes"
sleep 120

echo -e "\e[32mExpect absent:\e[0m"
$HCLOUD server ssh k3s-1 $SSH_OPS "kubectl get nodes"

Sometimes the last output is:

NAME    STATUS     ROLES         AGE     VERSION
k3s-1   Ready      etcd,master   7m      v1.19.5+k3s2
k3s-2   Ready      etcd,master   5m59s   v1.19.5+k3s2
k3s-3   NotReady   etcd,master   5m4s    v1.19.5+k3s2

instead of:

NAME    STATUS   ROLES         AGE     VERSION
k3s-1   Ready    etcd,master   6m46s   v1.19.5+k3s2
k3s-2   Ready    etcd,master   5m51s   v1.19.5+k3s2

Note: the poweroff/poweron steps aren't necessary, but they make the repro more reliable.
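For reference, a variant of the final check that polls instead of relying on a fixed sleep (an illustrative sketch, not part of the repro above; the loop bounds are arbitrary):

# Poll until k3s-3 disappears from "kubectl get nodes", for up to ~30 minutes.
for i in $(seq 1 30); do
  if ! $HCLOUD server ssh k3s-1 $SSH_OPS "kubectl get nodes" | grep -q '^k3s-3'; then
    echo "k3s-3 was removed after roughly $((i * 60)) seconds"
    break
  fi
  sleep 60
done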

@mfrister
Member

mfrister commented Jan 4, 2021

I tried your reproduction script 3 times (without the poweroff/poweron steps) and each time the node was removed after a few minutes.

The process of detecting a removed node usually works like this:

  • Nodes regularly send a heartbeat to a master (every NodeStatusUpdateFrequency)
  • If a master hasn't received a heartbeat for NodeMonitorGracePeriod, the controller manager marks the node as NotReady.
  • Another controller manager process verifies with the cloud controller manager whether NotReady nodes still exist.
  • If the cloud controller manager reports the node doesn't exist, Kubernetes removes it.

I don't see a lot that could go wrong on the cloud controller manager side - how long did you wait for the node to disappear?

If you nevertheless think this is an issue, can you find out which value k3s uses for NodeMonitorGracePeriod? It seems to increase the value of NodeStatusUpdateFrequency from the default Kubernetes value of 10 seconds to 1 minute, so I expect the NodeMonitorGracePeriod to be higher than the Kubernetes default of 40 seconds as well. This could mean that with k3s, nodes might simply take a bit longer to disappear.
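One way to observe these timings directly, run on one of the masters (an illustrative sketch; k3s-3 is the node name from the repro):

# Node leases record the last heartbeat each kubelet sent:
kubectl get lease -n kube-node-lease
# Node conditions show when the node was last seen healthy before it flipped to NotReady:
kubectl describe node k3s-3 | grep -A 10 'Conditions:'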

@LKaemmerling added the question label Jan 4, 2021
@Bessonov
Author

Bessonov commented Jan 4, 2021

@mfrister I tried it again today, and on the first attempt the last output was:

Expect absent:
+ hcloud server ssh k3s-1 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ConnectionAttempts=10 'kubectl get nodes'
NAME    STATUS     ROLES         AGE     VERSION
k3s-1   Ready      etcd,master   8m4s    v1.19.5+k3s2
k3s-2   Ready      etcd,master   6m15s   v1.19.5+k3s2
k3s-3   NotReady   etcd,master   5m3s    v1.19.5+k3s2

Waiting a little bit more doesn't change the situation:

root@k3s-1:~# kubectl get nodes
NAME    STATUS     ROLES         AGE   VERSION
k3s-1   Ready      etcd,master   17m   v1.19.5+k3s2
k3s-2   Ready      etcd,master   15m   v1.19.5+k3s2
k3s-3   NotReady   etcd,master   14m   v1.19.5+k3s2

But after some more time the node was removed:

# kubectl get nodes
NAME    STATUS   ROLES         AGE   VERSION
k3s-1   Ready    etcd,master   29m   v1.19.5+k3s2
k3s-2   Ready    etcd,master   27m   v1.19.5+k3s2

The second time everything was fine, but the third time I gave up:

# kubectl get nodes
NAME    STATUS     ROLES         AGE   VERSION
k3s-1   Ready      etcd,master   69m   v1.19.5+k3s2
k3s-2   Ready      etcd,master   68m   v1.19.5+k3s2
k3s-3   NotReady   etcd,master   67m   v1.19.5+k3s2

If you nevertheless think this is an issue, can you find out which value k3s uses for NodeMonitorGracePeriod?

I'm not sure about that, but with the --disable-cloud-controller flag the default cloud controller manager should be disabled, and only the Hetzner CCM is deployed:

# kubectl -n kube-system get pods
NAME                                              READY   STATUS      RESTARTS   AGE
coredns-66c464876b-9qzh8                          1/1     Running     0          26m
hcloud-cloud-controller-manager-7cf4f5974-8dsj4   1/1     Running     0          26m
helm-install-traefik-lfwhx                        0/1     Completed   0          26m
local-path-provisioner-7ff9579c6-985m2            1/1     Running     0          26m
metrics-server-7b4f8b595-wpzcw                    1/1     Running     0          26m
traefik-5dd496474-f2fk2                           1/1     Running     0          26m
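To see what the Hetzner CCM itself reports when the node goes away, its logs can be checked (a sketch; the deployment name is taken from the pod listed above):

kubectl -n kube-system logs deploy/hcloud-cloud-controller-manager | tail -n 20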

@mfrister
Member

mfrister commented Jan 5, 2021

@Bessonov

How long did you wait when you gave up?

A cloud controller manager and the kube-controller-manager are two different things. The latter runs most of the important Kubernetes control loops and is still running on your cluster, even if the Kubernetes-built-in cloud controller managers are disabled. So NodeMonitorGracePeriod is relevant, unless k3s changes something about it.
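If it helps, one way to check whether k3s overrides it (a sketch; it assumes k3s runs as a systemd unit called k3s, as in your setup): the embedded kube-controller-manager logs its full command line on startup, and if --node-monitor-grace-period is absent there, the Kubernetes default applies.

journalctl -u k3s | grep -F 'Running kube-controller-manager'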

@Bessonov
Author

Bessonov commented Jan 5, 2021

@mfrister

How long did you wait when you gave up?

Oh, sorry, I thought it was clear from the output. It was at least 65 minutes.

So NodeMonitorGracePeriod is relevant, unless k3s changes something about it.

I can't find anything about that:

# cat /etc/systemd/system/k3s.service
...
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
	'--cluster-init' \
	'--disable-cloud-controller' \
	'--disable' \
	'servicelb' \
	'--kubelet-arg' \
	'cloud-provider=external' \
	'--kubelet-arg' \
	'provider-id=hcloud://9365506' \

# ll /etc/systemd/system/
total 64
drwxr-xr-x 14 root root 4096 Jan  5 11:33 ./
drwxr-xr-x  5 root root 4096 Dec  7 13:52 ../
drwxr-xr-x  2 root root 4096 Dec  7 13:57 cloud-init.target.wants/
lrwxrwxrwx  1 root root   44 Dec  7 13:48 dbus-org.freedesktop.resolve1.service -> /lib/systemd/system/systemd-resolved.service
lrwxrwxrwx  1 root root   45 Dec  7 13:48 dbus-org.freedesktop.timesync1.service -> /lib/systemd/system/systemd-timesyncd.service
drwxr-xr-x  2 root root 4096 Dec  7 13:48 default.target.wants/
drwxr-xr-x  2 root root 4096 Dec  7 13:51 emergency.target.wants/
drwxr-xr-x  2 root root 4096 Dec  7 13:48 getty.target.wants/
drwxr-xr-x  2 root root 4096 Dec  7 13:50 graphical.target.wants/
-rw-r--r--  1 root root  860 Jan  3 13:37 k3s.service
-r--------  1 root root   24 Jan  3 13:37 k3s.service.env
drwxr-xr-x  2 root root 4096 Dec  7 13:55 mdmonitor.service.wants/
drwxr-xr-x  2 root root 4096 Jan  3 13:37 multi-user.target.wants/
drwxr-xr-x  2 root root 4096 Dec  7 13:55 paths.target.wants/
drwxr-xr-x  2 root root 4096 Dec  7 13:51 rescue.target.wants/
drwxr-xr-x  2 root root 4096 Dec  7 13:55 sockets.target.wants/
lrwxrwxrwx  1 root root   31 Dec  7 13:50 sshd.service -> /lib/systemd/system/ssh.service
drwxr-xr-x  2 root root 4096 Dec  7 13:56 sysinit.target.wants/
lrwxrwxrwx  1 root root   35 Dec  7 13:48 syslog.service -> /lib/systemd/system/rsyslog.service
drwxr-xr-x  2 root root 4096 Dec  7 13:50 timers.target.wants/
# ps aux|grep kube
root      341978  0.0  0.0   8160   664 pts/0    S+   11:35   0:00 grep --color=auto kube
# crictl pods
POD ID              CREATED             STATE               NAME                                              NAMESPACE           ATTEMPT             RUNTIME
4efcf87651440       34 hours ago        Ready               cluster-autoscaler-69fcb8fcd6-x59gs               kube-system         0                   (default)
7b1f4cfcde4ad       45 hours ago        Ready               nginx-5564b68b9-6njfd                             default             0                   (default)
8e433f9004b6a       46 hours ago        Ready               traefik-5dd496474-ccmd8                           kube-system         0                   (default)
d81ebe81a4870       46 hours ago        Ready               local-path-provisioner-7ff9579c6-7xk9c            kube-system         0                   (default)
5183e7b4c8ef0       46 hours ago        Ready               coredns-66c464876b-8bckw                          kube-system         0                   (default)
b6565ca8aeae8       46 hours ago        Ready               metrics-server-7b4f8b595-2r9sj                    kube-system         0                   (default)
411a261475516       46 hours ago        NotReady            helm-install-traefik-h558x                        kube-system         0                   (default)
a6a489e539fb7       46 hours ago        Ready               hcloud-cloud-controller-manager-7cf4f5974-2tcl8   kube-system         0                   (default)

@Bessonov
Author

Bessonov commented Jan 5, 2021

So NodeMonitorGracePeriod is relevant, unless k3s changes something about it.

You've linked the CCM. Here is the CM: https://github.com/k3s-io/k3s/blob/v1.19.5+k3s2/pkg/daemons/control/server.go#L130. I can't see anything special. But the log output is:

Running kube-controller-manager --address=127.0.0.1 --allocate-node-cidrs=true --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-cert-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --cluster-signing-key-file=/var/lib/rancher/k3s/server/tls/client-ca.key --kubeconfig=/var/lib/rancher/k3s/server/cred/controller.kubeconfig --port=10252 --profiling=false --root-ca-file=/var/lib/rancher/k3s/server/tls/server-ca.crt --secure-port=0 --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --use-service-account-credentials=true

Since --node-monitor-grace-period doesn't appear there, the Kubernetes default should apply. Be aware that 1.20 is now marked as stable.

@mfrister
Copy link
Member

mfrister commented Jan 7, 2021

You've linked ccm.

Good point. While k3s sets node-status-update-frequency for the disabled cloud controller manager, it doesn't set it for the kubelets. Sorry for sending you in the wrong direction there.

After 65 minutes, the node should be gone.

I've tried to reproduce the issue again and at some point I ended up in the situation you described - the node didn't get removed.

When digging into k3s logs, I found this on one of the other servers:

Jan 07 09:07:50 k3s-1 k3s[826]: time="2021-01-07T09:07:50.020554240+01:00" level=info msg="Removing name=k3s-3-f2f006dd id=13978701145278854308 address=49.12.41.193 from etcd"

hcloud-cloud-controller-manager gets an event from apiserver:

I0107 08:32:00.232610       1 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"k3s-3", UID:"eec299d5-48ea-48f1-bb76-b3778b64c2a4", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'Deleting node k3s-3 because it does not exist in the cloud provider' Node k3s-3 event: DeletingNode

Still, the node appears in later requests to list nodes, indicating that it hasn't actually been deleted from etcd (or state in etcd is not consistent between the two remaining nodes).

I then realized that you're running 3 Kubernetes masters with k3s' embedded etcd. My current guess is that node removal fails due to apiserver being unable to write to etcd, which doesn't seem to be able to deal with removed nodes in the configuration you're using (maybe this is a problem with k3s' embedded etcd in general).
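One thing that might be worth checking (a sketch; the certificate paths are assumptions about the usual k3s layout and may differ on your install, and etcdctl is not shipped with k3s) is whether the deleted server is still listed as an etcd member on one of the remaining masters:

# Query the embedded etcd directly for its current member list.
ETCD_CERTS=/var/lib/rancher/k3s/server/tls/etcd
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert "$ETCD_CERTS/server-ca.crt" \
  --cert "$ETCD_CERTS/client.crt" \
  --key "$ETCD_CERTS/client.key" \
  member list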

There are additional indications that k3s' embedded etcd may have problems with node removal.

With all these problems seemingly related to k3s' use of etcd, I believe this is a k3s issue and unrelated to our cloud controller manager, which correctly tells Kubernetes that the node doesn't exist.

@Bessonov
Author

Bessonov commented Jan 9, 2021

@mfrister thank you very much for your help. Seems like a dupe of k3s-io/k3s#2669.

@Bessonov closed this as completed Jan 9, 2021