No way to recover failed master node #2138

Closed
ghost opened this issue May 13, 2020 · 11 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@ghost

ghost commented May 13, 2020

What keywords did you search in kubeadm issues before filing this one?

HA ETCd join rejoin control-plane [master node failure] [master node recreation]

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:54:15Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:48:36Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
Vagrant==2.2.9
Virtualbox==6.0.20
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
[...]
  • Kernel (e.g. uname -a):
Linux master2.vagrant 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2 (2020-04-29) x86_64 GNU/Linux
  • Others:
    docker version 19.03.8

What happened? / What you expected to happen?

I have set up three master nodes using kubeadm. The API sits behind HAProxy in TCP mode.

  1. I destroyed a master node
  2. I reinstalled kubeadm on a brand new box
  3. I attempted to rejoin it as a master to the cluster with the original join command.

This was unsuccessful.

How to reproduce it (as minimally and precisely as possible)?

HAProxy in TCP mode round-robins requests to control.vagrant:6443 to port 6443 on the master nodes.
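
A minimal haproxy.cfg along these lines (the frontend/backend names and the master3 address 10.0.0.13 are illustrative, not my exact config):

      frontend k8s-api
          bind *:6443
          mode tcp
          default_backend k8s-masters

      backend k8s-masters
          mode tcp
          balance roundrobin
          # master1/master2 addresses are the ones used below; master3 is assumed
          server master1 10.0.0.11:6443 check
          server master2 10.0.0.12:6443 check
          server master3 10.0.0.13:6443 check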

On the first master node I used the command

      kubeadm init \
        --control-plane-endpoint control.vagrant:6443 \
        --upload-certs \
        --token-ttl 0 \
        --token abcdef.0123456789abcdef \
        --apiserver-advertise-address 10.0.0.11 \
        --certificate-key 0000111122223333444455556666777788889999aaaabbbbccccddddeeeeffff \
        --service-dns-domain cluster.domain \
        --node-name master1.vagrant

This initializes the cluster. I then joined masters 2 and 3 using (minor variations on) the following command:

      kubeadm join control.vagrant:6443 \
        --control-plane \
        --token abcdef.0123456789abcdef \
        --apiserver-advertise-address 10.0.0.12 \
        --certificate-key 0000111122223333444455556666777788889999aaaabbbbccccddddeeeeffff \
        --discovery-token-unsafe-skip-ca-verification \
        --node-name master2.vagrant

This successfully creates a cluster of three master nodes, which works perfectly as expected.

When I destroy the second master, master2, to simulate node failure, I find myself unable to re-add it as a master node (with the previous join command). The hostname (master2.vagrant) and the IP 10.0.0.12 are now completely useless for re-adding master nodes.

Anything else we need to know?

  • The [re]join command does not time out, but remains hung with the following line of output repeating:
15826 etcd.go:480] Failed to get etcd status for https://10.0.0.12:2379: failed to dial endpoint https://10.0.0.12:2379 with maintenance client: context deadline exceeded
  • No docker containers are running on the host
  • The kubelet is restarting and exiting with the following log:
May 13 01:09:33 master2.vagrant systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
May 13 01:09:33 master2.vagrant systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 113.
May 13 01:09:33 master2.vagrant systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
May 13 01:09:33 master2.vagrant systemd[1]: Started kubelet: The Kubernetes Node Agent.
May 13 01:09:33 master2.vagrant kubelet[16178]: F0513 01:09:33.161770   16178 server.go:199] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read
May 13 01:09:33 master2.vagrant systemd[1]: kubelet.service: Main process exited, code=exited, status=255/EXCEPTION
May 13 01:09:33 master2.vagrant systemd[1]: kubelet.service: Failed with result 'exit-code'.
  • Full rejoin log in pastebin
  • Worker nodes recover and re-join as expected; master nodes do not
  • I see the same result regardless of whether I run kubectl delete node master2 on either of the other master nodes between the failure and the recovery
  • The delay between creating the cluster and simulating the node failure has been varied from 30 seconds to about four hours

Why this matters enough to report it

In the environment where we actually run the real cluster, the IP addresses are not flexible. If we cannot recreate a master node "in place", we cannot recover from master node failure.
Even if the fixed IPs were not an issue, the service discovery needed to join master4, 5, 6... would be a considerable overhead.

@fabriziopandini
Member

/triage support
@grahamoptibrium when you lose a master node, you usually have to perform the following manual actions to re-align the cluster state with the new situation:

  • manually clean up the list of endpoints in the kubeadm-config ConfigMap
  • remove the dead etcd member from etcd (see the sketch after this list)

After that, you can re-join a new master node.
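
For example, the dead member can be removed with etcdctl from inside the etcd pod of a surviving control-plane node, roughly like this (the pod name follows the etcd-<node-name> convention, <MEMBER_ID> is a placeholder for the ID printed by member list, and the cert paths are the kubeadm defaults for stacked etcd):

      # list the etcd members from a healthy control-plane node
      kubectl -n kube-system exec etcd-master1.vagrant -- etcdctl \
        --endpoints https://127.0.0.1:2379 \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/peer.crt \
        --key /etc/kubernetes/pki/etcd/peer.key \
        member list

      # remove the member that still advertises the destroyed node (10.0.0.12)
      kubectl -n kube-system exec etcd-master1.vagrant -- etcdctl \
        --endpoints https://127.0.0.1:2379 \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/peer.crt \
        --key /etc/kubernetes/pki/etcd/peer.key \
        member remove <MEMBER_ID>

      # and, if needed, clean the apiEndpoints list by hand
      kubectl -n kube-system edit configmap kubeadm-config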

k8s-ci-robot added the kind/support label on May 13, 2020
@ghost
Author

ghost commented May 13, 2020

@fabriziopandini Thank you for your reply.

This allows me to recover the cluster manually. I additionally found that, for an update in place, editing the ConfigMap was not needed: I only had to remove the etcd member before running the join command again.

I would, however, have naively expected kubeadm join, when run again with the same node name (and possibly IP), to replace the etcd instance...
Mostly because this etcd setup is orchestrated by kubeadm, I would normally expect to be more hands-off.

@neolit123
Member

this opens the question of whether kubeadm should remove an existing etcd member with the same URL from the etcd cluster. my initial reaction would be no.

similarly, later versions of kubeadm do not allow you to join a k8s node with the same name, as this is disruptive to the existing cluster.

the etcdctl maintenance / interaction here seems appropriate for removal of the existing etcd member, but i'd like to hear more opinions about this.

@ghost
Author

ghost commented May 13, 2020

@neolit123 if I may add "manual intervention sucks!" to the discussion...

(Then I will be quiet and let others weigh in!)

@neolit123
Member

@grahamhayes

When I destroy the second master, master2, to simulate node failure, I find myself unable to re-add it as a master node (with the previous join command). The hostname (master2.vagrant) and the IP 10.0.0.12 are now completely useless for re-adding master nodes.

hi, how are you destroying the second master to simulate node failure?
kubectl delete node master2 is not exactly a node failure; it deletes an API object managed by the api-server. a node failure puts the node in a NotReady state, which you can simulate with a node shutdown.
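
for example (assuming the vagrant setup you described, not something kubeadm requires), either of these would leave the node NotReady instead of removing it:

      # from the vagrant host: power the VM off without deleting its disks
      vagrant halt master2

      # or, on the node itself
      sudo systemctl poweroff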

@ghost
Author

ghost commented May 13, 2020

@neolit123 I have created all three nodes in a single multi-box Vagrantfile.

Destruction of the box was done with the command vagrant destroy master2, which performs an immediate power-off of the virtual machine and deletes its hard disks.

When the machine is recreated with the command vagrant up master2, it automatically runs my shell/Ansible provisioner, which installs packages such as kubeadm, kubelet, Docker, etc.; it then attempts to run the previously mentioned join command in an effort to rejoin the collective.

Running the kubectl command to remove it from the API was just an unsuccessful last-ditch attempt at getting the join command to work again.

@neolit123
Member

neolit123 commented May 13, 2020

understood,

[1] if the node is wiped completely, kubectl delete node is already a mandatory manual step, as kubeadm join will fail saying that a node with the same name already exists. this is a precaution against users shooting themselves in the foot and bringing their control-plane down due to a race in etcd.
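
i.e. something like this on one of the healthy control-plane nodes before re-running join (node name taken from your cluster):

      # check whether the old Node object is still registered
      kubectl get nodes

      # delete the stale Node object so that "kubeadm join" can reuse the name
      kubectl delete node master2.vagrant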

failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read

this log line is not relevant, as kubeadm join should later fetch the contents of the file from the cluster and write it to disk, causing the kubelet (managed by systemd) to pick it up and start (stop crash-looping).
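
you can watch this happen on the joining node with standard systemd tooling, e.g.:

      # follow the kubelet logs; the crash loop should stop once join writes /var/lib/kubelet/config.yaml
      journalctl -u kubelet -f

      # or check the current service state
      systemctl status kubelet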

Mostly because this etcd setup is orchestrated by kubeadm, I would normally expect to be more hands-off.

ideally yes, but it's not very simple and needs design / evaluation - e.g. if an etcd member is joining from an IP, but a member with the same IP already exists, is kubeadm supposed to auto-delete that existing member?
what if someone re-used the IP of an existing member by mistake, even though the new node is not actually that member? this can be tricky.

there is more discussion in #2095, so i suggest you follow that and we can close this ticket.

but given [1], manual intervention is already a fact.

@ghost
Author

ghost commented May 13, 2020

@neolit123 Thank you for taking the time to address this one!

I apologize: I definitely do not wish to appear contradictory, but I shall mention this only in the spirit of being a good tester ^_^

if the node is wiped completely, kubectl delete node is already a mandatory manual step, as kubeadm join will fail saying that a node with the same name already exists

If this is the expected behaviour, then it is functioning incorrectly, for I am able to rejoin a completely new node with the same name and IP without touching kubectl. The only seemingly required step is to remove the unhealthy etcd member with etcdctl on one of the other master nodes...

I agree with your prognosis on foot-shooting. The two options I might suggest, which would fit the recover-in-place use case, are:

kubeadm join --allow-rejoin  # Will kill ETCd pod if present on that hostname.
kubeadm remove --node-id <some remote node>  # Will delete node and remove ETCd pod

@neolit123
Member

neolit123 commented May 13, 2020

If this is the expected behaviour, then it is functioning incorrectly, for I am able to rejoin a completely new node with the same name and IP without touching kubectl

what is the full output of kubeadm join .... --v=1?
also, before running kubeadm join, please give the output of kubectl get no and the name of this new node.

kubeadm join --allow-rejoin

related to the etcd idempotency topic, please make your proposals to the issue that i linked, so that others can see it too.

kubeadm remove --node-id # Will delete node and remove ETCd pod

it is out of scope for a kubeadm command to delete the Node object.

@ghost
Author

ghost commented May 13, 2020

before running kubeadm join please give the output of kubectl get no and what the name of this new node is.

The replacement node is called master2.vagrant.

root@master3:/home/vagrant# kubectl get nodes
NAME              STATUS     ROLES    AGE   VERSION
master1.vagrant   Ready      master   23h   v1.18.2
master2.vagrant   NotReady   master   12h   v1.18.2
master3.vagrant   Ready      master   23h   v1.18.2
worker1.vagrant   NotReady   <none>   23h   v1.18.2
worker2.vagrant   Ready      <none>   23h   v1.18.2
worker3.vagrant   Ready      <none>   23h   v1.18.2

what is the full output of kubeadm join .... --v=1?
I have included it in the following pastebin

@neolit123
Member

ok, so i forgot an important detail:

I0513 23:38:12.241165 15072 kubelet.go:145] [kubelet-start] Checking for an existing Node in the cluster with name "master2.vagrant" and status "Ready"

it only fails if the existing node is Ready, so in your case this is working as expected for the node "master2.vagrant". the idea is to not break existing Ready (working) nodes with "join".
