No way to recover failed master node #2138

Closed
ghost opened this issue May 13, 2020 · 11 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@ghost

ghost commented May 13, 2020

What keywords did you search in kubeadm issues before filing this one?

HA ETCd join rejoin control-plane [master node failure] [master node recreation]

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:54:15Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:48:36Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
Vagrant==2.2.9
Virtualbox==6.0.20
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
[...]
  • Kernel (e.g. uname -a):
Linux master2.vagrant 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2 (2020-04-29) x86_64 GNU/Linux
  • Others:
    docker version 19.03.8

What happened? / What you expected to happen?

I have set up three master nodes using kubeadm. The API sits behind HAProxy in TCP mode.

  1. I destroyed a master node
  2. I reinstalled kubeadm on a brand new box
  3. I attempted to rejoin it as a master to the cluster with the original join command.

This was unsuccessful.

How to reproduce it (as minimally and precisely as possible)?

HAProxy in TCP mode round-robins requests to control.vagrant:6443 to port 6443 on the master nodes.
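
A minimal haproxy.cfg along these lines (the frontend/backend names and the master3 address 10.0.0.13 are illustrative, not my exact config):

      frontend k8s-api
          bind *:6443
          mode tcp
          default_backend k8s-masters

      backend k8s-masters
          mode tcp
          balance roundrobin
          # master1/master2 addresses are the ones used below; master3 is assumed
          server master1 10.0.0.11:6443 check
          server master2 10.0.0.12:6443 check
          server master3 10.0.0.13:6443 check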

On the first master node I used the command

      kubeadm init \
        --control-plane-endpoint control.vagrant:6443 \
        --upload-certs \
        --token-ttl 0 \
        --token abcdef.0123456789abcdef \
        --apiserver-advertise-address 10.0.0.11 \
        --certificate-key 0000111122223333444455556666777788889999aaaabbbbccccddddeeeeffff \
        --service-dns-domain cluster.domain \
        --node-name master1.vagrant

This initializes the cluster. I then joined masters 2 and 3 using (minor variations on) the following command:

      kubeadm join control.vagrant:6443 \
        --control-plane \
        --token abcdef.0123456789abcdef \
        --apiserver-advertise-address 10.0.0.12 \
        --certificate-key 0000111122223333444455556666777788889999aaaabbbbccccddddeeeeffff \
        --discovery-token-unsafe-skip-ca-verification \
        --node-name master2.vagrant

This successfully creates a cluster of three master nodes, which works perfectly as expected.

When I destroy the second master, master2, to simulate node failure, I find myself unable to re-add it as a master node (with the previous join command). The hostname (master2.vagrant) and the IP 10.0.0.12 are now completely useless for re-adding master nodes.

Anything else we need to know?

  • The [re]join command does not time out, but remains hung with the following line of output repeating:
15826 etcd.go:480] Failed to get etcd status for https://10.0.0.12:2379: failed to dial endpoint https://10.0.0.12:2379 with maintenance client: context deadline exceeded
  • No docker containers are running on the host
  • The kubelet is restarting and exiting with the following log:
May 13 01:09:33 master2.vagrant systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
May 13 01:09:33 master2.vagrant systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 113.
May 13 01:09:33 master2.vagrant systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
May 13 01:09:33 master2.vagrant systemd[1]: Started kubelet: The Kubernetes Node Agent.
May 13 01:09:33 master2.vagrant kubelet[16178]: F0513 01:09:33.161770   16178 server.go:199] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read
May 13 01:09:33 master2.vagrant systemd[1]: kubelet.service: Main process exited, code=exited, status=255/EXCEPTION
May 13 01:09:33 master2.vagrant systemd[1]: kubelet.service: Failed with result 'exit-code'.
  • Full rejoin log in pastebin
  • Worker nodes recover and re-join as expected; master nodes do not
  • I see the same result regardless of whether I run kubectl delete node master2 on either of the other master nodes between the failure and the recovery
  • The delay between creating the cluster and simulating the node failure has been varied from 30 seconds to about four hours

Why this matters enough to report it

In the environment where we actually run the real cluster, the IP addresses are not flexible. If we cannot recreate a master node "in place", we cannot recover from master node failure.
Even if the fixed IPs were not an issue, the service discovery needed to join master4, 5, 6... would be a considerable overhead.

@fabriziopandini
Member

/triage support
@grahamoptibrium when you lose a master node, you usually have to perform the following manual actions to re-align the cluster state with the new situation:

  • manually clean up the list of endpoints in the kubeadm-config ConfigMap
  • remove the dead etcd member from etcd (see the sketch after this list)

After that, you can re-join a new master node.
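
For example, the dead member can be removed with etcdctl from inside the etcd pod of a surviving control-plane node, roughly like this (the pod name follows the etcd-<node-name> convention, <MEMBER_ID> is a placeholder for the ID printed by member list, and the cert paths are the kubeadm defaults for stacked etcd):

      # list the etcd members from a healthy control-plane node
      kubectl -n kube-system exec etcd-master1.vagrant -- etcdctl \
        --endpoints https://127.0.0.1:2379 \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/peer.crt \
        --key /etc/kubernetes/pki/etcd/peer.key \
        member list

      # remove the member that still advertises the destroyed node (10.0.0.12)
      kubectl -n kube-system exec etcd-master1.vagrant -- etcdctl \
        --endpoints https://127.0.0.1:2379 \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/peer.crt \
        --key /etc/kubernetes/pki/etcd/peer.key \
        member remove <MEMBER_ID>

      # and, if needed, clean the apiEndpoints list by hand
      kubectl -n kube-system edit configmap kubeadm-config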

k8s-ci-robot added the kind/support label on May 13, 2020
@ghost
Author

ghost commented May 13, 2020

@fabriziopandini Thank you for your reply.

This allows me to recover the cluster manually. I additionally found that, for an update in place, editing the ConfigMap was not needed: I only had to remove the etcd member before running the join command again.

I would, however, have naively expected kubeadm join, when run again with the same node name (and possibly IP), to replace the etcd instance...
Mostly because this etcd setup is orchestrated by kubeadm, I would normally expect to be more hands-off.

@neolit123
Member

this opens the question of whether kubeadm should remove an existing etcd member with the same URL from the etcd cluster. my initial reaction would be no.

similarly, later versions of kubeadm do not allow you to join a k8s node with the same name, as this is disruptive to the existing cluster.

the etcdctl maintenance / interaction here seems appropriate for removal of the existing etcd member, but i'd like to hear more opinions about this.

@ghost
Author

ghost commented May 13, 2020

@neolit123 if I may add "manual intervention sucks!" to the discussion...

(Then I will be quiet and let others weigh in!)

@neolit123
Member

@grahamhayes

When I destroy the second master, master2, to simulate node failure, I find myself unable to re-add it as a master node (with the previous join command). The hostname (master2.vagrant) and the IP 10.0.0.12 are now completely useless for re-adding master nodes.

hi, how are you destroying the second master to simulate node failure?
kubectl delete node master2 is not exactly a node failure; it deletes an API object managed by the api-server. a node failure puts the node in a NotReady state, which you can simulate with a node shutdown.
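
for example (assuming the vagrant setup you described, not something kubeadm requires), either of these would leave the node NotReady instead of removing it:

      # from the vagrant host: power the VM off without deleting its disks
      vagrant halt master2

      # or, on the node itself
      sudo systemctl poweroff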

@ghost
Author

ghost commented May 13, 2020

@neolit123 I have created all three nodes in a single multi-box Vagrantfile.

Destruction of the box was done with the command vagrant destroy master2, which performs an immediate power-off of the virtual machine and deletes its hard disks.

When the machine is recreated with the command vagrant up master2, it automatically runs my shell/Ansible provisioner, which installs packages such as kubeadm, kubelet, Docker, etc.; it then attempts to run the previously mentioned join command in an effort to rejoin the collective.

Running the kubectl command to remove it from the API was just an unsuccessful last-ditch attempt at getting the join command to work again.

@neolit123
Member

neolit123 commented May 13, 2020

understood,

[1] if the node is wiped completely, kubectl delete node is already a mandatory manual step, as kubeadm join will fail saying that a node with the same name already exists. this is a precaution against users shooting themselves in the foot and bringing their control-plane down due to a race in etcd.
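
i.e. something like this on one of the healthy control-plane nodes before re-running join (node name taken from your cluster):

      # check whether the old Node object is still registered
      kubectl get nodes

      # delete the stale Node object so that "kubeadm join" can reuse the name
      kubectl delete node master2.vagrant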

failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read

this log line is not relevant, as kubeadm join should later fetch the contents of the file from the cluster and write it to disk, causing the kubelet (managed by systemd) to pick it up and start (stop crash-looping).
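
you can watch this happen on the joining node with standard systemd tooling, e.g.:

      # follow the kubelet logs; the crash loop should stop once join writes /var/lib/kubelet/config.yaml
      journalctl -u kubelet -f

      # or check the current service state
      systemctl status kubelet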

Mostly because this etcd setup is orchestrated by kubeadm, I would normally expect to be more hands-off.

ideally yes, but it's not very simple and needs design / evaluation - e.g. if an etcd member is joining from an IP, but a member with the same IP already exists, is kubeadm supposed to auto-delete that existing member?
what if someone re-used the IP of an existing member by mistake, even though the new node is not actually that member? this can be tricky.

there is more discussion in #2095, so i suggest you follow that and we can close this ticket.

but given [1], manual intervention is already a fact.

@ghost
Author

ghost commented May 13, 2020

@neolit123 Thank you for taking the time to address this one!

I apologize: I definitely do not wish to appear contradictory, but I shall mention this only in the spirit of being a good tester ^_^

if the node is wiped completely, kubectl delete node is already a mandatory manual step, as kubeadm join will fail saying that a node with the same name already exists

If this is the expected behaviour, then it is functioning incorrectly, for I am able to rejoin a completely new node with the same name and IP without touching kubectl. The only seemingly required step is to remove the unhealthy etcd member with etcdctl on one of the other master nodes...

I agree with your prognosis on foot-shooting. The two options I might suggest, which would fit the recover-in-place use case, are:

kubeadm join --allow-rejoin  # Will kill ETCd pod if present on that hostname.
kubeadm remove --node-id <some remote node>  # Will delete node and remove ETCd pod

@neolit123
Member

neolit123 commented May 13, 2020

If this is the expected behaviour, then it is functioning incorrectly, for I am able to rejoin a completely new node with the same name and IP without touching kubectl

what is the full output of kubeadm join .... --v=1?
also, before running kubeadm join, please give the output of kubectl get no and the name of this new node.

kubeadm join --allow-rejoin

related to the etcd idempotency topic, please make your proposals to the issue that i linked, so that others can see it too.

kubeadm remove --node-id # Will delete node and remove ETCd pod

it is out of scope for a kubeadm command to delete the Node object.

@ghost
Author

ghost commented May 13, 2020

before running kubeadm join please give the output of kubectl get no and what the name of this new node is.

The replacement node is called master2.vagrant.

root@master3:/home/vagrant# kubectl get nodes
NAME              STATUS     ROLES    AGE   VERSION
master1.vagrant   Ready      master   23h   v1.18.2
master2.vagrant   NotReady   master   12h   v1.18.2
master3.vagrant   Ready      master   23h   v1.18.2
worker1.vagrant   NotReady   <none>   23h   v1.18.2
worker2.vagrant   Ready      <none>   23h   v1.18.2
worker3.vagrant   Ready      <none>   23h   v1.18.2

what is the full output of kubeadm join .... --v=1?
I have included it in the following pastebin

@neolit123
Member

ok, so i forgot an important detail:

I0513 23:38:12.241165 15072 kubelet.go:145] [kubelet-start] Checking for an existing Node in the cluster with name "master2.vagrant" and status "Ready"

it only fails if the existing node is Ready, so in your case this is working as expected for the node "master2.vagrant". the idea is to not break existing Ready (working) nodes with "join".
