
Unable to add a node after removing a failed node using embedded etcd #2732

Closed
jrote1 opened this issue Dec 21, 2020 · 22 comments

Comments

@jrote1

jrote1 commented Dec 21, 2020

Environmental Info:
K3s Version: k3s version v1.19.5+k3s1 (b11612e)

Node(s) CPU architecture, OS, and Version: Linux master-02 5.4.0-56-generic #62-Ubuntu SMP Mon Nov 23 19:20:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
Originally there were 3 masters; one died and was removed. The cluster is now running with 2 masters, but I cannot re-add a 3rd master.

Describe the bug:
When trying to add the 3rd master, I get the following error:

Dec 21 15:00:45 master-03 k3s[1952]: time="2020-12-21T15:00:45.903306802Z" level=info msg="Adding https://192.168.0.13:2380 to etcd cluster [master-01-9d17c397=https://192.168.0.11:2380 master-03-e2ec81cc=https://192.168.0.13:2380 master-02-8a1215a5=https://192.168.0.12:2380]"
Dec 21 15:00:45 master-03 k3s[1952]: {"level":"warn","ts":"2020-12-21T15:00:45.905Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-247cfa63-67b2-477d-8202-8891774f3d4f/192.168.0.11:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
Dec 21 15:00:45 master-03 k3s[1952]: time="2020-12-21T15:00:45.906496636Z" level=fatal msg="starting kubernetes: preparing server: start managed database: joining etcd cluster: etcdserver: unhealthy cluster"

Strangely, it looks like the cluster is still trying to communicate with the dead node, even though it no longer appears in the node list:

Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.711Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"787a9a69bcefecea","rtt":"0s","error":"dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.712Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"787a9a69bcefecea","rtt":"0s","error":"dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.727Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://192.168.0.13:2380/version","remote-member-id":"787a9a69bcefecea","error":"Get \"https://192.168.0.13:2380/version\": dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.727Z","caller":"etcdserver/cluster_util.go:168","msg":"failed to get version","remote-member-id":"787a9a69bcefecea","error":"Get \"https://192.168.0.13:2380/version\": dial tcp 192.168.0.13:2380: connect: connection refused"}
root@master-02:/home/jrote1# kubectl get node
NAME        STATUS   ROLES         AGE     VERSION
master-01   Ready    etcd,master   22d     v1.19.5+k3s1
master-02   Ready    etcd,master   6d19h   v1.19.5+k3s1

Steps To Reproduce:

  • Install K3s
  • Add 3 master nodes
  • Master 3 dies, so remove it from the Kubernetes cluster (a sketch of the removal command follows below)
  • Attempt to re-add master 3
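For reference, removing the failed server from the Kubernetes cluster in the third step would normally just be the standard kubectl command, with the node name taken from this report:

kubectl delete node master-03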
@brandond
Member

This seems to be related to the issue discussed here: #2533 (comment)

The member maintenance function should probably check for unhealthy (unreachable) fully-promoted etcd cluster members, and remove them from the cluster if there is not a corresponding Kubernetes cluster member.
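A rough way to spot that situation by hand (just a sketch, not the controller logic itself, reusing the etcdctl certificate flags from the workaround below):

# k3s names etcd members "<nodename>-<random suffix>" (e.g. master-03-e2ec81cc),
# so strip the suffix before comparing against the Kubernetes node list.
ETCDCTL="etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt"

$ETCDCTL member list | awk -F', ' '{print $3}' | sed 's/-[0-9a-f]*$//' | sort > /tmp/etcd-members
kubectl get nodes -o name | sed 's|node/||' | sort > /tmp/k8s-nodes

# Anything printed here is an etcd member with no corresponding Kubernetes node
comm -23 /tmp/etcd-members /tmp/k8s-nodes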

@brandond
Member

@jrote1 are you able to collect logs from the two remaining servers covering the time period when you removed the failed node from the cluster? Nodes are supposed to be removed from the etcd cluster when you remove them from the Kubernetes cluster, so I'm curious to see if there are any logs showing how that failed.

Can you run journalctl --no-pager -u k3s &> k3s-$(hostname).log on each server and attach the resulting files?

@brandond
Member

brandond commented Dec 21, 2020

As a temporary workaround, you should be able to do the following on one of the working nodes:

kubectl run --rm --tty --stdin --image docker.io/bitnami/etcd:latest etcdctl --overrides='{"apiVersion":"v1","kind":"Pod","spec":{"hostNetwork":true,"restartPolicy":"Never","securityContext":{"runAsUser":0,"runAsGroup":0},"containers":[{"command":["/bin/bash"],"image":"docker.io/bitnami/etcd:latest","name":"etcdctl","stdin":true,"stdinOnce":true,"tty":true,"volumeMounts":[{"mountPath":"/var/lib/rancher","name":"var-lib-rancher"}]}],"volumes":[{"name":"var-lib-rancher","hostPath":{"path":"/var/lib/rancher","type":"Directory"}}]}}'

In the resulting shell, run the following command:

./bin/etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove master-03-e2ec81cc
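If member remove complains about the member name, list the members first and remove by the hexadecimal member ID instead (the peer logs above suggest the stale master-03 entry is 787a9a69bcefecea):

./bin/etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member list

./bin/etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove 787a9a69bcefecea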

@MalibuKoKo

I have the same issue but in a different environment:

  • arch: arm64
  • version: v1.19.4+k3s

I used the rancher/coreos-etcd image from Docker Hub: https://hub.docker.com/r/rancher/coreos-etcd/tags?page=1&ordering=last_updated

Currently the latest tag for my arch is v3.4.13-arm64.

I added a nodeSelector rule so that the pod is deployed on a working node with the etcd role:

kubectl run --rm --tty --stdin --image docker.io/rancher/coreos-etcd:v3.4.13-arm64 etcdctl --overrides='{"apiVersion":"v1","kind":"Pod","spec":{"hostNetwork":true,"restartPolicy":"Never","securityContext":{"runAsUser":0,"runAsGroup":0},"containers":[{"command":["/bin/sh"],"image":"docker.io/rancher/coreos-etcd:v3.4.13-arm64","name":"etcdctl","stdin":true,"stdinOnce":true,"tty":true,"volumeMounts":[{"mountPath":"/var/lib/rancher","name":"var-lib-rancher"}]}],"volumes":[{"name":"var-lib-rancher","hostPath":{"path":"/var/lib/rancher","type":"Directory"}}],"nodeSelector":{"node-role.kubernetes.io/etcd":"true"}}}'

You can list the etcd members:

etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member list

You can remove the stale member:

etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove 1234567890ABCDEF

Then uninstall k3s on the node you want to re-add:

k3s-uninstall.sh

Finally, add it back into your cluster. It works!
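For the re-add, joining the reinstalled node back as an etcd/control-plane server is roughly the standard k3s HA install command (token and server address below are placeholders):

# <token> comes from /var/lib/rancher/k3s/server/node-token on a surviving server
curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - server --server https://<healthy-master>:6443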

@jrote1
Author

jrote1 commented Dec 22, 2020

I have the log files for the day it broke, but they are too big for GitHub. Any preference on where I should upload them?

@brandond
Member

brandond commented Dec 22, 2020

@jrote1 can you share from Google Drive or Dropbox perhaps? I'm assuming they're too large even after compression?

@brandond
Member

Thanks. You should see an access request from my (personal) gmail account.

@jrote1
Author

jrote1 commented Dec 23, 2020

Already accepted @brandond

@brandond
Member

brandond commented Dec 23, 2020

Thanks for the logs! FWIW I think they might have been small enough for GitHub if gzip'd.

There's a lot of this in there, which isn't a good sign:

Dec 21 00:00:13 master-01 k3s[4000764]: {"level":"warn","ts":"2020-12-21T00:00:13.000Z","caller":"etcdserver/raft.go:390","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"b51f0190e848738d","heartbeat-interval":"500ms","expected-duration":"1s","exceeded-duration":"1.672987815s"}
Dec 21 00:00:13 master-01 k3s[4000764]: {"level":"warn","ts":"2020-12-21T00:00:13.000Z","caller":"etcdserver/raft.go:390","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"787a9a69bcefecea","heartbeat-interval":"500ms","expected-duration":"1s","exceeded-duration":"1.673190469s"}

Dec 21 00:00:13 master-01 k3s[4000764]: {"level":"warn","ts":"2020-12-21T00:00:13.117Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"9.972996125s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/services/endpoints/kube-system/kube-scheduler\" ","response":"","error":"context canceled"}
Dec 21 00:00:13 master-01 k3s[4000764]: {"level":"info","ts":"2020-12-21T00:00:13.119Z","caller":"traceutil/trace.go:145","msg":"trace[1873224372] range","detail":"{range_begin:/registry/services/endpoints/kube-system/kube-scheduler; range_end:; }","duration":"9.974944152s","start":"2020-12-21T00:00:03.144Z","end":"2020-12-21T00:00:13.119Z","steps":["trace[1873224372] 'agreement among raft nodes before linearized reading'  (duration: 9.972988738s)"]}

It looks like k3s has been crashing because etcd can't keep up with controller leader election requests:

Dec 21 00:02:06 master-01 k3s[4000764]: E1221 00:02:06.880465 4000764 leaderelection.go:361] Failed to update lock: Put "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager": context deadline exceeded
Dec 21 00:02:06 master-01 k3s[4000764]: E1221 00:02:06.880565 4000764 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
Dec 21 00:02:06 master-01 k3s[4000764]: I1221 00:02:06.884716 4000764 leaderelection.go:278] failed to renew lease kube-system/cloud-controller-manager: timed out waiting for the condition
Dec 21 00:02:06 master-01 k3s[4000764]: F1221 00:02:06.886574 4000764 controllermanager.go:220] leaderelection lost

Are you running Longhorn and etcd on the same disk? Etcd is VERY touchy about fsync latency and doesn't really like to have too much else sharing IO with it.
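If you want to confirm whether disk latency is the culprit, etcd exposes its fsync timings as Prometheus metrics; a quick way to pull them on a k3s server (assuming the default client port 2379 on localhost and the same client certs used above):

# etcd's guidance is that the 99th percentile of wal_fsync_duration_seconds should
# stay under roughly 10ms; sustained higher values mean the disk can't keep up.
curl -s --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt https://127.0.0.1:2379/metrics | grep -E 'wal_fsync_duration_seconds|backend_commit_duration_seconds'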

I don't see the etcd controller attempting to remove the member from the cluster though, which is odd. It only logs the removal if it can find it in the cluster and is attempting to remove it, which isn't super helpful. I'm about to rewrite some of this and will add some more logging when I do so that we can hopefully figure out what's causing this if it happens again.

@jrote1
Author

jrote1 commented Dec 23, 2020

I am currently running etcd on the same disk as Longhorn. I plan to move Longhorn onto a separate disk, but I have not bought the disks yet.
Let me double-check a bit later to make sure the time frame of when I removed the node is covered by the logs.

@brandond
Member

I do see the node removal event from a different controller, but the etcd controller doesn't appear to be handling it:

Dec 21 14:24:27 master-02 k3s[2448291]: I1221 14:24:27.230616 2448291 event.go:291] "Event occurred" object="master-03" kind="Node" apiVersion="v1" type="Normal" reason="RemovingNode" message="Node master-03 event: Removing Node master-03 from Controller"

@xiaods
Contributor

xiaods commented Feb 4, 2021

I can reproduce the problem reliably: xiaods/k8e#36
It seems to be a k8s controller-manager issue.

@xiaods
Contributor

xiaods commented Feb 5, 2021

Any suggestions or updates?

@brandond
Member

brandond commented Feb 5, 2021

Waiting on another PR to land and then I'm going to rework etcd cluster membership cleanup.

@xiaods
Contributor

xiaods commented Mar 9, 2021

@brandond I have reviewed the latest commit on the 1.19-release branch; it seems to resolve this case.

@xiaods
Contributor

xiaods commented Mar 11, 2021

Confirmed, it's resolved.

@xiaods
Contributor

xiaods commented Mar 11, 2021

@jrote1 please double-check.

@mindkeep

mindkeep commented Apr 6, 2021

Not sure if this should be a new case or not, but I'm able to pretty readily break embedded etcd with the following sequence:
% k3d version
k3d version v4.4.1
k3s version v1.20.5-k3s1 (default)
% cat break-k3d.sh
#!/bin/bash
set -x
k3d cluster create --servers 3 --agents 3 test-k3d
docker stop k3d-test-k3d-server-0
docker stop k3d-test-k3d-agent-1
docker start k3d-test-k3d-agent-1
docker start k3d-test-k3d-server-0
sleep 120
docker logs k3d-test-k3d-server-0
k3d cluster delete test-k3d
%

where we wind up with this message repeated:

time="2021-04-06T21:20:30.180484636Z" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [k3d-test-k3d-server-1-78ae1551=https://172.20.0.3:2380 k3d-test-k3d-server-0-3e94d5e1=https://172.20.0.2:2380 k3d-test-k3d-server-2-31d30714=https://172.20.0.4:2380], expect: k3d-test-k3d-server-0-3e94d5e1=172.20.0.6"

Does this look like the same problem? Or should I open a new issue here?

@brandond
Member

brandond commented Apr 6, 2021

@mindkeep That looks like a k3d issue and is not related to what's going on here - the docker container is getting a new address when you stop and start it (k3d-test-k3d-server-0-3e94d5e1=https://172.20.0.2:2380 vs k3d-test-k3d-server-0-3e94d5e1=172.20.0.6)
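A quick way to see that mismatch from the host (container name taken from the script above):

# Current container IP after the stop/start...
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' k3d-test-k3d-server-0
# ...versus the peer URL etcd still has on record for that member
# (k3d-test-k3d-server-0-3e94d5e1=https://172.20.0.2:2380 in the error above).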

@mindkeep

mindkeep commented Apr 6, 2021

@brandond Ah, that makes sense. I'll bring it up there. Thanks for the quick response.
(k3d-io/k3d#550)

@stale

stale bot commented Oct 4, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
