
Unable to add a node after removing a failed node using embedded etcd #2732

Closed
jrote1 opened this issue Dec 21, 2020 · 22 comments

Comments

@jrote1

jrote1 commented Dec 21, 2020

Environmental Info:
K3s Version: k3s version v1.19.5+k3s1 (b11612e)

Node(s) CPU architecture, OS, and Version: Linux master-02 5.4.0-56-generic #62-Ubuntu SMP Mon Nov 23 19:20:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
Originally there were 3 masters; one died and was removed. The cluster is now running with 2 masters, but I cannot re-add a 3rd master.

Describe the bug:
When trying to add the 3rd master, I get the following error:

Dec 21 15:00:45 master-03 k3s[1952]: time="2020-12-21T15:00:45.903306802Z" level=info msg="Adding https://192.168.0.13:2380 to etcd cluster [master-01-9d17c397=https://192.168.0.11:2380 master-03-e2ec81cc=https://192.168.0.13:2380 master-02-8a1215a5=https://192.168.0.12:2380]"
Dec 21 15:00:45 master-03 k3s[1952]: {"level":"warn","ts":"2020-12-21T15:00:45.905Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-247cfa63-67b2-477d-8202-8891774f3d4f/192.168.0.11:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
Dec 21 15:00:45 master-03 k3s[1952]: time="2020-12-21T15:00:45.906496636Z" level=fatal msg="starting kubernetes: preparing server: start managed database: joining etcd cluster: etcdserver: unhealthy cluster"

Strangely, it looks like the cluster is still trying to communicate with the dead node, even though it no longer appears in the node list:

Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.711Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"787a9a69bcefecea","rtt":"0s","error":"dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.712Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"787a9a69bcefecea","rtt":"0s","error":"dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.727Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://192.168.0.13:2380/version","remote-member-id":"787a9a69bcefecea","error":"Get \"https://192.168.0.13:2380/version\": dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.727Z","caller":"etcdserver/cluster_util.go:168","msg":"failed to get version","remote-member-id":"787a9a69bcefecea","error":"Get \"https://192.168.0.13:2380/version\": dial tcp 192.168.0.13:2380: connect: connection refused"}
root@master-02:/home/jrote1# kubectl get node
NAME        STATUS   ROLES         AGE     VERSION
master-01   Ready    etcd,master   22d     v1.19.5+k3s1
master-02   Ready    etcd,master   6d19h   v1.19.5+k3s1

Steps To Reproduce:

  • Install K3s
  • Add 3 master nodes
  • Master 3 dies, so remove it from the Kubernetes cluster (a sketch of the removal command follows below)
  • Attempt to re-add master 3
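For reference, removing the failed server from the Kubernetes cluster in the third step would normally just be the standard kubectl command, with the node name taken from this report:

kubectl delete node master-03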
@brandond
Member

This seems to be related to the issue discussed here: #2533 (comment)

The member maintenance function should probably check for unhealthy (unreachable) fully-promoted etcd cluster members, and remove them from the cluster if there is not a corresponding Kubernetes cluster member.
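A rough way to spot that situation by hand (just a sketch, not the controller logic itself, reusing the etcdctl certificate flags from the workaround below):

# k3s names etcd members "<nodename>-<random suffix>" (e.g. master-03-e2ec81cc),
# so strip the suffix before comparing against the Kubernetes node list.
ETCDCTL="etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt"

$ETCDCTL member list | awk -F', ' '{print $3}' | sed 's/-[0-9a-f]*$//' | sort > /tmp/etcd-members
kubectl get nodes -o name | sed 's|node/||' | sort > /tmp/k8s-nodes

# Anything printed here is an etcd member with no corresponding Kubernetes node
comm -23 /tmp/etcd-members /tmp/k8s-nodes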

@brandond
Member

@jrote1 are you able to collect logs from the two remaining servers covering the time period when you removed the failed node from the cluster? Nodes are supposed to be removed from the etcd cluster when you remove them from the Kubernetes cluster, so I'm curious to see if there are any logs showing how that failed.

Can you run journalctl --no-pager -u k3s &> k3s-$(hostname).log on each server and attach the resulting files?

@brandond
Member

brandond commented Dec 21, 2020

As a temporary workaround, you should be able to do the following on one of the working nodes:

kubectl run --rm --tty --stdin --image docker.io/bitnami/etcd:latest etcdctl --overrides='{"apiVersion":"v1","kind":"Pod","spec":{"hostNetwork":true,"restartPolicy":"Never","securityContext":{"runAsUser":0,"runAsGroup":0},"containers":[{"command":["/bin/bash"],"image":"docker.io/bitnami/etcd:latest","name":"etcdctl","stdin":true,"stdinOnce":true,"tty":true,"volumeMounts":[{"mountPath":"/var/lib/rancher","name":"var-lib-rancher"}]}],"volumes":[{"name":"var-lib-rancher","hostPath":{"path":"/var/lib/rancher","type":"Directory"}}]}}'

In the resulting shell, run the following command:

./bin/etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove master-03-e2ec81cc
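If member remove complains about the member name, list the members first and remove by the hexadecimal member ID instead (the peer logs above suggest the stale master-03 entry is 787a9a69bcefecea):

./bin/etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member list

./bin/etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove 787a9a69bcefecea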

@MalibuKoKo

I have the same issue but in a different environment:

  • arch: arm64
  • version: v1.19.4+k3s

I used the rancher/coreos-etcd image from Docker Hub: https://hub.docker.com/r/rancher/coreos-etcd/tags?page=1&ordering=last_updated

Currently the latest tag for my arch is v3.4.13-arm64.

I added a nodeSelector rule so that the pod is deployed on a working node with the etcd role:

kubectl run --rm --tty --stdin --image docker.io/rancher/coreos-etcd:v3.4.13-arm64 etcdctl --overrides='{"apiVersion":"v1","kind":"Pod","spec":{"hostNetwork":true,"restartPolicy":"Never","securityContext":{"runAsUser":0,"runAsGroup":0},"containers":[{"command":["/bin/sh"],"image":"docker.io/rancher/coreos-etcd:v3.4.13-arm64","name":"etcdctl","stdin":true,"stdinOnce":true,"tty":true,"volumeMounts":[{"mountPath":"/var/lib/rancher","name":"var-lib-rancher"}]}],"volumes":[{"name":"var-lib-rancher","hostPath":{"path":"/var/lib/rancher","type":"Directory"}}],"nodeSelector":{"node-role.kubernetes.io/etcd":"true"}}}'

You can list the etcd members:

etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member list

You can remove the stale member:

etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove 1234567890ABCDEF

Then uninstall k3s on the node you want to re-add:

k3s-uninstall.sh

Finally, add it back into your cluster. It works!
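For the re-add, joining the reinstalled node back as an etcd/control-plane server is roughly the standard k3s HA install command (token and server address below are placeholders):

# <token> comes from /var/lib/rancher/k3s/server/node-token on a surviving server
curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - server --server https://<healthy-master>:6443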

@jrote1
Author

jrote1 commented Dec 22, 2020

I have the log files for the day it broke, but they are too big for GitHub. Any preference on where I should upload them?

@brandond
Member

brandond commented Dec 22, 2020

@jrote1 can you share from Google Drive or Dropbox perhaps? I'm assuming they're too large even after compression?

@brandond
Member

Thanks. You should see an access request from my (personal) gmail account.

@jrote1
Author

jrote1 commented Dec 23, 2020

Already accepted @brandond

@brandond
Member

brandond commented Dec 23, 2020

Thanks for the logs! FWIW I think they might have been small enough for GitHub if gzip'd.

There's a lot of this in there, which isn't a good sign:

Dec 21 00:00:13 master-01 k3s[4000764]: {"level":"warn","ts":"2020-12-21T00:00:13.000Z","caller":"etcdserver/raft.go:390","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"b51f0190e848738d","heartbeat-interval":"500ms","expected-duration":"1s","exceeded-duration":"1.672987815s"}
Dec 21 00:00:13 master-01 k3s[4000764]: {"level":"warn","ts":"2020-12-21T00:00:13.000Z","caller":"etcdserver/raft.go:390","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"787a9a69bcefecea","heartbeat-interval":"500ms","expected-duration":"1s","exceeded-duration":"1.673190469s"}

Dec 21 00:00:13 master-01 k3s[4000764]: {"level":"warn","ts":"2020-12-21T00:00:13.117Z","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"9.972996125s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/services/endpoints/kube-system/kube-scheduler\" ","response":"","error":"context canceled"}
Dec 21 00:00:13 master-01 k3s[4000764]: {"level":"info","ts":"2020-12-21T00:00:13.119Z","caller":"traceutil/trace.go:145","msg":"trace[1873224372] range","detail":"{range_begin:/registry/services/endpoints/kube-system/kube-scheduler; range_end:; }","duration":"9.974944152s","start":"2020-12-21T00:00:03.144Z","end":"2020-12-21T00:00:13.119Z","steps":["trace[1873224372] 'agreement among raft nodes before linearized reading'  (duration: 9.972988738s)"]}

It looks like k3s has been crashing because etcd can't keep up with controller leader election requests:

Dec 21 00:02:06 master-01 k3s[4000764]: E1221 00:02:06.880465 4000764 leaderelection.go:361] Failed to update lock: Put "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager": context deadline exceeded
Dec 21 00:02:06 master-01 k3s[4000764]: E1221 00:02:06.880565 4000764 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
Dec 21 00:02:06 master-01 k3s[4000764]: I1221 00:02:06.884716 4000764 leaderelection.go:278] failed to renew lease kube-system/cloud-controller-manager: timed out waiting for the condition
Dec 21 00:02:06 master-01 k3s[4000764]: F1221 00:02:06.886574 4000764 controllermanager.go:220] leaderelection lost

Are you running Longhorn and etcd on the same disk? Etcd is VERY touchy about fsync latency and doesn't really like to have too much else sharing IO with it.
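If you want to confirm whether disk latency is the culprit, etcd exposes its fsync timings as Prometheus metrics; a quick way to pull them on a k3s server (assuming the default client port 2379 on localhost and the same client certs used above):

# etcd's guidance is that the 99th percentile of wal_fsync_duration_seconds should
# stay under roughly 10ms; sustained higher values mean the disk can't keep up.
curl -s --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt https://127.0.0.1:2379/metrics | grep -E 'wal_fsync_duration_seconds|backend_commit_duration_seconds'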

I don't see the etcd controller attempting to remove the member from the cluster though, which is odd. It only logs the removal if it can find it in the cluster and is attempting to remove it, which isn't super helpful. I'm about to rewrite some of this and will add some more logging when I do so that we can hopefully figure out what's causing this if it happens again.

@jrote1
Author

jrote1 commented Dec 23, 2020

I am currently running etcd on the same disk as Longhorn. I plan to move Longhorn onto a separate disk, but I have not bought the disks yet.
Let me double-check a bit later to make sure the time frame of when I removed the node is covered by the logs.

@brandond
Member

I do see the node removal event from a different controller, but the etcd controller doesn't appear to be handling it:

Dec 21 14:24:27 master-02 k3s[2448291]: I1221 14:24:27.230616 2448291 event.go:291] "Event occurred" object="master-03" kind="Node" apiVersion="v1" type="Normal" reason="RemovingNode" message="Node master-03 event: Removing Node master-03 from Controller"

@xiaods
Contributor

xiaods commented Feb 4, 2021

I can reproduce the problem reliably: xiaods/k8e#36
It seems to be a k8s controller-manager issue.

@xiaods
Contributor

xiaods commented Feb 5, 2021

Any suggestions or updates?

@brandond
Member

brandond commented Feb 5, 2021

Waiting on another PR to land and then I'm going to rework etcd cluster membership cleanup.

@xiaods
Contributor

xiaods commented Mar 9, 2021

@brandond I have reviewed the latest commit on the 1.19-release branch; it seems to resolve this case.

@xiaods
Contributor

xiaods commented Mar 11, 2021

Confirmed, it's resolved.

@xiaods
Contributor

xiaods commented Mar 11, 2021

@jrote1 please double-check.

@mindkeep

mindkeep commented Apr 6, 2021

Not sure if this should be a new case or not, but I'm able to pretty readily break embedded etcd with the following sequence:
% k3d version
k3d version v4.4.1
k3s version v1.20.5-k3s1 (default)
% cat break-k3d.sh
#!/bin/bash
set -x
k3d cluster create --servers 3 --agents 3 test-k3d
docker stop k3d-test-k3d-server-0
docker stop k3d-test-k3d-agent-1
docker start k3d-test-k3d-agent-1
docker start k3d-test-k3d-server-0
sleep 120
docker logs k3d-test-k3d-server-0
k3d cluster delete test-k3d
%

where we wind up with this message repeated:

time="2021-04-06T21:20:30.180484636Z" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [k3d-test-k3d-server-1-78ae1551=https://172.20.0.3:2380 k3d-test-k3d-server-0-3e94d5e1=https://172.20.0.2:2380 k3d-test-k3d-server-2-31d30714=https://172.20.0.4:2380], expect: k3d-test-k3d-server-0-3e94d5e1=172.20.0.6"

Does this look like the same problem? Or should I open a new issue here?

@brandond
Member

brandond commented Apr 6, 2021

@mindkeep That looks like a k3d issue and is not related to what's going on here - the docker container is getting a new address when you stop and start it (k3d-test-k3d-server-0-3e94d5e1=https://172.20.0.2:2380 vs k3d-test-k3d-server-0-3e94d5e1=172.20.0.6)
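A quick way to see that mismatch from the host (container name taken from the script above):

# Current container IP after the stop/start...
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' k3d-test-k3d-server-0
# ...versus the peer URL etcd still has on record for that member
# (k3d-test-k3d-server-0-3e94d5e1=https://172.20.0.2:2380 in the error above).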

@mindkeep

mindkeep commented Apr 6, 2021

@brandond Ah, that makes sense. I'll bring it up there. Thanks for the quick response.
(k3d-io/k3d#550)

@stale

stale bot commented Oct 4, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
