Unable to add a node after removing a failed node using embedded etcd #2732
This seems to be related to the issue discussed here: #2533 (comment)
@jrote are you able to collect logs from the two remaining servers for the time period in which you removed the failed node from the cluster? Nodes are supposed to be removed from the etcd cluster when they are removed from the Kubernetes cluster, so I'm curious to see if there are any logs showing how that failed.
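(For anyone needing to collect those logs: a minimal sketch, assuming k3s was installed as a systemd service via the standard install script; the host names and time window below are placeholders, not values from this issue.)

```bash
# Pull k3s service logs from each remaining server for the window in which
# the failed node was removed, and save them locally for sharing.
for host in master-01 master-02; do
  ssh "$host" "journalctl -u k3s --since '2020-12-01' --until '2020-12-02'" \
    > "k3s-${host}.log"
done
gzip k3s-*.log   # compress before uploading
```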
As a temporary workaround, you should be able to do the following on one of the working nodes:

kubectl run --rm --tty --stdin --image docker.io/bitnami/etcd:latest etcdctl --overrides='{"apiVersion":"v1","kind":"Pod","spec":{"hostNetwork":true,"restartPolicy":"Never","securityContext":{"runAsUser":0,"runAsGroup":0},"containers":[{"command":["/bin/bash"],"image":"docker.io/bitnami/etcd:latest","name":"etcdctl","stdin":true,"stdinOnce":true,"tty":true,"volumeMounts":[{"mountPath":"/var/lib/rancher","name":"var-lib-rancher"}]}],"volumes":[{"name":"var-lib-rancher","hostPath":{"path":"/var/lib/rancher","type":"Directory"}}]}}'

In the resulting shell, run the following command:

./bin/etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove master-03-e2ec81cc
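(If the failed node's member name or ID isn't known, a sketch of how one might look it up from the same debug shell before removing it; the certificate paths are the ones used above, and <member-id> is a placeholder for the hex ID printed by `member list`.)

```bash
# List current members to find the failed node's ID, then remove it and
# list again to confirm it is gone.
./bin/etcdctl \
  --key /var/lib/rancher/k3s/server/tls/etcd/client.key \
  --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  member list

./bin/etcdctl \
  --key /var/lib/rancher/k3s/server/tls/etcd/client.key \
  --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  member remove <member-id>
```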
I have the same issue but in a different environment:
I used the rancher/coreos-etcd image from Docker Hub: https://hub.docker.com/r/rancher/coreos-etcd/tags?page=1&ordering=last_updated (the latest tag for my architecture is v3.4.13-arm64). I added a nodeSelector rule so the pod is scheduled on a working node with the etcd role:

kubectl run --rm --tty --stdin --image docker.io/rancher/coreos-etcd:v3.4.13-arm64 etcdctl --overrides='{"apiVersion":"v1","kind":"Pod","spec":{"hostNetwork":true,"restartPolicy":"Never","securityContext":{"runAsUser":0,"runAsGroup":0},"containers":[{"command":["/bin/sh"],"image":"docker.io/rancher/coreos-etcd:v3.4.13-arm64","name":"etcdctl","stdin":true,"stdinOnce":true,"tty":true,"volumeMounts":[{"mountPath":"/var/lib/rancher","name":"var-lib-rancher"}]}],"volumes":[{"name":"var-lib-rancher","hostPath":{"path":"/var/lib/rancher","type":"Directory"}}],"nodeSelector":{"node-role.kubernetes.io/etcd":"true"}}}'

You can list the etcd members:

etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member list

You can remove the failed member:

etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove 1234567890ABCDEF

Then uninstall k3s on the failed node with k3s-uninstall.sh and add it to your cluster again.
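(For the final step of adding the node back after k3s-uninstall.sh, a minimal sketch of the usual server re-join command, assuming the standard install script and embedded etcd; the server address and token are placeholders you would take from an existing server.)

```bash
# Re-join the cleaned-up node to the cluster as a server (control-plane/etcd).
# The token can be read from /var/lib/rancher/k3s/server/node-token on an
# existing server; point the URL at a healthy server.
curl -sfL https://get.k3s.io | K3S_TOKEN=<cluster-token> sh -s - server \
  --server https://<existing-server>:6443
```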
I have the log files for the day it broke, but they are too big for GitHub. Any preference as to where I should upload them?
@jrote1 can you share from Google Drive or Dropbox perhaps? I'm assuming they're too large even after compression?
Thanks. You should see an access request from my (personal) gmail account.
Already accepted @brandond
Thanks for the logs! FWIW I think they might have been small enough for GitHub if gzipped. There's a lot of this in there, which isn't a good sign:
It looks like k3s has been crashing because etcd can't keep up with controller leader election requests:
Are you running Longhorn and etcd on the same disk? Etcd is VERY touchy about fsync latency and doesn't really like to have too much else sharing IO with it. I don't see the etcd controller attempting to remove the member from the cluster though, which is odd. It only logs the removal if it can find it in the cluster and is attempting to remove it, which isn't super helpful. I'm about to rewrite some of this and will add some more logging when I do, so that we can hopefully figure out what's causing this if it happens again.
I am currently running etcd on the same disk as Longhorn. I do have plans to move Longhorn onto a separate disk, but I have not bought the disks yet.
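(A sketch for checking whether the shared disk is keeping up with etcd, using etcdctl's built-in performance check from the same kind of debug shell as above; the certificate paths match the earlier commands, and the endpoint assumes the embedded etcd client port on the local node. Note that `check perf` writes test keys, so it is best run during a quiet period.)

```bash
# Run etcd's built-in performance check; it reports whether commit/fsync
# latency and throughput are within acceptable bounds for this member.
etcdctl \
  --key /var/lib/rancher/k3s/server/tls/etcd/client.key \
  --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --endpoints https://127.0.0.1:2379 \
  check perf
```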
I do see the node removal event from a different controller, but the etcd controller doesn't appear to be handling it:
I can reproduce the problem reliably: xiaods/k8e#36
Any suggestions or updates?
Waiting on another PR to land and then I'm going to rework etcd cluster membership cleanup.
@brandond I have reviewed the latest commit on the 1.19-release branch; it seems to resolve this case.
Confirmed, it's resolved.
@jrote1 please double-check.
Not sure if this should be a new case or not, but I'm able to pretty readily break embedded etcd with the following sequence, where we wind up with this message repeated:

time="2021-04-06T21:20:30.180484636Z" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [k3d-test-k3d-server-1-78ae1551=https://172.20.0.3:2380 k3d-test-k3d-server-0-3e94d5e1=https://172.20.0.2:2380 k3d-test-k3d-server-2-31d30714=https://172.20.0.4:2380], expect: k3d-test-k3d-server-0-3e94d5e1=172.20.0.6"

Does this look like the same problem, or should I open a new issue here?
@mindkeep That looks like a k3d issue and is not related to what's going on here: the Docker container is getting a new address when you stop and start it.
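(For anyone verifying this on the k3d side, a quick sketch for comparing the server container's IP before and after a stop/start; the container name is a placeholder based on the node names in the log above.)

```bash
# Print the container's current IP on its Docker network; run once before
# stopping the container and again after starting it to see if it changed.
docker inspect \
  -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' \
  k3d-test-k3d-server-0
```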
@brandond Ah, that makes sense. I'll bring it up there. Thanks for the quick response.
Environmental Info:
K3s Version: k3s version v1.19.5+k3s1 (b11612e)
Node(s) CPU architecture, OS, and Version:
Linux master-02 5.4.0-56-generic #62-Ubuntu SMP Mon Nov 23 19:20:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
Originally there were 3 masters; one died and was removed. The cluster is now running 2 masters, but a 3rd master cannot be re-added.
Describe the bug:
When trying to add the 3rd master, I am getting the following error:
Strangely, it looks like the cluster is still trying to communicate with the dead node, even though it no longer appears under the cluster's nodes.
Steps To Reproduce: