Cannot join 3rd server managed etcd cluster due to `unhealthy cluster` error from etcd #2533

brandond · 2020-11-14T10:51:42Z

Environmental Info:
K3s Version:
k3s v1.19.3+k3s3 (0e4fbfe)

Node(s) CPU architecture, OS, and Version:
Linux rnd-cloud1-master3 5.4.0-53-generic #59-Ubuntu SMP Wed Oct 21 09:38:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
3 managed etcd servers with agents

Describe the bug:
Cannot join 3rd server to cluster due to unhealthy cluster error

Steps To Reproduce:
Unknown - somehow failed to install on the 3rd node; etcd says it's in the cluster but Kubernetes does not.

Expected behavior:
All servers join the cluster

Actual behavior:

root@rnd-cloud1-master1:~# kubectl get nodes
NAME                 STATUS                     ROLES         AGE   VERSION
k3s-node1            Ready                      <none>        11h   v1.19.3+k3s3
k3s-node2            Ready                      <none>        11h   v1.19.3+k3s3
k3s-node3            Ready                      <none>        11h   v1.19.3+k3s3
k3s-node4            Ready                      <none>        11h   v1.19.3+k3s3
rnd-cloud1-master1   Ready,SchedulingDisabled   etcd,master   12h   v1.19.3+k3s3
rnd-cloud1-master2   Ready                      etcd,master   12h   v1.19.3+k3s3

The first and second masters were created successfully, but the third was not. I tried creating it on several VMs, but each attempt failed with the same error:

Nov 14 11:06:24 rnd-cloud1-master3 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 7364.
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: time="2020-11-14T11:06:24.388671557+03:00" level=info msg="Starting k3s v1.19.3+k3s3 (0e4fbfef)"
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: time="2020-11-14T11:06:24.469233538+03:00" level=info msg="Managed etcd cluster not yet initialized"
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: time="2020-11-14T11:06:24.496182490+03:00" level=info msg="Adding https://172.16.2.32:2380 to etcd cluster [rnd-cloud1-master2-1d7a7cff=https://172.20.1.1:2380 rnd-cloud1-master3-8daba003=https://172.16.2.13:2380 rnd-cloud1-master1-7284ec75=https://172.16.2.65:2380]"
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: {"level":"warn","ts":"2020-11-14T11:06:24.497+0300","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-35e8bba2-6b57-4368-ab00-c1d65bc89957/172.20.1.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: time="2020-11-14T11:06:24.498041869+03:00" level=fatal msg="starting kubernetes: preparing server: start managed database: joining etcd cluster: etcdserver: unhealthy cluster"
Nov 14 11:06:24 rnd-cloud1-master3 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Nov 14 11:06:24 rnd-cloud1-master3 systemd[1]: k3s.service: Failed with result 'exit-code'.

Additional context / logs:
This is taken from https://rancher-users.slack.com/archives/CGGQEHPPW/p1605341375146500?thread_ts=1605302190.133700&cid=CGGQEHPPW

It looks like etcd currently thinks it has a 3-node cluster with an offline member. Adding a 4th node would break the quorum math, so another member cannot be added until all three nodes are online, or the offline member has been deleted.
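
The quorum arithmetic behind that statement can be sketched as follows. This is a simplified illustration of Raft majority math only, not etcd's actual admission check; the function names `quorum` and `has_quorum` are made up for this example.

```python
def quorum(voting_members: int) -> int:
    """Raft majority: strictly more than half of the voting members."""
    return voting_members // 2 + 1


def has_quorum(voting_members: int, reachable: int) -> bool:
    """True if enough members are reachable to form a majority."""
    return reachable >= quorum(voting_members)


# State described above: 3 members, 1 of them offline.
print(quorum(3), has_quorum(3, 2))  # 2 True

# Adding a 4th voting member that has not started yet raises the
# majority to 3 while only 2 members are actually reachable.
print(quorum(4), has_quorum(4, 2))  # 3 False
```

With three members and one offline the cluster still has a majority, but it has no headroom left, which is why the add request is refused until the offline member is back or removed.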

adi90x · 2020-12-02T10:57:13Z

Same issue here. It seems to be random, as sometimes it doesn't happen (reproducing it with the same Ansible script).

brandond · 2020-12-03T06:20:10Z

@adi90x can you share any logs from when you've had this occur? In particular if you had k3s service logs from both the first server and the server that failed to join, that would be immensely helpful.

adi90x · 2020-12-06T19:23:18Z

Hello,

Below are the logs from the node trying to join (using -v 10):

déc. 06 19:13:45 new-server.dev systemd[1]: Starting Lightweight Kubernetes...
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381106   26381 interface.go:400] Looking for default routes with IPv4 addresses
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381182   26381 interface.go:408] Default route transits interface "ens18"
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381340   26381 interface.go:208] Interface ens18 is up
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381443   26381 interface.go:256] Interface "ens18" has 2 addresses :[new-server-ip/22 new-server-ipv6/64].
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381476   26381 interface.go:223] Checking addr  new-server-ip/22.
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381494   26381 interface.go:230] IP found new-server-ip
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381514   26381 interface.go:262] Found valid IPv4 address new-server-ip for interface "ens18".
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381529   26381 interface.go:414] Found active IP new-server-ip
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381567   26381 services.go:51] Setting service IP to "10.63.0.1" (read-write).
déc. 06 19:13:46 new-server.dev k3s[26381]: time="2020-12-06T19:13:46.381606378Z" level=info msg="Starting k3s v1.19.4+k3s1 (2532c10f)"
déc. 06 19:13:46 new-server.dev k3s[26381]: time="2020-12-06T19:13:46.493264982Z" level=info msg="Managed etcd cluster not yet initialized"
déc. 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.517010   26381 services.go:51] Setting service IP to "10.63.0.1" (read-write).
déc. 06 19:13:46 new-server.dev k3s[26381]: time="2020-12-06T19:13:46.555782139Z" level=info msg="Adding https://new-server-ip:2380 to etcd cluster [OtherworkingNode-15bcc285=https://working-node-ip:2380 OtherworkingNode2=https://192.168.0.25:2380]"
déc. 06 19:13:46 new-server.dev k3s[26381]: {"level":"warn","ts":"2020-12-06T19:13:46.560Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e4b1042f-f0d5-406f-aef2-1d89d2df1a10/working-node-ip:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
déc. 06 19:13:46 new-server.dev k3s[26381]: time="2020-12-06T19:13:46.560381600Z" level=fatal msg="starting kubernetes: preparing server: start managed database: joining etcd cluster: etcdserver: unhealthy cluster"
déc. 06 19:13:46 new-server.dev systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
déc. 06 19:13:46 new-server.dev systemd[1]: k3s.service: Failed with result 'exit-code'.
déc. 06 19:13:46 new-server.dev systemd[1]: Failed to start Lightweight Kubernetes.

I hope this helps, but nothing in it seems interesting.

The log from the working server may be more interesting:

déc. 06 19:25:52 WorkingServer k3s[1016624]: {"level":"warn","ts":"2020-12-06T19:25:52.373Z","caller":"etcdserver/server.go:1638","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"7dfd3e19622f2f72","requested-member-add":"{ID:99c960a10157bc99 RaftAttributes:{PeerURLs:[https://new-server-ip:2380] IsLearner:true} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}

So I guess it is because one of my servers is publishing its internal IP as the etcd IP?
Is there any way to force a node to publish the external IP instead of the internal one?

Regards,

brandond · 2020-12-07T18:24:56Z

At this point it is expected that all etcd servers can reach each other at their private IP addresses.
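
One way to check that expectation is to pull the advertised peer URLs out of the "Adding ... to etcd cluster [...]" log message and try a plain TCP connection to each one from every server. The following is a rough sketch, assuming the message keeps the `name=url` format seen in the logs in this thread; `parse_members` and `can_connect` are hypothetical helpers, not part of k3s or etcd, and a successful TCP connect says nothing about TLS or etcd health.

```python
from __future__ import annotations

import re
import socket
from urllib.parse import urlparse

# Matches the bracketed member list in the k3s log message.
MEMBER_LIST = re.compile(r"to etcd cluster \[(?P<members>[^\]]*)\]")


def parse_members(log_message: str) -> dict[str, str]:
    """Return {member_name: peer_url} parsed from the log message."""
    match = MEMBER_LIST.search(log_message)
    if match is None:
        return {}
    members = {}
    for entry in match.group("members").split():
        name, _, url = entry.partition("=")
        members[name] = url
    return members


def can_connect(peer_url: str, timeout: float = 2.0) -> bool:
    """True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(peer_url)
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False


# Member list copied from the first log excerpt in this issue.
message = (
    "Adding https://172.16.2.32:2380 to etcd cluster "
    "[rnd-cloud1-master2-1d7a7cff=https://172.20.1.1:2380 "
    "rnd-cloud1-master3-8daba003=https://172.16.2.13:2380 "
    "rnd-cloud1-master1-7284ec75=https://172.16.2.65:2380]"
)
for name, url in parse_members(message).items():
    print(name, url)
```

Running `can_connect(url)` for each parsed URL on each server would show which advertised addresses are not reachable from that host. In the excerpt above, one member advertises a 172.20.x.x address while the others use 172.16.2.x, which is the kind of mismatch worth testing first.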

adi90x · 2020-12-18T12:53:31Z

Any update on this issue? Did anyone find a way for etcd to listen on a public IP?

brandond · 2020-12-19T00:28:03Z

@adi90x this is not currently possible. It only advertises the private address.

More specific to this issue in particular, the member maintenance function should probably check for unhealthy (unreachable) fully-promoted etcd cluster members, and remove them from the cluster if there is not a corresponding Kubernetes cluster member.
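
The selection step of that idea could look roughly like this. It is only a sketch of the proposal, not k3s code: the `Member` type and `stale_members` function are hypothetical, and it assumes etcd member names have the form `<node-name>-<8 hex chars>` as they appear in the logs above.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumption: member names end in "-" plus 8 hex characters, as in the logs.
SUFFIX = re.compile(r"-[0-9a-f]{8}$")


@dataclass(frozen=True)
class Member:
    name: str
    reachable: bool
    is_learner: bool = False


def node_name(member_name: str) -> str:
    """Strip the random suffix to recover the Kubernetes node name."""
    return SUFFIX.sub("", member_name)


def stale_members(members: list[Member], kube_nodes: set[str]) -> list[str]:
    """Names of fully-promoted, unreachable members with no matching node."""
    return sorted(
        m.name
        for m in members
        if not m.is_learner
        and not m.reachable
        and node_name(m.name) not in kube_nodes
    )


# Data from the original report; which member is unreachable is an assumption.
members = [
    Member("rnd-cloud1-master1-7284ec75", reachable=True),
    Member("rnd-cloud1-master2-1d7a7cff", reachable=True),
    Member("rnd-cloud1-master3-8daba003", reachable=False),
]
nodes = {"rnd-cloud1-master1", "rnd-cloud1-master2"}
print(stale_members(members, nodes))  # ['rnd-cloud1-master3-8daba003']
```

Requiring both conditions (unreachable and absent from Kubernetes) keeps the check from removing a member that is only temporarily down but still registered as a node.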

caroline-suse-rancher · 2023-02-22T15:13:29Z

Closing due to age - can reopen if the issue re-emerges

brandond · 2024-06-05T20:24:30Z

@dberardo-com open a new issue and fill out the issue template.

brandond added the area/etcd, kind/bug, and priority/backlog labels Dec 19, 2020

brandond self-assigned this Dec 19, 2020

brandond mentioned this issue Dec 21, 2020

Unable to add a node after removing a failed node using embedded etcd #2732

Closed

brandond added the [zube]: Next Up and priority/important-soon labels and removed the priority/backlog label Dec 21, 2020

mfrister mentioned this issue Jan 7, 2021

Node isn't deleted from kube after deleted in panel or through cli hetznercloud/hcloud-cloud-controller-manager#142

Closed

fapatel1 added this to the v1.22 - Backlog milestone Jul 29, 2021

caroline-suse-rancher modified the milestones: v1.22 - Backlog, Backlog Nov 14, 2022

toabi mentioned this issue Jan 3, 2023

Getting container logs has TLS issues, reconfiguration breaks etcd, situation unclear #6679

Closed

caroline-suse-rancher closed this as completed Feb 22, 2023

k3s-io locked and limited conversation to collaborators Jun 5, 2024
