
Cannot join 3rd server to managed etcd cluster due to unhealthy cluster error from etcd #2533

Closed
brandond opened this issue Nov 14, 2020 · 9 comments
Labels: area/etcd, kind/bug, priority/important-soon

@brandond (Member)
Environmental Info:
K3s Version:
k3s v1.19.3+k3s3 (0e4fbfe)

Node(s) CPU architecture, OS, and Version:
Linux rnd-cloud1-master3 5.4.0-53-generic #59-Ubuntu SMP Wed Oct 21 09:38:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
3 managed etcd servers with agents

Describe the bug:
Cannot join the 3rd server to the cluster due to an "unhealthy cluster" error

Steps To Reproduce:
Unknown - the install somehow failed on the 3rd node; etcd says it is in the cluster, but Kubernetes does not.

Expected behavior:
All servers join the cluster

Actual behavior:

root@rnd-cloud1-master1:~# kubectl get nodes
NAME                 STATUS                     ROLES         AGE   VERSION
k3s-node1            Ready                      <none>        11h   v1.19.3+k3s3
k3s-node2            Ready                      <none>        11h   v1.19.3+k3s3
k3s-node3            Ready                      <none>        11h   v1.19.3+k3s3
k3s-node4            Ready                      <none>        11h   v1.19.3+k3s3
rnd-cloud1-master1   Ready,SchedulingDisabled   etcd,master   12h   v1.19.3+k3s3
rnd-cloud1-master2   Ready                      etcd,master   12h   v1.19.3+k3s3

The first and second masters were created successfully, but the third was not. I tried creating it on several VMs, and it fails with the same error every time:

Nov 14 11:06:24 rnd-cloud1-master3 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 7364.
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: time="2020-11-14T11:06:24.388671557+03:00" level=info msg="Starting k3s v1.19.3+k3s3 (0e4fbfef)"
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: time="2020-11-14T11:06:24.469233538+03:00" level=info msg="Managed etcd cluster not yet initialized"
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: time="2020-11-14T11:06:24.496182490+03:00" level=info msg="Adding https://172.16.2.32:2380 to etcd cluster [rnd-cloud1-master2-1d7a7cff=https://172.20.1.1:2380 rnd-cloud1-master3-8daba003=https://172.16.2.13:2380 rnd-cloud1-master1-7284ec75=https://172.16.2.65:2380]"
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: {"level":"warn","ts":"2020-11-14T11:06:24.497+0300","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-35e8bba2-6b57-4368-ab00-c1d65bc89957/172.20.1.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
Nov 14 11:06:24 rnd-cloud1-master3 k3s[225167]: time="2020-11-14T11:06:24.498041869+03:00" level=fatal msg="starting kubernetes: preparing server: start managed database: joining etcd cluster: etcdserver: unhealthy cluster"
Nov 14 11:06:24 rnd-cloud1-master3 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Nov 14 11:06:24 rnd-cloud1-master3 systemd[1]: k3s.service: Failed with result 'exit-code'.

Additional context / logs:
This is taken from https://rancher-users.slack.com/archives/CGGQEHPPW/p1605341375146500?thread_ts=1605302190.133700&cid=CGGQEHPPW

It looks like etcd currently thinks it has a 3-node cluster with an offline member. Adding a 4th node would break the quorum math, so another member cannot be added until all three nodes are online, or the offline member has been deleted.
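If you end up in this state, the offline member can be deleted manually with etcdctl so that the new server can join. A rough sketch (k3s does not bundle etcdctl, so you'd need to install it separately; the endpoint and certificate paths below assume a default k3s server install and may differ on yours):

# List the members as etcd sees them; the offline one will still appear here
# even though "kubectl get nodes" does not show it.
# NOTE: cert paths are an assumption for a default k3s install -- adjust as needed.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list

# Remove the offline member using the hex ID from the first column of the
# output above (the ID, not the name), then retry joining the new server.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member remove <MEMBER_ID>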

@adi90x commented Dec 2, 2020

Same issue here; it seems to be random, as sometimes it doesn't happen (I reproduce it using the same Ansible script).

@brandond (Member, Author) commented Dec 3, 2020

@adi90x can you share any logs from when you've had this occur? In particular, k3s service logs from both the first server and the server that failed to join would be immensely helpful.

@adi90x commented Dec 6, 2020

Hello,

Below are logs from the node trying to join (using -v 10):

Dec 06 19:13:45 new-server.dev systemd[1]: Starting Lightweight Kubernetes...
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381106   26381 interface.go:400] Looking for default routes with IPv4 addresses
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381182   26381 interface.go:408] Default route transits interface "ens18"
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381340   26381 interface.go:208] Interface ens18 is up
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381443   26381 interface.go:256] Interface "ens18" has 2 addresses :[new-server-ip/22 new-server-ipv6/64].
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381476   26381 interface.go:223] Checking addr  new-server-ip/22.
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381494   26381 interface.go:230] IP found new-server-ip
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381514   26381 interface.go:262] Found valid IPv4 address new-server-ip for interface "ens18".
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381529   26381 interface.go:414] Found active IP new-server-ip
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.381567   26381 services.go:51] Setting service IP to "10.63.0.1" (read-write).
Dec 06 19:13:46 new-server.dev k3s[26381]: time="2020-12-06T19:13:46.381606378Z" level=info msg="Starting k3s v1.19.4+k3s1 (2532c10f)"
Dec 06 19:13:46 new-server.dev k3s[26381]: time="2020-12-06T19:13:46.493264982Z" level=info msg="Managed etcd cluster not yet initialized"
Dec 06 19:13:46 new-server.dev k3s[26381]: I1206 19:13:46.517010   26381 services.go:51] Setting service IP to "10.63.0.1" (read-write).
Dec 06 19:13:46 new-server.dev k3s[26381]: time="2020-12-06T19:13:46.555782139Z" level=info msg="Adding https://new-server-ip:2380 to etcd cluster [OtherworkingNode-15bcc285=https://working-node-ip:2380 OtherworkingNode2=https://192.168.0.25:2380]"
Dec 06 19:13:46 new-server.dev k3s[26381]: {"level":"warn","ts":"2020-12-06T19:13:46.560Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e4b1042f-f0d5-406f-aef2-1d89d2df1a10/working-node-ip:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
Dec 06 19:13:46 new-server.dev k3s[26381]: time="2020-12-06T19:13:46.560381600Z" level=fatal msg="starting kubernetes: preparing server: start managed database: joining etcd cluster: etcdserver: unhealthy cluster"
Dec 06 19:13:46 new-server.dev systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Dec 06 19:13:46 new-server.dev systemd[1]: k3s.service: Failed with result 'exit-code'.
Dec 06 19:13:46 new-server.dev systemd[1]: Failed to start Lightweight Kubernetes.

Hope this helps, but there is nothing in it that seems interesting...

And from the working server, which may be more interesting:

déc. 06 19:25:52 WorkingServer k3s[1016624]: {"level":"warn","ts":"2020-12-06T19:25:52.373Z","caller":"etcdserver/server.go:1638","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"7dfd3e19622f2f72","requested-member-add":"{ID:99c960a10157bc99 RaftAttributes:{PeerURLs:[https://new-server-ip:2380] IsLearner:true} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}

So I guess it is because one of my servers is publishing its internal IP as the etcd IP?
Is there any way to force a node to publish the external IP instead of the internal one?

Regards,

@brandond (Member, Author) commented Dec 7, 2020

At this point it is expected that all etcd servers can reach each other at their private IP addresses.
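To confirm that is actually the case, you can check each member's client endpoint health, and raw TCP reachability of the peer port, from every server. A rough sketch using the addresses from your log above (same etcdctl and certificate-path assumptions as earlier; substitute your real private IPs):

# Health of each member's client endpoint (port 2379), run from any one server.
# Cert paths assume a default k3s install.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://working-node-ip:2379,https://192.168.0.25:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health

# Raw TCP reachability of the etcd peer port (2380), run from every server
# against every other server:
nc -zv working-node-ip 2380
nc -zv 192.168.0.25 2380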

@adi90x commented Dec 18, 2020

Any update on this issue? Did anyone find a way to make etcd listen on the public IP?

@brandond (Member, Author) commented Dec 19, 2020

@adi90x this is not currently possible. It only advertises the private address.

More specific to this issue: the member maintenance function should probably check for unhealthy (unreachable) fully-promoted etcd cluster members, and remove them from the cluster if there is no corresponding Kubernetes cluster member.
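As a manual stopgap, you can diff etcd's member list against the Kubernetes node list yourself; k3s names members <hostname>-<hex suffix>, so stripping the suffix gives the node name. A rough sketch (same etcdctl and cert-path assumptions as above):

#!/bin/sh
# Member names as etcd sees them (field 3 of "member list" simple output:
# id, status, name, peerURLs, clientURLs, isLearner).
members=$(ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list | awk -F', ' '{print $3}')

# Node names as Kubernetes sees them.
nodes=$(kubectl get nodes -o name | sed 's|^node/||')

# Any member whose name (minus the -<suffix> k3s appends) matches no
# Kubernetes node is a candidate for "member remove".
for m in $members; do
  echo "$nodes" | grep -qx "${m%-*}" || echo "stale etcd member: $m"
done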

@caroline-suse-rancher (Contributor)
Closing due to age - can reopen if the issue re-emerges

@dberardo-com's comment was marked as resolved.

@brandond (Member, Author) commented Jun 5, 2024

@dberardo-com open a new issue and fill out the issue template.

k3s-io locked and limited conversation to collaborators on Jun 5, 2024.