Nodes aren't joining a new cluster when using an external DB #3226

Closed
rudimk opened this issue Apr 23, 2021 · 16 comments
Labels: kind/bug, kind/internal, priority/important-soon

@rudimk

rudimk commented Apr 23, 2021

Environmental Info:
K3s Version:
v1.20.6+k3s1 (8d04328)

Node(s) CPU architecture, OS, and Version:
Linux leankube-master-2 5.4.0-1038-aws #40-Ubuntu SMP Fri Feb 5 23:50:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
Just two servers for running a Rancher cluster. No agents.

Describe the bug:
When one spins up multiple server nodes with an external DB backend, the first node starts up okay, but the rest don't join the cluster. This is essentially the same behaviour as #3130.

Steps To Reproduce:

  • Installed K3s: downloaded the installer script from https://get.k3s.io and ran it: ./k3sInstaller.sh mysql://user:[email protected]:3306)/rancher" --tls-san Y.Y.Y.Y
  • Repeated on all nodes.

Expected behavior:

All nodes join the cluster.

Actual behavior:

The first node works okay. Other nodes don't join the cluster and throw exceptions about being unable to read/verify certificates.

Additional context / logs:

Apr 23 19:27:17 leankube-master-2 k3s[17118]: time="2021-04-23T19:27:17.429965524Z" level=error msg="Failed to authenticate request from 172.31.7.94:50818: [x509: certificate signed by unknown authority, verifying certificate SN=8450779571907569382, SKID=, AKID=8C:C2:52:8A:37:23:D0:66:80:A4:EE:67:1B:41:21:AC:5A:F0:4D:B0 failed: x509: certificate signed by unknown authority (possibly because of \"x509: ECDSA verification failure\" while trying to verify candidate authority certificate \"k3s-client-ca@1619205961\")]"
Apr 23 19:27:17 leankube-master-2 k3s[17118]: time="2021-04-23T19:27:17.753177982Z" level=info msg="Waiting for control-plane node leankube-master-2 startup: nodes \"leankube-master-2\" not found"
Apr 23 19:27:18 leankube-master-2 k3s[17118]: I0423 19:27:18.248879   17118 kubelet.go:449] kubelet nodes not sync
Apr 23 19:27:18 leankube-master-2 k3s[17118]: time="2021-04-23T19:27:18.377884930Z" level=info msg="Cluster-Http-Server 2021/04/23 19:27:18 http: TLS handshake error from 127.0.0.1:37414: remote error: tls: bad certificate"
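
When comparing logs across nodes, it can help to pull out the CA names that appear in the x509 errors. The following is a minimal sketch, not part of K3s: the helper names `ca_names` and `suffix_as_utc` are hypothetical, and reading the numeric suffix as a Unix timestamp of when the CA was generated is an assumption based only on how the names look in these logs. If two servers report different suffixes for the same CA, that would be consistent with each having generated its own certificates.

```python
import re
from datetime import datetime, timezone

# Matches names such as k3s-client-ca@1619205961 as they appear in the logs above.
CA_NAME = re.compile(r"(k3s-[a-z-]+-ca)@(\d+)")


def ca_names(log_text):
    """Return the sorted, de-duplicated (name, suffix) pairs found in log_text."""
    return sorted({(m.group(1), int(m.group(2))) for m in CA_NAME.finditer(log_text)})


def suffix_as_utc(suffix):
    """Interpret the suffix as Unix seconds (an assumption, not confirmed by the thread)."""
    return datetime.fromtimestamp(suffix, tz=timezone.utc)


line = 'candidate authority certificate "k3s-client-ca@1619205961"'
for name, suffix in ca_names(line):
    print(name, suffix, suffix_as_utc(suffix).isoformat())
# prints: k3s-client-ca 1619205961 2021-04-23T19:26:01+00:00
```

Running the same extraction over the full service logs from each server gives a quick way to see whether they all refer to the same CA.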
@brandond
Member

brandond commented Apr 23, 2021

Can you attach the complete K3s service logs from all the nodes?

Are you by any chance running the installer on nodes that were previously a part of a different K3s cluster and have old cert files left on disk?

@rudimk
Author

rudimk commented Apr 24, 2021

So something weird happened. After posting this issue, I decided to call it a night and get some shut-eye - except the cluster started working on its own.

> kubectl --kubeconfig output/65.2.28.73-kubeconfig.yaml get nodes
NAME                STATUS   ROLES                  AGE     VERSION
leankube-master-2   Ready    control-plane,master   4h41m   v1.20.6+k3s1
leankube-master-1   Ready    control-plane,master   8h      v1.20.6+k3s1

As for your question about running the installer on nodes that previously belonged to a different cluster: I thought of that too, but I get the same error on both recycled and fresh nodes. The first node, leankube-master-1, worked fine, while leankube-master-2 failed to join the cluster for about 4 hours, throwing the error I included above. It has now joined the cluster after that 4-hour gap, and the error it kept throwing is now also being thrown by the first node, even though the first node is still part of the cluster.

One last thing. Not sure if it matters much, but both these nodes are running K3s with the --tls-san flag; I use that to specify the public IP of another VM in the same network that's running HAProxy and simply passes all traffic on *:6443 to the two nodes.

Here are the logs for the two nodes: logs.zip

@rudimk
Author

rudimk commented Apr 29, 2021

Tried another test with the same installer script and K3s version, but on Ubuntu 18.04 this time, and it works just fine. Running the same workflow on Ubuntu 20.04 hits a different issue: using the downloaded kubeconfig results in Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "k3s-server-ca@1619684858"). That is in spite of using the --tls-san flag when spinning up the masters to include the IP of a HAProxy VM configured to forward to port 6443 on the master nodes. On Ubuntu 18.04 this works, with no difference in the installer script, the K3s binary version, or the underlying infrastructure.

Since I'm unable to replicate the original issue now, I'm looking for guidance on whether you'd like to close this issue or look into it further.

@dnoland1
Contributor

@brandond I might be hitting this issue or something similar on k3s 1.20.6. I can give you full logs if you DM me.

@brandond
Member

I am guessing that this is related to some of the certificate bootstrap sequencing changes that went into 1.20.6 for backup-restore, but nothing's jumping out at me.
v1.20.5+k3s1...v1.20.6+k3s1

@brandond brandond added the kind/bug Something isn't working label May 20, 2021
@brandond brandond added this to the v1.20.8+k3s1 milestone May 20, 2021
@brandond
Member

brandond commented May 20, 2021

Just to be clear on the code path here, which I think was pretty likely to have always contained a race condition:

If both servers come up at the same time, they will both find the database empty, create new cluster certificates, and then store that to the database in the bootstrap key. One of them will do so first, the other one will get an error when trying to store the bootstrap data due to the key already existing. This should be a fatal error that causes it to exit, and get restarted by systemd. Due to #3015 it will not re-read the proper certificates from the datastore when it starts up - it will continue on joining the cluster using the different certificates it generated the first time it started, which were never written to the datastore.

It's possible that the improvements for backup/restore made the startup a little faster and therefore more likely to race.
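
The sequence described above can be reproduced with a toy model. This is a sketch under stated assumptions, not K3s code: `Datastore`, `bootstrap`, and `run` are hypothetical names, the "certificates" are plain strings, and the exit-and-restart step is collapsed into a single `reread_on_conflict` flag. A barrier forces the worst case, where both servers have checked the empty datastore before either one writes.

```python
import threading


class Datastore:
    """Toy stand-in for the external DB: a key can be created once, never overwritten."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def create(self, key, value):
        """Return True if this call created the key, False if it already existed."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def bootstrap(store, node, barrier, reread_on_conflict):
    # Step 1: look for existing bootstrap data.
    existing = store.get("bootstrap")
    barrier.wait()  # worst case: both nodes have checked before either writes
    if existing is not None:
        return existing
    # Step 2: nothing found, so generate fresh certificates locally.
    local_certs = f"certs-generated-by-{node}"
    # Step 3: try to store them; only one node can win.
    if store.create("bootstrap", local_certs):
        return local_certs
    # Step 4: the loser either keeps its unsaved certs (the behaviour described
    # above) or discards them and adopts what the winner stored.
    return store.get("bootstrap") if reread_on_conflict else local_certs


def run(reread_on_conflict):
    store, barrier, result = Datastore(), threading.Barrier(2), {}

    def worker(node):
        result[node] = bootstrap(store, node, barrier, reread_on_conflict)

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("server-1", "server-2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


buggy = run(reread_on_conflict=False)
fixed = run(reread_on_conflict=True)
print("without re-read, nodes agree:", len(set(buggy.values())) == 1)  # False
print("with re-read, nodes agree:", len(set(fixed.values())) == 1)     # True
```

In the model, the losing server ends up with certificates that were never written to the datastore unless it re-reads after the conflict, which matches the mismatch seen in the x509 errors.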

@davidnuzik davidnuzik added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label May 21, 2021
@rudimk
Author

rudimk commented May 23, 2021

> Just to be clear on the code path here, which I think was pretty likely to have always contained a race condition:
>
> If both servers come up at the same time, they will both find the database empty, create new cluster certificates, and then store that to the database in the bootstrap key. One of them will do so first, the other one will get an error when trying to store the bootstrap data due to the key already existing. This should be a fatal error that causes it to exit, and get restarted by systemd. Due to #3015 it will not re-read the proper certificates from the datastore when it starts up - it will continue on joining the cluster using the different certificates it generated the first time it started, which were never written to the datastore.
>
> It's possible that the improvements for backup/restore made the startup a little faster and therefore more likely to race.

I now see how I managed to run into this error then. Usually I've installed K3s manually; but lately I've been doing it with Ansible. Ansible runs the K3s installer script on each node in parallel, and that's probably how we're running into this.

@brandond
Member

> I now see how I managed to run into this error then. Usually I've installed K3s manually; but lately I've been doing it with Ansible. Ansible runs the K3s installer script on each node in parallel, and that's probably how we're running into this.

If you can reconfigure your playbook to jitter the task initiations by a couple seconds, or run them sequentially, that should fix it.
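
One way to picture the staggering suggested here is a small schedule in which the first server starts immediately and every other server waits a head start plus some random jitter. This is only a sketch: `start_offsets` and its `head_start` and `jitter` defaults are hypothetical values chosen for illustration, not anything K3s or Ansible prescribes. In an Ansible play, sequential execution is usually expressed with the play-level `serial` keyword instead.

```python
import random


def start_offsets(nodes, head_start=10.0, jitter=3.0, seed=None):
    """Map each node to a start delay in seconds.

    The first node starts immediately so it can write the bootstrap data
    uncontested; every other node waits head_start plus a random jitter.
    The default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    offsets = {}
    for i, node in enumerate(nodes):
        offsets[node] = 0.0 if i == 0 else head_start + rng.uniform(0.0, jitter)
    return offsets


offsets = start_offsets(["leankube-master-1", "leankube-master-2"], seed=1)
for node, delay in offsets.items():
    print(f"{node}: start after {delay:.1f}s")
```

The important property is that exactly one server gets to populate the empty datastore before the others look at it; the exact delay matters less than the ordering.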

@rudimk
Author

rudimk commented May 25, 2021

Yep I did that yesterday. Before doing that, I had run those playbooks again to deploy a production K3s cluster with three master nodes. Surprisingly that one had worked just fine - because I was doing it over a VPN and the network lag accidentally ensured the first node was run a couple of seconds before the others. So yes, it’s definitely the race condition you mentioned.

@cjellick cjellick modified the milestones: v1.20.8+k3s1, v1.20.9+k3s1 Jun 15, 2021
@fapatel1

Issue #3015 needs to be completed first; this issue depends on it.

@davidnuzik
Contributor

Needed in July timeframe.

@cjellick
Contributor

I think this issue will need to be ported to 1.21 and master.

@zhoub

zhoub commented Sep 9, 2021

Bump, got the same issue on 1.21.4.

@cwayne18 cwayne18 modified the milestones: v1.20.11+k3s1, v1.20.12+k3s1 Sep 27, 2021
@ShylajaDevadiga
Contributor

Closing issue as it is validated as part of #3015

$ kubectl get nodes
NAME              STATUS   ROLES                  AGE     VERSION
ip-172-31-5-79    Ready    control-plane,master   4m30s   v1.20.13+k3s1
ip-172-31-2-191   Ready    control-plane,master   4m16s   v1.20.13+k3s1
ip-172-31-9-64    Ready    <none>                 3m6s    v1.20.13+k3s1
ip-172-31-4-62    Ready    control-plane,master   6m18s   v1.20.13+k3s1

@rudimk
Author

rudimk commented Nov 19, 2022

Hey guys - we're seeing this again, on K3s 1.20.15, on EC2 nodes running Ubuntu 20.04. Since the nodes are provisioned using an autoscaling group, there's actually no real way to introduce a jitter or delay, in order to ensure nodes don't come up in parallel.

@rudimk
Author

rudimk commented Nov 19, 2022

Okay nvm - just found a comment from @briandowns here: #3950 (comment). Guess it's time to move up to 1.21. :)
