Nodes aren't joining a new cluster when using an external DB #3226
Comments
Can you attach the complete K3s service logs from all the nodes? Are you by any chance running the installer on nodes that were previously a part of a different K3s cluster and have old cert files left on disk?
So something weird happened. After posting this issue, I decided to call it a night and get some shut-eye - except the cluster started working on its own.
As for your question about running the installer on nodes that previously had a different cluster - I thought of that too. But I'd face the same error regardless of whether it's recycled nodes or fresh nodes. The first node - One last thing. Not sure if it matters much, but both these nodes are running K3s with the Here are the logs for the two nodes: logs.zip
Tried another test, with the same installer script and K3s version, but on Ubuntu 18.04 this time. Now it works just fine. Facing a wildly different issue with running the same workflow on Ubuntu 20.04 - using the downloaded kubeconfig results in Since I'm unable to replicate this issue now, looking for guidance on whether you'd like to close this issue or look into it further.
@brandond I might be hitting this issue or something similar on k3s 1.20.6. I can give you full logs if you DM me.
I am guessing that this is related to some of the certificate bootstrap sequencing changes that went into 1.20.6 for backup-restore, but nothing's jumping out at me.
Just to be clear on the code path here, which I think has pretty likely always contained a race condition: If both servers come up at the same time, they will both find the database empty, create new cluster certificates, and then try to store them in the database under the bootstrap key. One of them will do so first; the other one will get an error when trying to store the bootstrap data, because the key already exists. This should be a fatal error that causes it to exit and get restarted by systemd. Due to #3015, it will not re-read the proper certificates from the datastore when it starts up again. Instead it will carry on joining the cluster using the different certificates it generated the first time it started, which were never written to the datastore. It's possible that the improvements for backup/restore made the startup a little faster and therefore more likely to race.
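To make the sequence above concrete, here is a toy model of it in Python. This is not K3s code: `Datastore`, `Server`, `read_phase`, `write_phase`, and the `reread_on_restart` flag are all hypothetical names invented for illustration, and the "certificates" are plain strings. It only sketches the ordering described in the comment, assuming both servers read the empty datastore before either one writes.

```python
class Datastore:
    """Toy stand-in for the external DB; holds a single bootstrap key."""

    def __init__(self):
        self.bootstrap = None

    def create_bootstrap(self, certs):
        # Create-if-absent: fails when another server already stored the key.
        if self.bootstrap is not None:
            raise KeyError("bootstrap key already exists")
        self.bootstrap = certs


class Server:
    """Hypothetical server; local_certs models cert files left on disk."""

    def __init__(self, name, reread_on_restart):
        self.name = name
        self.reread_on_restart = reread_on_restart
        self.local_certs = None

    def read_phase(self, db):
        if self.local_certs is not None and not self.reread_on_restart:
            # Behavior described in the thread: keep the certs from the first start.
            return
        if db.bootstrap is not None:
            # Datastore already bootstrapped: adopt the stored certs.
            self.local_certs = db.bootstrap
            return
        # Datastore looks empty: generate fresh cluster certs locally.
        self.local_certs = "ca-generated-by-" + self.name

    def write_phase(self, db):
        db.create_bootstrap(self.local_certs)


def simulate(reread_on_restart):
    """Return True if both servers end up with the same certificates."""
    db = Datastore()
    first = Server("server-1", reread_on_restart)
    second = Server("server-2", reread_on_restart)

    # Both servers start together: each reads before either writes.
    first.read_phase(db)
    second.read_phase(db)

    first.write_phase(db)       # wins the race
    try:
        second.write_phase(db)  # loses: the key already exists
    except KeyError:
        # Fatal error; the service manager restarts the process.
        second.read_phase(db)

    return first.local_certs == second.local_certs


print("certs match without re-read:", simulate(reread_on_restart=False))
print("certs match with re-read:   ", simulate(reread_on_restart=True))
```

In this model the two servers only agree on certificates when the restarted one re-reads the datastore, which is the gap the comment attributes to #3015.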
I now see how I managed to run into this error then. Usually I've installed K3s manually, but lately I've been doing it with Ansible. Ansible runs the K3s installer script on each node in parallel, and that's probably how we're running into this.
If you can reconfigure your playbook to jitter the task start times by a couple of seconds, or run them sequentially, that should fix it.
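As a rough illustration of the staggering idea, the sketch below computes a start delay per node so that the first server begins immediately and every later one waits a fixed gap plus a little random jitter. The function name `start_delays` and the default values of 5 and 2 seconds are assumptions made for this example; the thread only says "a couple of seconds", and nothing here is part of K3s or Ansible.

```python
import random


def start_delays(nodes, gap_seconds=5.0, jitter_seconds=2.0, seed=None):
    """Map each node name to the number of seconds to wait before installing.

    The first node gets no delay, so it can bootstrap the datastore alone.
    Later nodes wait index * gap_seconds plus a random jitter.
    """
    rng = random.Random(seed)
    delays = {}
    for index, node in enumerate(nodes):
        jitter = 0.0 if index == 0 else rng.uniform(0.0, jitter_seconds)
        delays[node] = index * gap_seconds + jitter
    return delays


if __name__ == "__main__":
    # Hypothetical node names; a provisioning script would sleep for the
    # computed time before running the installer on each node.
    for node, delay in start_delays(["master-1", "master-2", "master-3"]).items():
        print(f"{node}: wait {delay:.1f}s before running the installer")
```

Because the jitter is smaller than the gap in these defaults, the nodes still start in order; the jitter only spreads them out further.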
Yep, I did that yesterday. Before doing that, I had run those playbooks again to deploy a production K3s cluster with three master nodes. Surprisingly, that one had worked just fine - because I was doing it over a VPN and the network lag accidentally ensured the first node was run a couple of seconds before the others. So yes, it’s definitely the race condition you mentioned.
This issue depends on #3015 being completed first.
Needed in July timeframe.
I think this issue will need ports for 1.21, master
Bump, got the same issue on 1.21.4
Closing issue as it is validated as part of #3015
Hey guys - we're seeing this again, on K3s 1.20.15, on EC2 nodes running Ubuntu 20.04. Since the nodes are provisioned using an autoscaling group, there's actually no real way to introduce a jitter or delay, in order to ensure nodes don't come up in parallel.
Okay nvm - just found a comment from @briandowns here: #3950 (comment). Guess it's time to move up to 1.21. :)
Environmental Info:
K3s Version:
v1.20.6+k3s1 (8d04328)
Node(s) CPU architecture, OS, and Version:
Linux leankube-master-2 5.4.0-1038-aws #40-Ubuntu SMP Fri Feb 5 23:50:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
Just two servers for running a Rancher cluster. No agents.
Describe the bug:
When one spins up multiple server nodes with an external DB backend, the first node starts up okay, but the rest don't join the cluster. This is essentially the same behaviour as #3130.
Steps To Reproduce:
./k3sInstaller.sh mysql://user:[email protected]:3306)/rancher" --tls-san Y.Y.Y.Y
Expected behavior:
All nodes join the cluster.
Actual behavior:
The first node works okay. Other nodes don't join the cluster and throw exceptions about being unable to read/verify certificates.
Additional context / logs: