Cluster cannot be created because of (second) server node's issue #619
The second server node (server-1) was unstable during creation: it came up, restarted, came up again, and then shut down (logs), whereas, at the first server node (server-0):
@iwilltry42 sorry for bothering you, but if there are any debugging tricks I can use, please suggest them. Similar issue(s): I'll keep updating this.
Hi @nguyenvulong, thanks for opening this issue!
Running into the same issue. Running with env variable
Edit: It actually seems like node-0 is in a crashloop:
Can you screenshot this?
Sure: For me, server-0 was stuck in a crashloop for quite some time (see server-0 log). After a few minutes it eventually started successfully (see server-0-success log). But now server-1 is stuck in a crashloop (see server-1 log).
For completeness' sake, here are the logs of the other running containers:
Some background info on my system:
@Jasper-Ben thanks, I checked your logs, and I think you have a different problem from mine. Your log has
See #612; you may want to open a new issue if it does not help.
Thanks, I'll have a look 🙂
Edit: Workaround did the trick in my case! 🎉 I hope you find a solution to your problem soon 😕
Glad to hear that, @Jasper-Ben. I'll keep everyone updated on my case if I find something.
After failing with one physical server, I tried to reproduce the experiment on another physical server. The result: one works and one does not.
This is a normally running (physical) server: left side is server-node-0, right side is server-node-1. Pay attention to the running processes.
This is the troublesome (physical) server where I hit the issue; server-node-1 is not working properly (left side is server-node-0, right side is server-node-1).
It is obvious that some services are not running. They are
Do you have any idea how to start them manually, @iwilltry42?
Hi @nguyenvulong, thanks for taking the time to investigate here!
k3d only starts k3s via the docker image; everything else is started by k3s. That means the k3s bootstrap process doesn't finish successfully because, according to the logs, it cannot connect properly to the initializing server node. Maybe this is interesting for @brandond as well, or he has some input from the logs? 🤔
UPDATE: I feel like those log lines are not only "Info" level 🤔
@iwilltry42 Actually, I have tested with 4 physical machines. All the information is below (the failed one is physical machine 2). Let me summarize what I did when trying to fix it:
(I have no idea about this one, and I'm not sure it is even related. But at least the other three physical machines don't have that.)
And thank you so much for spending time on this. I have checked other issues in this repository and some of them may be related (see this comment).
Information of the physical machine 1 (working)
OS
Kernel
Information of the physical machine 2 (NOT working)
OS
Kernel
Docker
Physical Machine 3 (working)
Physical Machine 4 (working)
UPDATE:
One of my wild guesses was that the 2nd node (node-1) may have failed because of something to do with node-0, but let me see if I can find anything...
Wait, I'm confused: are you trying to make a k3d cluster out of 4 separate hosts? Normally you would use k3d to mock up a multi-node cluster on a single host; if you want to build a cluster out of multiple hosts, why not just use K3s?
No @brandond, I was trying to create a k3d cluster on a single physical machine. However, I could not do that due to some errors when k3d tried to spawn a second node (a server node, aka node-1, hence the name of this thread). Because of that, I installed k3d on 3 additional physical machines, independently, and repeated the same test. Their configurations are above. As you can see, only one physical machine has the problem; the rest were fine. P/S.
@nguyenvulong, I just tried, but couldn't replicate your issue with those OS and docker versions 🤔
Can you link them here please, so we can try to connect the dots?
I will update that machine to the latest and let you know within 24h. Thanks a lot for your time.
I linked them here in this #619 (comment)
@iwilltry42
I attach the logs of the two server nodes below.
So I inspected the logs again and something is wrong there with the etcd connection. server-1 is added to the etcd cluster according to the server-0 logs:
Looks like server-1 is then trying to initialize, but right after opening the backend db, it fails to get the members from server-0 and closes etcd:
What's unexpected is that it just goes ahead and starts the API Server:
In the aftermath, this obviously fails to connect to the local etcd, as that was closed already:
server-0 then eventually removes the failed learner:
And this just goes on and on without exiting, so k3d doesn't catch the error 🤔
UPDATE 1: At the same time that server-1 logs the EOF error (after which it shuts down etcd), server-0 logs this:
No clue if that's related, as
@brandond, should this be caught in k3s, so that it doesn't just continue with the init process once the etcd server has failed? 🤔
@nguyenvulong, this means that your issue is different from the others that you've found so far.
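To line up the sequence described above across both servers, it can help to strip each container's log down to the etcd-related lines. This is a minimal sketch, assuming the container names follow the `k3d-test3-server-<n>` pattern seen in this thread; the `etcd_lines` helper and its search pattern are illustrative and not part of k3d or k3s.

```shell
#!/bin/sh
# Keep only log lines that mention etcd, learner membership changes, or an EOF error.
# The pattern is an illustrative guess, not taken from the k3s source.
etcd_lines() {
  grep -iE 'etcd|learner|EOF'
}

# Hypothetical usage (container names assumed from this thread):
#   docker logs k3d-test3-server-0 2>&1 | etcd_lines
#   docker logs k3d-test3-server-1 2>&1 | etcd_lines
```

Comparing the two filtered outputs side by side by timestamp should make it easier to see which side logs its error first.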
Here it is @iwilltry42
Yeah, our etcd ready check clearly isn't handling this edge case well. I do wonder what's going on with your Docker network that's causing the new member to get an EOF when trying to talk to the first member:
@brandond exactly my thought... I'm a bit lost here 🙄
Would it be possible to start up another container in the same network, and run a curl test? Note that this should be done from another container, NOT by execing into the server container.
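One way such a test could look is sketched below. It assumes that k3d names the docker network `k3d-<cluster>`, that the first server container is `k3d-<cluster>-server-0`, that the k3s API listens on port 6443 inside that network, and that the `curlimages/curl` image can be pulled. None of this comes from the thread except the container naming pattern, and the helper functions are hypothetical.

```shell
#!/bin/sh
# Assumption: k3d names the docker network "k3d-<cluster>" and server
# containers "k3d-<cluster>-server-<n>" (the latter matches this thread).
network_name() { printf 'k3d-%s\n' "$1"; }
server_name()  { printf 'k3d-%s-server-%s\n' "$1" "$2"; }

# Run curl from a throwaway container attached to the cluster network,
# not from inside the server container itself.
# Assumptions: curlimages/curl is pullable; 6443 is the API port.
curl_test() {
  cluster="$1"
  port="${2:-6443}"
  docker run --rm --network "$(network_name "$cluster")" curlimages/curl \
    -vk --max-time 5 "https://$(server_name "$cluster" 0):${port}/"
}

# Hypothetical usage:
#   curl_test test3
```

Any HTTP response, even an authorization error, would suggest the network path to server-0 works; a timeout or EOF would point back at the Docker network. The second argument lets you try another port, for example etcd's default peer port 2380, if that is where the EOF shows up.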
I re-ran a test with a new version of
INFO
LOGS --verbose
@nguyenvulong, is that still failing on one single physical host while working on all the other hosts?
Hello, I was finally able to make it work. I'm leaving some commands and logs below for those who might have similar problems.
Commands executed
Create
Original problem: I tried to create a cluster with multiple nodes on a single physical machine; this is the command:
k3d cluster create -a3 -s3 test3 --trace
and the process got stuck forever at this point, while the cluster was starting the second server node:
It took a while for k3d-test3-server-1 (the second server node) to start (it restarted several times before that). As a result, the other nodes (both servers and agents) could not start up. You can see the "created" status in
docker ps
output.
I have tried purging all docker packages and reinstalling them (I tried both docker-ce and the apt package). Nothing has worked so far; it stays stuck forever. The weird thing is that my other Ubuntu servers do not have the problem. Therefore I suspect this is a local issue, BUT I have tried reinstalling docker and even switching to another Linux user just to make sure there is nothing in between k3d and docker. Any suggestions for debugging are really appreciated.
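For anyone trying to reproduce this, a small sketch of how the node states and the restarting node's log could be checked. It assumes the node containers are named `k3d-<cluster>-...`, as with `k3d-test3-server-1` above; the helper names are hypothetical.

```shell
#!/bin/sh
# Build a docker name filter for all nodes of one k3d cluster.
# Assumption: node containers are named "k3d-<cluster>-...".
node_filter() { printf 'name=k3d-%s-\n' "$1"; }

# List every container of the cluster, including ones stuck in "Created".
node_status() {
  docker ps -a --filter "$(node_filter "$1")" \
    --format 'table {{.Names}}\t{{.Status}}'
}

# Show the last lines of one node's log (stdout and stderr).
# The second argument is the number of lines and defaults to 50.
node_tail() {
  docker logs --tail "${2:-50}" "$1" 2>&1
}

# Hypothetical usage:
#   node_status test3
#   node_tail k3d-test3-server-1 100
```

The `-a` flag matters here, because containers that were created but never started do not appear in a plain `docker ps`.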
k3d version
OS version
Docker Info:
Additional logs