HA dqlite seems to require the first server to be always UP #1391
Comments
Is your k3s.yaml pointed at just the one node, or does it point at an address or hostname that will remain available when the first node is down? You might have multiple masters but it doesn't do you much good if you can't talk to any of them but the first one.
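For illustration, a rough sketch of pointing kubectl at a stable address instead of the first master (the hostname below is a made-up example, not something from this thread):
# query the API through a VIP or round-robin DNS name that stays reachable
# when the first master is offline (k3s-api.example.com is hypothetical)
kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml --server https://k3s-api.example.com:6443 get nodes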
Thanks, @brandond. I didn't think it was worth mentioning that k3s.yaml points to all three servers. However, if I address the first server, which is completely down, I get
What do the logs say? I built a 3 node dqlite cluster on Ubuntu in VirtualBox just a day or two ago so I know it should work, mostly. CPU usage seemed to be unusually high so I went back to single master.
I never said the cluster is not working. The most interesting thing in the logs seems to be
When the first server is down, it adds information about failing dqlite:
Oh, and now it seems I get a different message back from kubectl:
How long did you wait after adding the other servers before shutting down the first master? Is it possible that it hadn't finished replicating all the data and reaching quorum? Unfortunately the various k3s controllers don't export any metrics or write many non-error logs so it's quite hard to figure out what's going on, but I wonder if dqlite wasn't ready to run without the master when you shut it down.
Hey! I'm also having some issues when stress testing the nodes in embedded HA with 3 masters and 2 workers.
I am facing the exact same issue. It kinda defies the purpose of HA entirely, as the first node becomes a single point of failure.
I see the same issues with my HA cluster: raft gets really upset when you start knocking out master nodes, especially if one happened to be the leader at the time.
I am also experiencing the same problem on a 2 master ARM cluster. I even have a floating IP using heartbeat so the advertised IP of the cluster has moved over to Master2. I attempted to stop/start k3s on Master2 but it hangs and never successfully resumes. After starting the service back up on Master1 I can start the service on Master2.
@castironclay 2 master nodes do not have a quorum, please see:
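As a rough illustration of the majority rule (assuming the standard Raft quorum of floor(n/2) + 1 voters, which is what dqlite builds on):
# print the quorum size for small cluster sizes
for n in 1 2 3 4 5; do echo "$n masters -> quorum $(( n / 2 + 1 ))"; done
With 2 masters the quorum is 2, so losing either one makes the datastore unavailable; 3 masters tolerate one failure.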
I have hit this bug in my k3s setup in DO. I have 2 nodes in EU and 1 in US. I started the cluster with the node in US, and then added the other two nodes. After stopping the k3s service on the US node, the cluster became unresponsive, and although, after restarting the service a few times, workloads have shifted to the other 2 nodes, it seems that dqlite is still looking for the US node as the leader and is incapable of handing over the leader role. Is there any information you are currently lacking?
I have something similar, essentially the same, with the latest version. I start a 3-node master setup with Vagrant: first I start node1 and wait for it to boot, then node2 and node3. When I shut down node1, nodes 2 and 3 keep working, but as soon as I shut down node2 I get this error on node3 every time I call kubectl: "failed to create dqlite connection: no available dqlite leader server found". If I only have 2 nodes, for example node1 and node2, and I shut down node1, the same happens.
Ran into this as well: take down the initial node and the remaining masters are unable to determine a leader. I feel like this might be a problem with the documentation, as the quickstart for dqlite is different from the rest of the multi-master configs; there is no fixed registration address, just instructions to point other nodes to the initial node by IP address. The remaining server processes are unable to mesh because they all point to one master and can't "figure it out".
@brontide I also noticed the different documentation so i setup a virtual IP with keepalived and pointed the individual nodes to that IP upon cluster creation, however the behaviour stayed the same, even though the IP switched to another node when i took the first node offline. the nodes fail to find a leader.. |
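For context, a minimal sketch of the keepalived VIP approach described above; the interface name, VIP, and priorities are assumptions and would need adjusting per environment:
cat >/etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance K3S_API {
    state BACKUP              # all masters start as BACKUP; highest priority claims the VIP
    interface eth0            # assumption: adjust to the real NIC
    virtual_router_id 51
    priority 100              # use lower values (e.g. 90, 80) on the other masters
    advert_int 1
    virtual_ipaddress {
        192.168.1.100/24      # assumption: the virtual IP the nodes point at
    }
}
EOF
systemctl restart keepalived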
@philkry yeah, nodes would not even come up. Pretty clear that the bootstrapping of the dqlite HA cluster is lacking. I've gone back to single master for now just to get off the ground.
I tried working around this by chaining the master nodes (master3 -> master2 -> master1) and it failed completely. Actually I was hoping to make a ring of 5 master nodes this way, but it was already broken with three nodes, so I gave up.
I'm hoping that they eventually fix the underlying bootstrapping issue. In the meantime I've switched back to single master and am looking to back up the db via script in case I ever need to rebuild or replace the master. Seems like they have a good idea but the execution is still lacking.
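A rough sketch of that backup-by-script idea, assuming a default single-master install (the data dir and service name are the k3s defaults; the backup path is a placeholder):
# stop k3s so the datastore files are consistent, archive the server state, restart
systemctl stop k3s
tar czf /backup/k3s-server-$(date +%F).tar.gz /var/lib/rancher/k3s/server
systemctl start k3s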
Update 11 May 2020: still checking whether HA actually holds up, by analyzing log files and testing. Update: unsure if this is fully solved with v1.18.2+k3s1.
I also have the same issue (with v1.17.4+k3s1). Question: my k3s.yaml is identical on all three master nodes:
Isn't the --server part the problem? All my other subsequent master nodes start/execute with: This is according to https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/ My first master node starts with:
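For comparison, a sketch of the join pattern from the linked HA-embedded docs, but with --server pointed at a fixed registration address rather than the first master's IP (the hostname and token below are placeholders, not from this thread):
# first master bootstraps the cluster
k3s server --cluster-init --token MY_SECRET
# each additional master joins through a stable registration address
k3s server --server https://k3s-api.example.com:6443 --token MY_SECRET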
I can confirm a new cluster running v1.18.2+k3s1 on 3 Pi 4s exhibits this issue. I spun this up using Like @remkolems I'm not sure how later nodes should be joined in terms of configuration. The init script starts k3s with I would guess not, as (Also at the time of writing my node 2 appears to be whining about not finding node 3, all after rebooting node 1 -
Just tested by restarting my first node. During the restart and after a successful restart of the node (and its services), I used the following command multiple times on all other master nodes:
I still get the expected results (Ready/NotReady/Ready), although I didn't dive into the log files yet. Even after restarting my second node, I get the above result. System specification:
Behaviour changed for me with v1.18.2+k3s1 as well, but still no HA. During the restart the logs are flooded with
followed by
Until the first node is back up, I cannot connect to any of the remaining master nodes:
I assume #1770 is meant to replace dqlite with the embedded version of etcd, so I don't think any further fixes will be made to dqlite.
Closing issue as dqlite has been removed from K3s; embedded etcd is now used.
Version:
k3s version v1.17.2+k3s1 (cdab19b)
Describe the bug
three-node HA cluster
cbo1 is the first server
cbo2 and cbo3 joined to it
when either cbo2 or cbo3 is down, the cluster is up
however, when cbo1 is down
To Reproduce
cbo1
/usr/local/bin/k3s server --cluster-init --tls-san cbo1.ip --flannel-backend ipsec
cbo2 and cbo3
/usr/local/bin/k3s server --server https://cbo1.ip:6443 --flannel-backend ipsec
Expected behavior
kubectl get node
Actual behavior
Additional context