reset: etcd not completely reset on k3s cluster #546
On this automated test the reset test failed: a node cannot be re-integrated into the cluster. In the k3s logs on the node I found this error:

```
etcd cluster join failed: duplicate node name found
```

This message looks as if the reset wasn't completely/correctly done. I saw this issue multiple times, but not always, so it is clearly sporadic...
Find attached the Rancher Manager logs as well as the k3s ones from the failing node (at least the logs I was able to generate, as k3s is not running):
Rancher Manager logs
node logs

Comments
Interesting, it looks like the node deletion left some stuff behind... However, I see in the test that the machine inventory was deleted and recreated, hence to some extent the reset functionality was properly applied. I am wondering if we are properly deleting the node from the cluster 🤔 I found issue k3s-io/k3s#2732, which is probably unrelated, but in any case I think it is helpful to get a notion of what to investigate. |
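One way to check whether a deleted node leaves a stale member behind would be to list the etcd members from a surviving control-plane node. A minimal sketch, assuming a default k3s install with embedded etcd and a separately installed etcdctl (the certificate paths below are the k3s defaults):

```sh
# List etcd members on a surviving k3s server node; a member still named
# after the deleted node would indicate the removal did not propagate to etcd.
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member list -w table
```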
Not related I think, and that linked k3s issue was fixed some time ago, but an interesting one, yes. |
Not sure I can pinpoint the root cause here. I tried to look at the logs but could not find anything relevant; I do also miss the ... @ldevulder, if we want to be sure, maybe we can introduce a little step where we check whether a file still exists after reset. Pretty much the opposite of what this test does. Regarding the particular error, this does not look to me like a failed-reset problem. It means we somehow got the same hostname as before: the just-reset machine is trying to join the cluster as an existing node. The fact that the control plane rejects this join is actually proof to me that the reset was successful; otherwise the rejoin should have gone just fine. Do we have any other similar failures, and is it reproducible? |
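Such a verification step could be as simple as dropping a marker file on the persistent partition before reset and asserting it is gone afterwards. A hypothetical sketch; the node address, the /usr/local path (where Elemental usually mounts persistent storage), and the file name are all assumptions, not part of the actual test suite:

```sh
# Placeholder for the host that will be reset.
NODE=node-to-reset.example.com

# Before triggering reset: leave a marker on the persistent partition.
ssh root@"$NODE" 'touch /usr/local/.reset-canary'

# After the node comes back: the marker must be gone if reset wiped the disk.
if ssh root@"$NODE" 'test -e /usr/local/.reset-canary'; then
  echo "FAIL: /usr/local/.reset-canary survived the reset" >&2
  exit 1
fi
echo "OK: persistent data was wiped"
```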
In fact the linked test is a bit old, as the issue has been open since October. Here are a CLI-K3s-Reset-RM_Stable and a CLI-RKE2-Reset-RM_Stable run, maybe with some more information. The reset is triggered by adding this annotation (a sketch of applying it by hand follows this comment):

```
elemental.cattle.io/resettable: "true"
```

And yes, the default settings for reset are used, so ... Regarding the reproducibility: you can see it every day in the CI for the reset tests ;-) It's sporadic (or was in October) but it happens a lot.
As it's sporadic, maybe that's a good idea? |
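For reference, a minimal sketch of applying that annotation by hand to an existing MachineInventory; the fleet-default namespace and the inventory name are assumptions (in the CI the annotation is set through the test configuration):

```sh
# Mark a MachineInventory as resettable so its deletion triggers the
# Elemental reset plan on the host instead of a plain CR removal.
kubectl -n fleet-default annotate machineinventory my-machine \
  elemental.cattle.io/resettable=true
```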
I tried and managed to reproduce the issue (only once) in a 3-node cluster. Logs for context: k3s-fail-3-nodes.txt. It looks like with more than 3 nodes this process is way smoother. I wonder if this is a quorum problem, and the 2 nodes left in the test scenario take some time to agree about the third node's departure. @ldevulder, would it be possible to try with 4 control-plane nodes instead of 3? I wonder if that could help this test. |
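The quorum arithmetic supports that idea: etcd needs floor(n/2) + 1 members to agree, so removing one member from a 3-node cluster leaves 2 members whose quorum is still 2 (zero failure tolerance during the window), while removing one from a 4-node cluster leaves 3 members with quorum 2. A quick illustration:

```sh
# etcd quorum is floor(n/2) + 1; failure tolerance is n - quorum.
for n in 2 3 4 5; do
  echo "members=$n quorum=$(( n / 2 + 1 )) tolerates=$(( n - (n / 2 + 1) )) failure(s)"
done
```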
And I can also confirm the current tests fail for the same reason described in this issue, and the nodes are left in this status: one of them stuck in NotReady.
Another workaround to fix the tests could be to use random machineNames instead of fixed ones (see the sketch after this comment). |
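A hedged sketch of that workaround, assuming a MachineRegistration named my-registration in fleet-default and that the SMBIOS templating supported for machineName is used to make every registration unique; the names and the exact template are assumptions:

```sh
# Derive each machineName from the host's SMBIOS UUID so a re-registered
# (reset) host never collides with a previously used node name.
kubectl -n fleet-default patch machineregistration my-registration \
  --type merge -p '{"spec":{"machineName":"test-node-${System Information/UUID}"}}'
```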
PR #605 should address a corner case where a MachineInventory that is up for deletion may be adopted, leading to the reset plan potentially being replaced with a bootstrap plan before the elemental-system-agent can execute it. That said, there is a major problem with Rancher where it's not possible to reliably scale down a cluster or delete machines directly. I hit this issue on 2.8.1 on a second attempt. I could not reproduce it on 2.7.6 after numerous attempts, but according to the issue it was reproduced on 2.7.6 too. From the elemental-operator perspective, this bug will prevent the actual deletion of the MachineInventories (since the Machine deletion is stuck), so the hosts will never initiate a reset. |
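A quick way to see whether a MachineInventory is stuck in that state would be to check for a deletion timestamp with finalizers still present; a sketch (namespace and name are assumptions):

```sh
# A deletionTimestamp plus lingering finalizers means the CR is waiting on
# something (here: the stuck CAPI Machine deletion) before it can go away.
kubectl -n fleet-default get machineinventory my-machine \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```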
@ldevulder I am removing the waiting-for-upstream label; I don't think this was strictly related to rancher/rancher#43097. I believe it was an issue in our setup, fixed by rancher/elemental#1352. Let's see if it happens again; so far it looks like it is working fine. |
@davidcassany Yes, it's clearly better with your latest fix, thanks a lot! At some point we also had an issue not related to this one that happened in both CLI and UI, but UI has been fine for some time now. I think we can wait 2-3 days to be sure that the reset test is not sporadically failing anymore, and then we can close this issue. |
I retried manually and wasn't able to reproduce the issue, so this one is fixed by rancher/elemental#1352. |