
reset: etcd not completely reset on k3s cluster #546

Closed
ldevulder opened this issue Oct 18, 2023 · 10 comments
@ldevulder
Contributor

ldevulder commented Oct 18, 2023

In this automated test the reset failed: a node could not be re-integrated into the cluster.

[Screenshot from 2023-10-18 10-08-41]

In the k3s logs on the node I found this issue:

Oct 18 07:32:16 node-64f413fe-badd-4de3-8c52-4d0e1484a7fb k3s[32651]: time="2023-10-18T07:32:16Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Oct 18 07:32:16 node-64f413fe-badd-4de3-8c52-4d0e1484a7fb k3s[32651]: time="2023-10-18T07:32:16Z" level=info msg="Handling backend connection request [node-64f413fe-badd-4de3-8c52-4d0e1484a7fb]"
Oct 18 07:32:16 node-64f413fe-badd-4de3-8c52-4d0e1484a7fb k3s[32651]: time="2023-10-18T07:32:16Z" level=fatal msg="etcd cluster join failed: duplicate node name found, please use a unique name for this node"

The etcd cluster join failed: duplicate node name found message suggests that the reset wasn't done completely or correctly.
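If that is the case, the stale registration should still be visible in the etcd membership. A minimal check, assuming etcdctl is available on a healthy server node and that the default k3s etcd certificate paths are in use (both are assumptions):

# list the members known to the embedded etcd; a stale entry for the
# reset node would confirm it was never removed from the cluster
sudo etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list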

I have seen this issue multiple times, but not always, so it is clearly sporadic...

Attached are the Rancher Manager logs as well as the k3s ones from the failing node (at least the logs I was able to generate, as k3s is not running).

Rancher Manager logs
node logs

@ldevulder ldevulder added the kind/bug Something isn't working label Oct 18, 2023
@davidcassany
Contributor

Interesting, it looks like the node deletion left some stuff behind... However, I see in the test that the machine inventory was deleted and recreated, so to some extent the reset functionality was properly applied. I am wondering if we are properly deleting a node from a cluster 🤔

Found this issue k3s-io/k3s#2732, which is probably unrelated, but in any case I think it is helpful to get a notion of what to investigate.

@ldevulder
Contributor Author

| Found this issue k3s-io/k3s#2732, which is probably unrelated, but in any case I think it is helpful to get a notion of what to investigate.

I don't think it's related, and that one was fixed some time ago, but it's an interesting one, yes.

@anmazzotti
Contributor

anmazzotti commented Jan 30, 2024

Not sure I can pinpoint the root cause here.

I tried to look at the logs but could not find anything relevant. I am also missing the elemental-register logs, but I assume resetOem and resetPersistent are passed to the elemental CLI, so the persistent partitions should be formatted as expected. These are the default settings from the test registration.

@ldevulder , if we want to be sure, maybe we can introduce a little step where we try to see if a file still exists after reset. Pretty much the opposite of what this test does.
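A minimal sketch of such a check, assuming SSH access to the node and that /oem is the mounted OEM partition (host name and path are illustrative):

# before triggering reset: drop a canary file on a partition that must be wiped
ssh root@<node> 'echo canary > /oem/reset-canary'
# after reset: the canary must be gone if resetOem was honored
ssh root@<node> 'test ! -f /oem/reset-canary && echo reset wiped /oem'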

Regarding the particular error, this does not look to me like a failed reset problem.
| "etcd cluster join failed: duplicate node name found, please use a unique name for this node"

This means we somehow got the same hostname as before: the just-reset machine is trying to join the cluster as an existing node. To me, the fact that the control plane rejects this join is actually proof that the reset was successful; otherwise the rejoin should have gone just fine.
Is this a fleet problem then? It feels like it was trying to re-provision the deleted node instead of creating a new one. Maybe due to a race condition? We could also wait for the machine to be deleted, not only the machineinventory; that could help.

Do we have any other similar failure and is it reproducible?

@ldevulder
Copy link
Contributor Author

In fact the linked test is a bit old, as the issue has been open since October. Here are a CLI-K3s-Reset-RM_Stable run and a CLI-RKE2-Reset-RM_Stable run, maybe with some more information.

The reset is triggered by adding this to the MachineInventory:

annotations:
    elemental.cattle.io/resettable: "true"

And yes, the default settings for reset are used, i.e. resetOem and resetPersistent.
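For context, a sketch of what those defaults would look like in the MachineRegistration config; the exact field names are my assumption of the elemental schema, not verified here:

apiVersion: elemental.cattle.io/v1beta1
kind: MachineRegistration
spec:
  config:
    elemental:
      reset:
        # wipe the OEM and persistent partitions on reset, then reboot
        reboot: true
        reset-oem: true
        reset-persistent: true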

Regarding the etcd error, this is what I get with K3s; it could well be only a side effect, but it is what I saw. The reset issue also happens with RKE2.

| Do we have any other similar failure and is it reproducible?

Every day; you can see the CI for the reset tests ;-) It's sporadic (or was in October), but it happens a lot.

| We could also wait for the machine to be deleted, not only the machineinventory; that could help.

As it's sporadic, maybe that's a good idea?

@anmazzotti anmazzotti self-assigned this Jan 31, 2024
@anmazzotti
Contributor

anmazzotti commented Jan 31, 2024

I tried and managed to reproduce the issue (only once) in a 3-node cluster.
I also tried to reproduce the same issue with a 4-node cluster.

Logs context: test-3b8dca8d-8d02-4288-8d59-efa71b3d635a is the to-be-deleted node, with IP 192.168.122.254. The logs are taken from a different, still operative control plane node (test-c00616c7-e1e9-422c-a786-9bcd826cd8fd).

k3s-fail-3-nodes.txt
k3s-success-3-nodes.txt
k3s-success-4-nodes.txt

It looks like with more than 3 nodes this process is way smoother. I wonder if this is a quorum problem, and the 2 nodes left in the test scenario take some time to argue about the third node's departure.

@ldevulder would it be possible to try with 4 control plane nodes instead of 3? I wonder if that could help this test.
I tried to look at how to do it, but node_number is already 5 and I'm not sure how to increase the replicas for this test only, or whether that may break things for other tests.

@anmazzotti
Contributor

And I can also confirm that the current tests fail for the same reason described in this issue.
The node joining after reset reports:

k3s[4096]: time="2024-01-31T16:15:16Z" level=fatal msg="ETCD join failed: duplicate node name found, please use a unique name for this node"

and the nodes are left in this state:

NAME                                        STATUS     ROLES                              AGE    VERSION
test-3b8dca8d-8d02-4288-8d59-efa71b3d635a   NotReady   control-plane,etcd,master          5m8s   v1.24.8+k3s1
test-64734e0f-e205-41ab-8861-2785c6faaafe   Ready      control-plane,etcd,master,worker   19m    v1.24.8+k3s1
test-c00616c7-e1e9-422c-a786-9bcd826cd8fd   Ready      control-plane,etcd,master,worker   21m    v1.24.8+k3s1

The NotReady test-3b8dca8d-8d02-4288-8d59-efa71b3d635a node is re-created, as can be seen from its shorter age, so I can't explain why the previous information about this node is still in etcd.
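If a stale member is confirmed (e.g. via the etcdctl member list sketched earlier in this thread), a manual cleanup could look like this; the member ID is a placeholder and this is only a sketch, not a recommended procedure:

# remove the stale member so the re-created node can join under the same name
sudo etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member remove <member-id>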

Another workaround to fix the tests could be to use random machineNames instead of fixed ones.
Not a good solution and probably a bad idea (as we would lose sight of this issue), but it's an option to get the tests green. I'd try with 4 nodes first, however.
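For reference, a sketch of how a non-fixed name could look in the MachineRegistration, assuming the SMBIOS-based templating the operator supports (the exact syntax is an assumption):

spec:
  # derive a unique MachineInventory name from hardware data instead of a fixed one
  machineName: test-${System Information/UUID}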

@anmazzotti
Contributor

anmazzotti commented Feb 2, 2024

PR #605 should address a corner case where a MachineInventory up for deletion may be adopted, leading to the reset plan potentially being replaced with a bootstrap plan before the elemental-system-agent can execute it.

That said, there is a major problem with Rancher where it's not possible to reliably scale down a cluster or delete machines directly.

rancher/rancher#43097

I hit this issue on 2.8.1 on a second attempt. I could not reproduce it on 2.7.6 after numerous attempts, but according to the issue it was reproduced on 2.7.6 too.

From the elemental-operator perspective, this bug will prevent the actual deletion of the MachineInventories (since the Machine deletion is stuck), so the hosts will never initiate a reset.
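A quick way to check whether a Machine deletion is stuck on its finalizers (a sketch; the resource name and the fleet-default namespace are assumptions based on a default Rancher setup):

# a non-empty finalizer list on a Machine with a deletion timestamp means deletion is blocked
kubectl get machine <name> -n fleet-default -o jsonpath='{.metadata.finalizers}'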

@davidcassany
Contributor

@ldevulder I am removing the waiting-for-upstream label; I don't think this was strictly related to rancher/rancher#43097. I believe it was an issue in our setup, fixed by rancher/elemental#1352. Let's see if it happens again; so far it looks like everything is working fine.

@ldevulder
Contributor Author

@davidcassany Yes, it's clearly better with your latest fix, thanks a lot! At some point we also had an issue, not related to this one, that happened in both the CLI and the UI, but the UI has been fine for some time now. I think we can wait 2-3 days to make sure the reset test is no longer failing sporadically, and then we can close this issue.

@ldevulder ldevulder self-assigned this May 15, 2024
@ldevulder
Contributor Author

I retried manually and wasn't able to reproduce the issue, so this one is fixed by rancher/elemental#1352.
