
reset: etcd not completely reset on k3s cluster #546

Closed
ldevulder opened this issue Oct 18, 2023 · 10 comments
@ldevulder
Contributor

ldevulder commented Oct 18, 2023

In this automated test the reset failed: a node could not be re-integrated into the cluster.

[Screenshot from 2023-10-18 10-08-41]

In the k3s logs on the node I found this issue:

Oct 18 07:32:16 node-64f413fe-badd-4de3-8c52-4d0e1484a7fb k3s[32651]: time="2023-10-18T07:32:16Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Oct 18 07:32:16 node-64f413fe-badd-4de3-8c52-4d0e1484a7fb k3s[32651]: time="2023-10-18T07:32:16Z" level=info msg="Handling backend connection request [node-64f413fe-badd-4de3-8c52-4d0e1484a7fb]"
Oct 18 07:32:16 node-64f413fe-badd-4de3-8c52-4d0e1484a7fb k3s[32651]: time="2023-10-18T07:32:16Z" level=fatal msg="etcd cluster join failed: duplicate node name found, please use a unique name for this node"

The etcd cluster join failed: duplicate node name found message suggests that the reset wasn't done completely or correctly.
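If that is the case, the stale registration should still be visible in the etcd membership. A minimal check, assuming etcdctl is available on a healthy server node and that the default k3s etcd certificate paths are in use (both are assumptions):

# list the members known to the embedded etcd; a stale entry for the
# reset node would confirm it was never removed from the cluster
sudo etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list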

I have seen this issue multiple times, but not always, so it is clearly sporadic...

Attached are the Rancher Manager logs as well as the k3s ones from the failing node (at least the logs I was able to generate, as k3s is not running).

Rancher Manager logs
node logs

@ldevulder ldevulder added the kind/bug Something isn't working label Oct 18, 2023
@davidcassany
Contributor

Interesting, it looks like the node deletion left some stuff behind... However, I see in the test that the machine inventory was deleted and recreated, so to some extent the reset functionality was properly applied. I am wondering if we are properly deleting a node from a cluster 🤔

Found this issue k3s-io/k3s#2732, which is probably unrelated, but in any case I think it is helpful to get a notion of what to investigate.

@ldevulder
Contributor Author

| Found this issue k3s-io/k3s#2732, which is probably unrelated, but in any case I think it is helpful to get a notion of what to investigate.

I don't think it's related, and that one was fixed some time ago, but it's an interesting one, yes.

@anmazzotti
Contributor

anmazzotti commented Jan 30, 2024

Not sure I can pinpoint the root cause here.

I tried to look at the logs but could not find anything relevant. I am also missing the elemental-register logs, but I assume resetOem and resetPersistent are passed to the elemental CLI, so the persistent partitions should be formatted as expected. These are the default settings from the test registration.

@ldevulder , if we want to be sure, maybe we can introduce a little step where we try to see if a file still exists after reset. Pretty much the opposite of what this test does.
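A minimal sketch of such a check, assuming SSH access to the node and that /oem is the mounted OEM partition (host name and path are illustrative):

# before triggering reset: drop a canary file on a partition that must be wiped
ssh root@<node> 'echo canary > /oem/reset-canary'
# after reset: the canary must be gone if resetOem was honored
ssh root@<node> 'test ! -f /oem/reset-canary && echo reset wiped /oem'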

Regarding the particular error, this does not look to me like a failed reset problem.
| "etcd cluster join failed: duplicate node name found, please use a unique name for this node"

This means we somehow got the same hostname as before: the just-reset machine is trying to join the cluster as an existing node. To me, the fact that the control plane rejects this join is actually proof that the reset was successful; otherwise the rejoin should have gone just fine.
Is this a fleet problem then? It feels like it was trying to re-provision the deleted node instead of creating a new one. Maybe due to a race condition? We could also wait for the machine to be deleted, not only the machineinventory; that could help.

Do we have any other similar failure and is it reproducible?

@ldevulder
Copy link
Contributor Author

In fact the linked test is a bit old, as the issue has been open since October. Here are a CLI-K3s-Reset-RM_Stable run and a CLI-RKE2-Reset-RM_Stable run, maybe with some more information.

The reset is triggered by adding this to the MachineInventory:

annotations:
    elemental.cattle.io/resettable: "true"

And yes, the default settings for reset are used, i.e. resetOem and resetPersistent.
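For context, a sketch of what those defaults would look like in the MachineRegistration config; the exact field names are my assumption of the elemental schema, not verified here:

apiVersion: elemental.cattle.io/v1beta1
kind: MachineRegistration
spec:
  config:
    elemental:
      reset:
        # wipe the OEM and persistent partitions on reset, then reboot
        reboot: true
        reset-oem: true
        reset-persistent: true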

Regarding the etcd error, this is what I get with K3s; it could well be only a side effect, but it is what I saw. The reset issue also happens with RKE2.

| Do we have any other similar failure and is it reproducible?

Every day; you can see the CI for the reset tests ;-) It's sporadic (or was in October), but it happens a lot.

| We could also wait for the machine to be deleted, not only the machineinventory; that could help.

As it's sporadic, maybe that's a good idea?

@anmazzotti anmazzotti self-assigned this Jan 31, 2024
@anmazzotti
Contributor

anmazzotti commented Jan 31, 2024

I tried and managed to reproduce the issue (only once) in a 3-node cluster.
I also tried to reproduce the same issue with a 4-node cluster.

Logs context: test-3b8dca8d-8d02-4288-8d59-efa71b3d635a is the to-be-deleted node, with IP 192.168.122.254. The logs are taken from a different, still operative control plane node (test-c00616c7-e1e9-422c-a786-9bcd826cd8fd).

k3s-fail-3-nodes.txt
k3s-success-3-nodes.txt
k3s-success-4-nodes.txt

It looks like with more than 3 nodes this process is way smoother. I wonder if this is a quorum problem, and the 2 nodes left in the test scenario take some time to argue about the third node's departure.

@ldevulder would it be possible to try with 4 control plane nodes instead of 3? I wonder if that could help this test.
I tried to look at how to do it, but node_number is already 5 and I'm not sure how to increase the replicas for this test only, or whether that may break things for other tests.

@anmazzotti
Contributor

And I can also confirm that the current tests fail for the same reason described in this issue.
The node joining after reset reports:

k3s[4096]: time="2024-01-31T16:15:16Z" level=fatal msg="ETCD join failed: duplicate node name found, please use a unique name for this node"

and the nodes are left in this state:

NAME                                        STATUS     ROLES                              AGE    VERSION
test-3b8dca8d-8d02-4288-8d59-efa71b3d635a   NotReady   control-plane,etcd,master          5m8s   v1.24.8+k3s1
test-64734e0f-e205-41ab-8861-2785c6faaafe   Ready      control-plane,etcd,master,worker   19m    v1.24.8+k3s1
test-c00616c7-e1e9-422c-a786-9bcd826cd8fd   Ready      control-plane,etcd,master,worker   21m    v1.24.8+k3s1

The NotReady test-3b8dca8d-8d02-4288-8d59-efa71b3d635a node is re-created, as can be seen from its shorter age, so I can't explain why the previous information about this node is still in etcd.
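If a stale member is confirmed (e.g. via the etcdctl member list sketched earlier in this thread), a manual cleanup could look like this; the member ID is a placeholder and this is only a sketch, not a recommended procedure:

# remove the stale member so the re-created node can join under the same name
sudo etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member remove <member-id>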

Another workaround to fix the tests could be to use random machineNames instead of fixed ones.
Not a good solution and probably a bad idea (as we would lose sight of this issue), but it's an option to get the tests green. I'd try with 4 nodes first, however.
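For reference, a sketch of how a non-fixed name could look in the MachineRegistration, assuming the SMBIOS-based templating the operator supports (the exact syntax is an assumption):

spec:
  # derive a unique MachineInventory name from hardware data instead of a fixed one
  machineName: test-${System Information/UUID}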

@anmazzotti
Contributor

anmazzotti commented Feb 2, 2024

PR #605 should address a corner case where a MachineInventory up for deletion may be adopted, leading to the reset plan potentially being replaced with a bootstrap plan before the elemental-system-agent can execute it.

That said, there is a major problem with Rancher where it's not possible to reliably scale down a cluster or delete machines directly.

rancher/rancher#43097

I hit this issue on 2.8.1 on a second attempt. I could not reproduce it on 2.7.6 after numerous attempts, but according to the issue it was reproduced on 2.7.6 too.

From the elemental-operator perspective, this bug will prevent the actual deletion of the MachineInventories (since the Machine deletion is stuck), so the hosts will never initiate a reset.
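A quick way to check whether a Machine deletion is stuck on its finalizers (a sketch; the resource name and the fleet-default namespace are assumptions based on a default Rancher setup):

# a non-empty finalizer list on a Machine with a deletion timestamp means deletion is blocked
kubectl get machine <name> -n fleet-default -o jsonpath='{.metadata.finalizers}'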

@davidcassany
Contributor

@ldevulder I am removing the waiting-for-upstream label; I don't think this was strictly related to rancher/rancher#43097. I believe it was an issue in our setup, fixed by rancher/elemental#1352. Let's see if it happens again; so far it looks like everything is working fine.

@ldevulder
Contributor Author

@davidcassany Yes, it's clearly better with your latest fix, thanks a lot! At some point we also had an issue, not related to this one, that happened in both the CLI and the UI, but the UI has been fine for some time now. I think we can wait 2-3 days to make sure the reset test is no longer failing sporadically, and then we can close this issue.

@ldevulder ldevulder self-assigned this May 15, 2024
@ldevulder
Contributor Author

I retried manually and wasn't able to reproduce the issue, so this one is fixed by rancher/elemental#1352.
