
A deleted Machine can block cluster initialization indefinitely (flake) #5814

Closed · gab-satchi opened this issue Dec 7, 2021 · 6 comments · Fixed by #5824
Labels: kind/bug (Categorizes issue or PR as related to a bug.)
Milestone: v1.1

@gab-satchi (Member)

What steps did you take and what happened:

  • We found a cluster that was stuck initializing the first control plane.
  • The KubeadmConfig for the Machine was looping with "A control plane is already being initialized, requeing until control plane is ready".
  • Looking at the ConfigMap holding the init lock for that cluster, we found it referenced a Machine that no longer exists.

The theory is that the lock failed to release here, and the Machine and its KubeadmConfig were deleted before the next reconcile, leaving the stale ConfigMap lock behind.
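(For context: the init lock is a first-writer-wins semaphore backed by a ConfigMap in the management cluster. Below is a minimal sketch of the acquire path, assuming a "<cluster-name>-lock" ConfigMap with a "lock-information" key naming the owning Machine; the names and keys are illustrative, not the provider's exact API.)

```go
package locking

import (
	"context"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// lockInfo records which Machine holds the control-plane init lock.
type lockInfo struct {
	MachineName string `json:"machineName"`
}

// Acquire attempts to take the init lock for a cluster by creating a
// ConfigMap named "<cluster-name>-lock". Create is atomic on the API
// server, so exactly one Machine wins; all others see AlreadyExists
// and requeue until the lock is released.
func Acquire(ctx context.Context, c client.Client, namespace, clusterName, machineName string) (bool, error) {
	info, err := json.Marshal(lockInfo{MachineName: machineName})
	if err != nil {
		return false, err
	}
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-lock", clusterName),
			Namespace: namespace,
		},
		Data: map[string]string{"lock-information": string(info)},
	}
	if err := c.Create(ctx, cm); err != nil {
		if apierrors.IsAlreadyExists(err) {
			return false, nil // another Machine holds the lock; requeue
		}
		return false, err
	}
	return true, nil
}
```

If the winning Machine is deleted before its reconcile releases the lock, the ConfigMap survives, and every later init attempt takes the AlreadyExists branch: exactly the stuck behaviour described above.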

What did you expect to happen:

  • The lock should eventually be released if the Machine that owns it no longer exists.


Environment:

  • Cluster-api version: v0.3.23

/kind bug

@k8s-ci-robot added the kind/bug label on Dec 7, 2021
@vincepri (Member) commented Dec 7, 2021

/assign @killianmuldoon
Let's investigate whether the lock removal is not re-entrant.

/milestone v1.1
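(On the re-entrancy question: a release that tolerates repeated calls, including calls made after the ConfigMap is already gone, would treat NotFound as success. A sketch under the same illustrative names as the acquire example above:)

```go
package locking

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Release deletes the lock ConfigMap. Treating NotFound as success
// makes the call idempotent: a duplicate or retried release cannot
// fail just because the ConfigMap is already gone.
func Release(ctx context.Context, c client.Client, namespace, clusterName string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-lock", clusterName),
			Namespace: namespace,
		},
	}
	if err := c.Delete(ctx, cm); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}
```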

@k8s-ci-robot added this to the v1.1 milestone on Dec 7, 2021
@randomvariable (Member)

As a hint for reproducing this: create more workload clusters simultaneously than the controller concurrency limit allows. The issue was reported while CAPV was being used. It would be interesting to see if it can be reproduced using CAPD.
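(For anyone attempting this repro: in controller-runtime, per-controller concurrency is set with controller.Options.MaxConcurrentReconciles. A minimal sketch of pinning the KubeadmConfig controller to a concurrency of 1; the reconciler wiring and API version path are illustrative.)

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1beta1"
)

// setupKubeadmConfigController registers a reconciler for KubeadmConfig
// objects with MaxConcurrentReconciles pinned to 1. With the limit this
// low, creating several clusters at once queues their reconciles behind
// each other, which is the contention the repro hint aims to create.
func setupKubeadmConfigController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&bootstrapv1.KubeadmConfig{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}).
		Complete(r)
}
```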

@killianmuldoon (Contributor)

> As a hint for reproducing this: create more workload clusters simultaneously than the controller concurrency limit allows. The issue was reported while CAPV was being used. It would be interesting to see if it can be reproduced using CAPD.

I can't seem to reproduce this on CAPD as described. I set the concurrency limits for clusters, Machines, and kubeadm bootstrap to 1 (and tried a couple of combinations of set and unset), and I'm still able to initialize 10 cluster control planes simultaneously (though it's heating the room nicely 😄).

@killianmuldoon (Contributor)

I haven't been able to reproduce the flake, but there is no unlocking mechanism for the ControlPlaneInitMutex if a Machine is deleted. I was able to reproduce that locally, and I've put a fix into the locking mechanism in #5824 that checks whether the existing lock is still valid.

@gab-satchi do you have some way to reproduce this flake?
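(The shape of the fix, sketched under the same illustrative names as above; #5824 is the authoritative change. The idea is to check whether the Machine recorded in the lock still exists before treating the lock as held:)

```go
package locking

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// isLockStale reports whether the Machine recorded in the lock
// ConfigMap no longer exists. A stale lock can be deleted and
// re-acquired instead of requeueing forever.
func isLockStale(ctx context.Context, c client.Client, cm *corev1.ConfigMap) (bool, error) {
	var info struct {
		MachineName string `json:"machineName"`
	}
	if err := json.Unmarshal([]byte(cm.Data["lock-information"]), &info); err != nil {
		return false, err
	}
	machine := &clusterv1.Machine{}
	key := client.ObjectKey{Namespace: cm.Namespace, Name: info.MachineName}
	if err := c.Get(ctx, key, machine); err != nil {
		if apierrors.IsNotFound(err) {
			return true, nil // the holder is gone: the lock is stale
		}
		return false, err
	}
	return false, nil
}
```

In practice the caller would delete the stale ConfigMap and retry the acquire within the same reconcile.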

@killianmuldoon (Contributor)

#5824 should resolve this issue. It should be reopened if the test continues to be flaky.

@killianmuldoon (Contributor)

#5855 ports this to release 1.0.x
#5856 ports this to release 0.4.x
#5860 ports this to release 0.3.x
