
A deleted Machine can block cluster initialization indefinitely (flake) #5814

Closed · gab-satchi opened this issue Dec 7, 2021 · 6 comments · Fixed by #5824
Labels: kind/bug (Categorizes issue or PR as related to a bug.)
Milestone: v1.1

@gab-satchi (Member)

What steps did you take and what happened:

  • We found a cluster that was stuck initializing the first control plane.
  • The KubeadmConfig for the Machine was looping with "A control plane is already being initialized, requeing until control plane is ready".
  • Looking at the ConfigMap holding the init lock for that cluster, we found it referenced a Machine that no longer exists.

The theory is that the lock failed to release here, and the Machine and its KubeadmConfig were deleted before the next reconcile, leaving the stale ConfigMap lock behind.
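(For context: the init lock is a first-writer-wins semaphore backed by a ConfigMap in the management cluster. Below is a minimal sketch of the acquire path, assuming a "<cluster-name>-lock" ConfigMap with a "lock-information" key naming the owning Machine; the names and keys are illustrative, not the provider's exact API.)

```go
package locking

import (
	"context"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// lockInfo records which Machine holds the control-plane init lock.
type lockInfo struct {
	MachineName string `json:"machineName"`
}

// Acquire attempts to take the init lock for a cluster by creating a
// ConfigMap named "<cluster-name>-lock". Create is atomic on the API
// server, so exactly one Machine wins; all others see AlreadyExists
// and requeue until the lock is released.
func Acquire(ctx context.Context, c client.Client, namespace, clusterName, machineName string) (bool, error) {
	info, err := json.Marshal(lockInfo{MachineName: machineName})
	if err != nil {
		return false, err
	}
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-lock", clusterName),
			Namespace: namespace,
		},
		Data: map[string]string{"lock-information": string(info)},
	}
	if err := c.Create(ctx, cm); err != nil {
		if apierrors.IsAlreadyExists(err) {
			return false, nil // another Machine holds the lock; requeue
		}
		return false, err
	}
	return true, nil
}
```

If the winning Machine is deleted before its reconcile releases the lock, the ConfigMap survives, and every later init attempt takes the AlreadyExists branch: exactly the stuck behaviour described above.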

What did you expect to happen:

  • The lock should eventually be released if the Machine that owns it no longer exists.


Environment:

  • Cluster-api version: v0.3.23

/kind bug

@k8s-ci-robot added the kind/bug label on Dec 7, 2021
@vincepri (Member) commented Dec 7, 2021

/assign @killianmuldoon
Let's investigate whether the lock removal is not re-entrant.

/milestone v1.1
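(On the re-entrancy question: a release that tolerates repeated calls, including calls made after the ConfigMap is already gone, would treat NotFound as success. A sketch under the same illustrative names as the acquire example above:)

```go
package locking

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Release deletes the lock ConfigMap. Treating NotFound as success
// makes the call idempotent: a duplicate or retried release cannot
// fail just because the ConfigMap is already gone.
func Release(ctx context.Context, c client.Client, namespace, clusterName string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-lock", clusterName),
			Namespace: namespace,
		},
	}
	if err := c.Delete(ctx, cm); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}
```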

@k8s-ci-robot added this to the v1.1 milestone on Dec 7, 2021
@randomvariable (Member)

As a hint for reproducing this: create more workload clusters simultaneously than the controller concurrency limit allows. The issue was reported while CAPV was being used. It would be interesting to see if it can be reproduced using CAPD.
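(For anyone attempting this repro: in controller-runtime, per-controller concurrency is set with controller.Options.MaxConcurrentReconciles. A minimal sketch of pinning the KubeadmConfig controller to a concurrency of 1; the reconciler wiring and API version path are illustrative.)

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1beta1"
)

// setupKubeadmConfigController registers a reconciler for KubeadmConfig
// objects with MaxConcurrentReconciles pinned to 1. With the limit this
// low, creating several clusters at once queues their reconciles behind
// each other, which is the contention the repro hint aims to create.
func setupKubeadmConfigController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&bootstrapv1.KubeadmConfig{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}).
		Complete(r)
}
```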

@killianmuldoon (Contributor)

> As a hint for reproducing this: create more workload clusters simultaneously than the controller concurrency limit allows. The issue was reported while CAPV was being used. It would be interesting to see if it can be reproduced using CAPD.

I can't seem to reproduce this on CAPD as described. I set the concurrency limits for clusters, Machines, and kubeadm bootstrap to 1 (and tried a couple of combinations of set and unset), and I'm still able to initialize 10 cluster control planes simultaneously (though it's heating the room nicely 😄).

@killianmuldoon (Contributor)

I haven't been able to reproduce the flake, but there is no unlocking mechanism for the ControlPlaneInitMutex if a Machine is deleted. I was able to reproduce that locally, and I've put a fix into the locking mechanism in #5824 that checks whether the existing lock is still valid.

@gab-satchi do you have some way to reproduce this flake?
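(The shape of the fix, sketched under the same illustrative names as above; #5824 is the authoritative change. The idea is to check whether the Machine recorded in the lock still exists before treating the lock as held:)

```go
package locking

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// isLockStale reports whether the Machine recorded in the lock
// ConfigMap no longer exists. A stale lock can be deleted and
// re-acquired instead of requeueing forever.
func isLockStale(ctx context.Context, c client.Client, cm *corev1.ConfigMap) (bool, error) {
	var info struct {
		MachineName string `json:"machineName"`
	}
	if err := json.Unmarshal([]byte(cm.Data["lock-information"]), &info); err != nil {
		return false, err
	}
	machine := &clusterv1.Machine{}
	key := client.ObjectKey{Namespace: cm.Namespace, Name: info.MachineName}
	if err := c.Get(ctx, key, machine); err != nil {
		if apierrors.IsNotFound(err) {
			return true, nil // the holder is gone: the lock is stale
		}
		return false, err
	}
	return false, nil
}
```

In practice the caller would delete the stale ConfigMap and retry the acquire within the same reconcile.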

@killianmuldoon (Contributor)

#5824 should resolve this issue. It should be reopened if the test continues to be flaky.

@killianmuldoon (Contributor)

#5855 ports this to release 1.0.x
#5856 ports this to release 0.4.x
#5860 ports this to release 0.3.x
