KCP waiting on remote cluster client hangs worker, substantially delaying cluster deletion #2612
Comments
I wonder if it would make sense to move the deferred call to …
We should probably also increase the default concurrency on KubeadmControlPlane reconciliation 😂
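For reference, a minimal sketch of where that knob lives in controller-runtime (not the actual KCP setup code; the value 10 follows the suggestion below): `MaxConcurrentReconciles` is set when the reconciler is registered with the manager.

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1alpha3"
)

// SetupWithManager registers the reconciler with a higher worker count.
// Note: MaxConcurrentReconciles only parallelises reconciles of distinct
// objects; a single object is still reconciled serially.
func (r *KubeadmControlPlaneReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&controlplanev1.KubeadmControlPlane{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
		Complete(r)
}
```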
I don't necessarily think either of the fixes above is a release blocker for v0.3.0, but both are definitely candidates for a fast-follow patch release.
/milestone v0.3.x
Increasing the default concurrency is a different problem and won't help in this case, but +1 on increasing it to 10 regardless. The reason it won't help is that a controller's concurrency setting only lets distinct objects be reconciled in parallel; the same object is never reconciled concurrently. Since each cluster has only one KCP resource, the hang is still a problem. It will, however, improve the situation when multiple control planes are managed by this controller. I'm +1 on adding some concurrency within the control loop: one goroutine for each delete (== one goroutine per node) sounds pretty reasonable. This would be similar to what we had originally (before health checks).
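To illustrate the "one goroutine per delete" idea, here is a hedged sketch (the helper name, signature, and error aggregation are assumptions, not the actual KCP code): each Machine delete is issued in its own goroutine so a slow delete doesn't serialise the whole teardown.

```go
package controllers

import (
	"context"
	"sync"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	kerrors "k8s.io/apimachinery/pkg/util/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteMachinesConcurrently is a hypothetical helper: it issues one
// Delete per Machine in parallel and aggregates any errors, instead of
// deleting machines one at a time inside the control loop.
func deleteMachinesConcurrently(ctx context.Context, c client.Client, machines []*clusterv1.Machine) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for i := range machines {
		m := machines[i]
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := c.Delete(ctx, m); err != nil && !apierrors.IsNotFound(err) {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	// NewAggregate returns nil when the slice is empty.
	return kerrors.NewAggregate(errs)
}
```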
The concurrency comment was more of an aside. I do think we need to consider moving the call to updateStatus to a defer that is configured after …
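A minimal sketch of that suggestion, with simplified, assumed helper signatures (`reconcileDelete`, `reconcile`, and `updateStatus` stand in for the real methods, and the imports/fields of the surrounding reconciler are assumed): the deletion path returns before the status-update defer is ever registered, so a deleting control plane never blocks waiting on the workload cluster.

```go
// Sketch only: helper method signatures are simplified assumptions.
func (r *KubeadmControlPlaneReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	ctx := context.Background()

	kcp := &controlplanev1.KubeadmControlPlane{}
	if err := r.Client.Get(ctx, req.NamespacedName, kcp); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}

	// Short-circuit deletion before any defer that may need a remote
	// cluster client is set up.
	if !kcp.ObjectMeta.DeletionTimestamp.IsZero() {
		return r.reconcileDelete(ctx, kcp)
	}

	// Only register the status update once we know we are not deleting.
	defer func() {
		if err := r.updateStatus(ctx, kcp); err != nil {
			r.Log.Error(err, "failed to update KubeadmControlPlane status")
		}
	}()

	return r.reconcile(ctx, kcp)
}
```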
I'm thinking of investigating this further. Has anyone seen this behavior with CAPA? |
@wfernandes you should be able to replicate by attempting to delete prior to the first control plane Machine becoming ready. |
/assign |
/lifecycle active |
It seems that this issue was fixed as part of this PR: #2708. |
@wfernandes: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What steps did you take and what happened:
In CAPV, I created a new cluster and then deleted it. With the default of 1 worker, the KCP controller seemingly hangs after the machines are deleted, eventually printing the following (I added some extra log lines in my branch to trace what's going on):
What did you expect to happen:
Resources to be deleted within 60s.
Anything else you would like to add:
Increasing workers would be a suitable workaround for now. Finding a way to synchronise on a channel and cancel outstanding reconciliations on delete might be another potential optimisation; a rough sketch of that idea follows.
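A hypothetical sketch of that cancellation idea (none of these names exist in the codebase): keep a per-object `context.CancelFunc`, and cancel it from a delete event handler so any blocked wait inside an in-flight reconcile returns promptly instead of holding the worker until a timeout.

```go
package controllers

import (
	"context"
	"sync"
)

// reconcileCanceller tracks one cancellable context per object so that
// a deletion event can abort a long-running, in-flight reconcile.
type reconcileCanceller struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc // keyed by namespace/name
}

func newReconcileCanceller() *reconcileCanceller {
	return &reconcileCanceller{cancels: map[string]context.CancelFunc{}}
}

// contextFor returns a context for one reconcile pass, cancelling any
// context previously registered for the same object.
func (c *reconcileCanceller) contextFor(parent context.Context, key string) context.Context {
	ctx, cancel := context.WithCancel(parent)
	c.mu.Lock()
	defer c.mu.Unlock()
	if prev, ok := c.cancels[key]; ok {
		prev()
	}
	c.cancels[key] = cancel
	return ctx
}

// cancel aborts any in-flight reconcile for the object, e.g. from a
// delete event handler, so blocked waits return promptly.
func (c *reconcileCanceller) cancel(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cancelFn, ok := c.cancels[key]; ok {
		cancelFn()
		delete(c.cancels, key)
	}
}
```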
Environment:
- Kubernetes version: (use kubectl version)
- OS (e.g. from /etc/os-release)

/kind bug