Failed first control plane node creates unrecoverable failure #2960
Comments
There is a unique challenge here. If we can differentiate between a control plane that never initialized and one that existed but is no longer there, then I think we can safely retry initialization after a failed creation. I'm not sure we'd ever want to "re-initialize" a control plane that was present but is no longer there, since recreating the "control plane" despite the data loss incurred in the process would potentially cause more confusion than it would resolve.
I'm completely out of my depth here, but wouldn't CABPK or KCP have knowledge as to whether the control plane ever initialized or not? I completely agree we'd want to avoid "re-initializing" a control plane.
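As to where that knowledge could live: KubeadmControlPlane exposes an initialized flag in its status, which is the kind of signal that could distinguish "never initialized" from "initialized but since lost". The sketch below is only an illustration of reading that flag from outside the controllers, not KCP's actual reconciliation logic; the v1alpha3 API version, the object name my-cluster-control-plane and the default namespace are assumptions.

```go
// Illustrative only: read the KubeadmControlPlane status with a dynamic client
// to tell "never initialized" apart from "initialized but since lost".
// The API version, object name and namespace below are placeholder assumptions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	kcpGVR := schema.GroupVersionResource{
		Group:    "controlplane.cluster.x-k8s.io",
		Version:  "v1alpha3", // assumed; match the installed Cluster API version
		Resource: "kubeadmcontrolplanes",
	}

	kcp, err := client.Resource(kcpGVR).Namespace("default").
		Get(context.TODO(), "my-cluster-control-plane", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	initialized, found, err := unstructured.NestedBool(kcp.Object, "status", "initialized")
	if err != nil {
		panic(err)
	}

	if found && initialized {
		// The control plane came up at least once; "re-initializing" it would
		// discard etcd data, so this case should not be retried automatically.
		fmt.Println("control plane was initialized at some point: do not re-init")
	} else {
		// Never initialized: retrying creation on a fresh Machine would be
		// safe, which is the narrow scenario discussed in this issue.
		fmt.Println("control plane never initialized: retry should be safe")
	}
}
```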
/priority awaiting-more-evidence
CABPK is very much in the fire-and-forget category.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
/milestone v0.4.0
We should probably tackle this as part of KCP in a very narrow scenario (initialized).
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
/milestone v0.4.x
/close
As of today the easiest path is to delete and re-create the cluster.
@fabriziopandini: Closing this issue.
In response to this:
> /close
> As of today the easiest path is to delete and re-create the cluster.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
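For completeness, the "delete and re-create" path mentioned in the closing comment boils down to deleting the Cluster object (Cluster API cascades the deletion to the Machines, the KubeadmControlPlane and the provider infrastructure) and then re-applying the original manifests. A minimal sketch, assuming the v1alpha3 API and treating deleteWorkloadCluster, the namespace and the cluster name as placeholders:

```go
// Hypothetical helper sketching the "delete and re-create" workaround.
package workaround

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

func deleteWorkloadCluster(ctx context.Context, c dynamic.Interface, namespace, name string) error {
	clusterGVR := schema.GroupVersionResource{
		Group:    "cluster.x-k8s.io",
		Version:  "v1alpha3", // assumed; match the installed Cluster API version
		Resource: "clusters",
	}
	// Deleting the Cluster is enough: the Cluster API controllers tear down the
	// owned Machines and infrastructure before the Cluster object itself goes
	// away. Afterwards the original cluster manifests can be applied again.
	return c.Resource(clusterGVR).Namespace(namespace).Delete(ctx, name, metav1.DeleteOptions{})
}
```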
What steps did you take and what happened:
I created a workload cluster on AWS that would use existing infrastructure. The instance backing the first control plane node failed unexpectedly (some sort of AWS failure), generating errors like this in the CAPA controller manager log:
In talking with @randomvariable, he suggested deleting the Machine object. I did, and that removed the failed EC2 instance and created new Machine and AWSMachine objects, but further reconciliation never happened. Naadir indicated that this was "technically correct" behavior and that the KCP would never recover, but suggested raising this issue to possibly improve the docs or surface errors better.
What did you expect to happen:
I expected the workload cluster to be created.
Anything else you would like to add:
N/A
Environment:
- Kubernetes version (use kubectl version): v1.16.4 (management), v1.17.3 (workload)
- OS (e.g. from /etc/os-release): Ubuntu 18.04

/kind bug