🏃CAPD automatically re-create a machine if there is an error during provisioning #3004
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fabriziopandini The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest
One comment, @fabriziopandini would you mind changing the PR title to something meaningful for the release notes?
	defer func() {
		if retErr != nil && !dockerMachine.Spec.Bootstrapped {
			log.Info(fmt.Sprintf("%v, cleaning up so we can re-provision from a clean state", retErr))
			if err := externalMachine.Delete(ctx); err != nil {
				log.Info("Failed to cleanup machine")
			}
			res = ctrl.Result{RequeueAfter: 10 * time.Second}
			retErr = nil
		}
	}()
If I understand this correctly this would delete the underlying infrastructure regardless of the returned error if the actual docker machine was never bootstrapped.
Should we try instead to add some retry logic in the container bootstrapping mechanism, or do we prefer to do it in the controller here?
Looking at the error logs, when the container does not start properly we start getting weird errors like "can't create pki folder", and those errors do not go away after many retries :-(
/milestone v0.3.6
/test pull-cluster-api-e2e
I think this will improve the developer UX and reduce test flakes. On the other hand, it can mask the root cause of the bootstrap issues, right? @fabriziopandini Would it be possible to log the bootstrap issues, and then delete the container?
Since this would help reduce test flakiness, I think we can address logging errors in a later PR. Thanks @fabriziopandini! /lgtm
What this PR does / why we need it:
This PR makes CAPD recover when the Docker provider creates a container for a machine but, for some reason, the container is not fully operational or does not complete all the provisioning steps.
More specifically, given that the CAPD provisioning time is small, the PR cleans up containers with provisioning errors and re-provisions them from scratch.
Which issue(s) this PR fixes:
Fixes #2999
Fixes #2341
/assign @vincepri
/assign @sedefsavas