🏃CAPD automatically re-create a machine if there is an error during provisioning #3004

Merged
@@ -152,7 +152,7 @@ func (r *DockerMachineReconciler) Reconcile(req ctrl.Request) (_ ctrl.Result, re
return r.reconcileNormal(ctx, machine, dockerMachine, externalMachine, externalLoadBalancer, log)
}

func (r *DockerMachineReconciler) reconcileNormal(ctx context.Context, machine *clusterv1.Machine, dockerMachine *infrav1.DockerMachine, externalMachine *docker.Machine, externalLoadBalancer *docker.LoadBalancer, log logr.Logger) (ctrl.Result, error) {
func (r *DockerMachineReconciler) reconcileNormal(ctx context.Context, machine *clusterv1.Machine, dockerMachine *infrav1.DockerMachine, externalMachine *docker.Machine, externalLoadBalancer *docker.LoadBalancer, log logr.Logger) (res ctrl.Result, retErr error) {
// If the DockerMachine doesn't have finalizer, add it.
controllerutil.AddFinalizer(dockerMachine, infrav1.MachineFinalizer)

@@ -176,6 +176,19 @@ func (r *DockerMachineReconciler) reconcileNormal(ctx context.Context, machine *
role = constants.ControlPlaneNodeRoleValue
}

// Defining a cleanup func that will delete a machine when there are error during provisioning, so the operation
// can be re-tried from a clean state when the next reconcile happens (in 10 seconds)
defer func() {
if retErr != nil && !dockerMachine.Spec.Bootstrapped {
log.Info(fmt.Sprintf("%v, cleaning up so we can re-provision from a clean state", retErr))
if err := externalMachine.Delete(ctx); err != nil {
log.Info("Failed to cleanup machine")
}
res = ctrl.Result{RequeueAfter: 10 * time.Second}
retErr = nil
}
}()
Comment on lines +181 to +190
Member

If I understand this correctly this would delete the underlying infrastructure regardless of the returned error if the actual docker machine was never bootstrapped.

Should we try instead to add some retry logic in the container bootstrapping mechanism, or do we prefer to do it in the controller here?

Member Author

Looking at the error logs, when the container does not start properly we start getting weird errors like "can't create pki folder", and those errors do not go away even after many retries :-(


if err := externalMachine.Create(ctx, role, machine.Spec.Version, dockerMachine.Spec.ExtraMounts); err != nil {
return ctrl.Result{}, errors.Wrap(err, "failed to create worker DockerMachine")
}
2 changes: 1 addition & 1 deletion test/infrastructure/docker/docker/machine.go
@@ -142,7 +142,7 @@ func (m *Machine) Create(ctx context.Context, role string, version *string, moun
}
// After creating a node we need to wait a small amount of time until crictl does not return an error.
// This fixes an issue where we try to kubeadm init too quickly after creating the container.
err = wait.PollImmediate(500*time.Millisecond, 2*time.Second, func() (bool, error) {
err = wait.PollImmediate(500*time.Millisecond, 4*time.Second, func() (bool, error) {
ps := m.container.Commander.Command("crictl", "ps")
return ps.Run(ctx) == nil, nil
})