🐛 clusterctl move preflight check can fail even if the target cluster is up and running #4870
Comments
cc @arunmk
My two cents
WRT how to handle init & move, which is the primary use case this command was designed for, the recommended procedure is described in https://cluster-api.sigs.k8s.io/clusterctl/commands/move.html#bootstrap--pivot, where the key step for this thread is most probably "Wait for the target management cluster to be up and running" (--> provisioned). FYI, a recent attempt to redefine the scope of clusterctl, which also included identifying the different requirements for using move after the initial bootstrap scenario, did not gain enough traction (see #3354).
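For concreteness, here is a minimal sketch of what "wait for the target management cluster to be up and running" could look like in practice, assuming a cluster named `my-cluster` in the `default` namespace (both placeholders); the exact readiness criteria are, of course, what this thread is trying to pin down:

```bash
# On the bootstrap cluster: wait for the Cluster object to report Ready.
kubectl wait --for=condition=Ready cluster/my-cluster --namespace default --timeout=15m

# Fetch the target cluster's kubeconfig and wait for all of its nodes to be Ready.
clusterctl get kubeconfig my-cluster --namespace default > target.kubeconfig
kubectl --kubeconfig target.kubeconfig wait --for=condition=Ready nodes --all --timeout=15m
```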
Thank you for joining in the discussion, @fabriziopandini! 🙏
To execute this step, I need some criteria that tell me that my cluster is "up and running." If I understand you correctly, you are saying that the criteria may be different for each cluster, and that the clusterctl preflight checks are a necessary subset of the criteria, but not the complete criteria:
I don't understand why this claim must be true. AFAIK, we expect the controllers to be "re-entrant," i.e., to be able to resume no matter when they crash. (In practice, this means that the controllers' actions must be idempotent.) Is there a well-understood difference between the effect of "move" on all cluster objects and the simultaneous crash (and restart) of all provider controllers? (I do not see a difference, but I may be missing something important!) If we can answer this question, I think we can make progress in this thread.
IMO having all the machines in a cluster provisioned is a good signal; if you want to stay on the safe side, you could also check for all the nodes to be in a Ready state.
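To make that concrete, a rough sketch of checking both signals from the management cluster; the cluster name, namespace, and kubeconfig path are placeholders, and this is only one way to approximate "provisioned":

```bash
# List every Machine in the cluster with its phase and nodeRef;
# moving should be safer once all of them are Provisioned/Running with a node assigned.
kubectl get machines -n default -l cluster.x-k8s.io/cluster-name=my-cluster \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name

# Optionally, also confirm that all nodes in the workload cluster are Ready.
kubectl --kubeconfig target.kubeconfig get nodes
```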
Let's start by saying that I'm more than happy to re-open the discussion about redefining the scope of move or changing its requirements if more people are interested in it now (my previous attempt failed). WRT the previous comment, I was not really making a claim, just noting that it might happen that you end up with some orphaned infrastructure, and I fully agree with you that this should not happen if re-entrancy is properly implemented by all the providers. But why take the risk if we can avoid it? Considering the above, the fact that we are focused on the init & move use case only (which is typically a "controlled" sequence), and the intrinsic complexity of move (see e.g. the recent changes for supporting global credentials), I personally think that the existing pre-flight checks are still reasonable, and if it is possible to avoid moving while things are changing, the better (even if this could be seen as overprotection); but as I said, I'm happy to reconsider.
Ok, let's say this is what we recommend. How do we expect a user to get this signal? Do we expect users to implement their own solutions? Can we provide a …? Would adding Conditions in MachineDeployment (#3486) help?
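For what it's worth, one way a user can approximate this signal today is `clusterctl describe cluster`, which rolls up the conditions of the cluster, its control plane, and its machines into a single view (assuming a clusterctl version that supports the `--show-conditions` flag); this is just an observation, not a substitute for a documented criterion:

```bash
# Print the tree of objects for the cluster and the conditions that are not yet True.
clusterctl describe cluster my-cluster --namespace default --show-conditions all
```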
/milestone Next
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What steps did you take and what happened:
I deployed a workload cluster. I then deployed the CAPI controllers on the workload cluster. Finally, I tried to move the workload cluster objects from the bootstrap cluster to the workload cluster. But `clusterctl move` failed one of its preflight checks. The error is created here:

cluster-api/cmd/clusterctl/client/cluster/mover.go, lines 164 to 166 at commit 9bb9c95
In this case, the preflight check failed because the KubeadmControlPlane controller was scaling up the control plane, and the newest Machine did not yet have a NodeRef. It's worth noting that this preflight check can fail under other circumstances as well, for example when a MachineDeployment is scaling up.
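As a purely diagnostic sketch (not part of clusterctl), this is one way to see which Machines are tripping the check, i.e. which ones have no nodeRef yet; it assumes `jq` is available and uses a placeholder namespace:

```bash
# List Machines that do not yet have status.nodeRef set; these are the
# objects the move preflight check complains about.
kubectl get machines -n default -o json \
  | jq -r '.items[] | select(.status.nodeRef == null) | .metadata.name'
```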
What did you expect to happen:
I want to understand why this preflight check is required. The workload cluster is up and running. I think the objects can be safely moved, even if some Machine is missing a NodeRef, but I'm not certain. (@randomvariable mentioned in the 06/30/2021 CAPI meeting that this check might have been introduced to make sure that all control plane replicas had finished deploying.)
This preflight check might fail any time a new Machine is created. So, if the preflight check is required, I want to understand how to proceed when it fails: should I run `clusterctl move` repeatedly until it passes, or is there something I should wait for before running `clusterctl move`?
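If the answer turns out to be "just retry", I imagine it would look something like the loop below; this is purely illustrative, the kubeconfig path is a placeholder, and it is not an endorsed workflow:

```bash
# Naive retry: re-run clusterctl move until its preflight checks pass.
until clusterctl move --to-kubeconfig=target.kubeconfig; do
  echo "preflight checks failed, retrying in 30s..." >&2
  sleep 30
done
```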
Environment:
- Kubernetes version (use `kubectl version`): v1.21.1
- OS (e.g. from `/etc/os-release`):

/kind bug
/area clusterctl
@fabriziopandini