🐛 clusterctl move preflight check can fail even if the target cluster is up and running #4870
Comments
cc @arunmk
My two cents
WRT how to handle init & move, which is the primary use case this command was designed for, the recommended procedure is described in https://cluster-api.sigs.k8s.io/clusterctl/commands/move.html#bootstrap--pivot, where the key step for this thread is most probably "Wait for the target management cluster to be up and running" (--> provisioned). FYI, a recent attempt to redefine the scope of clusterctl, which also included identifying the different requirements for using move after the initial bootstrap scenario, did not gain enough traction (see #3354).
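For concreteness, here is a minimal sketch of what "wait for the target management cluster to be up and running" could look like in practice, assuming a cluster named `my-cluster` in the `default` namespace (both placeholders); the exact readiness criteria are, of course, what this thread is trying to pin down:

```bash
# On the bootstrap cluster: wait for the Cluster object to report Ready.
kubectl wait --for=condition=Ready cluster/my-cluster --namespace default --timeout=15m

# Fetch the target cluster's kubeconfig and wait for all of its nodes to be Ready.
clusterctl get kubeconfig my-cluster --namespace default > target.kubeconfig
kubectl --kubeconfig target.kubeconfig wait --for=condition=Ready nodes --all --timeout=15m
```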
Thank you for joining in the discussion, @fabriziopandini! 🙏
To execute this step, I need some criteria that tell me that my cluster is "up and running." If I understand you correctly, you are saying that the criteria may be different for each cluster, and that the clusterctl preflight checks are a necessary subset of the criteria, but not the complete criteria:
I don't understand why this claim must be true. AFAIK, we expect the controllers to be "re-entrant," i.e., to be able to resume no matter when they crash. (In practice, this means that the controllers' actions must be idempotent.) Is there a well-understood difference between the effect of "move" on all cluster objects and the simultaneous crash (and restart) of all provider controllers? (I do not see a difference, but I may be missing something important!) If we can answer this question, I think we can make progress in this thread.
IMO having all the machines in a cluster provisioned is a good signal; if you want to stay on the safe side, you could also check for all the nodes to be in a Ready state.
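To make that concrete, a rough sketch of checking both signals from the management cluster; the cluster name, namespace, and kubeconfig path are placeholders, and this is only one way to approximate "provisioned":

```bash
# List every Machine in the cluster with its phase and nodeRef;
# moving should be safer once all of them are Provisioned/Running with a node assigned.
kubectl get machines -n default -l cluster.x-k8s.io/cluster-name=my-cluster \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name

# Optionally, also confirm that all nodes in the workload cluster are Ready.
kubectl --kubeconfig target.kubeconfig get nodes
```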
Let's start by saying that I'm more than happy to re-open the discussion about redefining the scope of move or changing its requirements if more people are interested in it now (my previous attempt failed). WRT the previous comment, I was not really making a claim, just noting that it might happen that you end up with some orphaned infrastructure, and I fully agree with you that this should not happen if re-entrancy is properly implemented by all the providers. But why take the risk if we can avoid it? Considering the above, the fact that we are focused on the init & move use case only (which is typically a "controlled" sequence), and the intrinsic complexity of move (see e.g. the recent changes for supporting global credentials), I personally think that the existing pre-flight checks are still reasonable, and if it is possible to avoid moving while things are changing, the better (even if this could be seen as overprotection); but as I said, I'm happy to reconsider.
Ok, let's say this is what we recommend. How do we expect a user to get this signal? Do we expect users to implement their own solutions? Can we provide a …? Would adding Conditions in MachineDeployment (#3486) help?
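For what it's worth, one way a user can approximate this signal today is `clusterctl describe cluster`, which rolls up the conditions of the cluster, its control plane, and its machines into a single view (assuming a clusterctl version that supports the `--show-conditions` flag); this is just an observation, not a substitute for a documented criterion:

```bash
# Print the tree of objects for the cluster and the conditions that are not yet True.
clusterctl describe cluster my-cluster --namespace default --show-conditions all
```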
/milestone Next
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What steps did you take and what happened:
I deployed a workload cluster. I then deployed the CAPI controllers on the workload cluster. Finally, I tried to move the workload cluster objects from the bootstrap cluster to the workload cluster. But `clusterctl move` failed one of its preflight checks. The error is created here:

cluster-api/cmd/clusterctl/client/cluster/mover.go, lines 164 to 166 at commit 9bb9c95
In this case, the preflight check failed because the KubeadmControlPlane controller was scaling up the control plane, and the newest Machine did not yet have a NodeRef. It's worth noting that this preflight check can fail under other circumstances as well, for example when a MachineDeployment is scaling up.
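As a purely diagnostic sketch (not part of clusterctl), this is one way to see which Machines are tripping the check, i.e. which ones have no nodeRef yet; it assumes `jq` is available and uses a placeholder namespace:

```bash
# List Machines that do not yet have status.nodeRef set; these are the
# objects the move preflight check complains about.
kubectl get machines -n default -o json \
  | jq -r '.items[] | select(.status.nodeRef == null) | .metadata.name'
```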
What did you expect to happen:
I want to understand why this preflight check is required. The workload cluster is up and running. I think the objects can be safely moved, even if some Machine is missing a NodeRef, but I'm not certain. (@randomvariable mentioned in the 06/30/2021 CAPI meeting that this check might have been introduced to make sure that all control plane replicas had finished deploying.)
This preflight check might fail any time a new Machine is created. So, if the preflight check is required, I want to understand how to proceed when it fails: should I run `clusterctl move` repeatedly until it passes, or is there something I should wait for before running `clusterctl move`?
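If the answer turns out to be "just retry", I imagine it would look something like the loop below; this is purely illustrative, the kubeconfig path is a placeholder, and it is not an endorsed workflow:

```bash
# Naive retry: re-run clusterctl move until its preflight checks pass.
until clusterctl move --to-kubeconfig=target.kubeconfig; do
  echo "preflight checks failed, retrying in 30s..." >&2
  sleep 30
done
```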
Environment:
- Kubernetes version (use `kubectl version`): v1.21.1
- OS (e.g. from `/etc/os-release`):

/kind bug
/area clusterctl
@fabriziopandini