-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mismatch between ControlPlaneReady and Conditions.ControlPlaneReady #7099
Comments
These two fields are set slightly differently, with ControlPlaneReady being based on the I think, though, there's a couple of questions I have with the approach here, so it would be good to understand the use case.
|
|
cc @ykakarap |
If I see correctly AfterControlPlaneInitialized is called once
cluster-api/internal/controllers/cluster/cluster_controller.go Lines 497 to 500 in bbe7c2f
That nodeRef is only set if we were able to get a Node object from the workload cluster (via a client form ClusterCacheTracker)
I could be overlooking something, but looks to me like the workload cluster apiserver has to be reachable before the hook is called. |
After doing a little bit of testing here there is a significant drift between the two markers depending on network configuration: The condition Cluster I'm not sure of the different networking constraints for different providers, but I can imagine a situation where no CNI means limited API server access. We might want to align these values (or possibly deprecate the old status fields). I'm not read into the full context here, but is there any reason to keep the status 'Ready' fields around (vs conditions) in the long term? i.e. apart from backward compatibility etc. We could deprecate without an intention to remove until the next API revision to signal that the condition is the better value to rely on. |
I'm not sure if that is possible. apiserver should not depend on CNI, otherwise it would be a deadlock (at least for "external access" like from the mgmt cluster). One other data point. Our quickstart depends on the APIserver being reachable so that you can deploy CNI at the end of the provisioning. A similar workflow is the default for kubeadm afaik. |
Agreed for most situations (there's enough edge cases here that I'm sure there's some way this could work out. If this is the case should our Right now it's telling us if the nodes are ready and is different from ControlPlaneInitialized and ControlPlaneReady. I think the drift between these two is a definite weakness in the API |
Is this possible with CRD columns? Their feature set is rather limited. |
Seems like it should be possible with jsonPath - I think the important question though is whether to try to close the gap between these two values. Right now the API has a couple of values with near-identical names but different semantics, and in some of the cases they are seemingly wrong based on their names and lead to bad assumptions by consumers. |
I'm not sure if CRD columns support the full jsonPath. It's probably also not a good idea to depend on a certain condition being the first one in the array (not sure if more dynamic array element matching is supported by CRD columns). I had some problems with CRD columns in the past.
IIRC the plan was to migrate eventually to conditions. I'm not sure if we wanted to drop the bools accordingly. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten as i've implemented a similar runtime hook i was also confused about the diff in status field before the runtime hook succeed: status:
conditions:
- lastTransitionTime: "2022-12-28T06:15:56Z"
status: "True"
type: Ready
- lastTransitionTime: "2022-12-28T06:15:56Z"
status: "True"
type: ControlPlaneInitialized
- lastTransitionTime: "2022-12-28T06:15:56Z"
status: "True"
type: ControlPlaneReady
- lastTransitionTime: "2022-12-28T06:13:48Z"
status: "True"
type: InfrastructureReady
- lastTransitionTime: "2022-12-28T06:15:56Z"
message: 'error reconciling the Cluster topology: failed to call extension handlers
for hook "AfterControlPlaneInitialized.hooks.runtime.cluster.x-k8s.io": failed
to call extension handler "wait-for-cni-and-cpi.runtimesdk-test": got failure
response'
reason: TopologyReconcileFailed
severity: Error
status: "False"
type: TopologyReconciled
controlPlaneReady: false
failureDomains:
"1":
controlPlane: true
"2":
controlPlane: true
"3":
controlPlane: true
infrastructureReady: true
observedGeneration: 3
phase: Provisioned status field after the runtime hook succeed: status:
conditions:
- lastTransitionTime: "2022-12-28T06:54:44Z"
status: "True"
type: Ready
- lastTransitionTime: "2022-12-28T06:54:44Z"
status: "True"
type: ControlPlaneInitialized
- lastTransitionTime: "2022-12-28T06:54:44Z"
status: "True"
type: ControlPlaneReady
- lastTransitionTime: "2022-12-28T06:52:29Z"
status: "True"
type: InfrastructureReady
- lastTransitionTime: "2022-12-29T06:09:44Z"
status: "True"
type: TopologyReconciled
controlPlaneReady: true
failureDomains:
"1":
controlPlane: true
"2":
controlPlane: true
"3":
controlPlane: true
infrastructureReady: true
observedGeneration: 3
phase: Provisioned runtime hook impl:hook registration if err := webhookServer.AddExtensionHandler(server.ExtensionHandler{
Hook: runtimehooksv1.AfterControlPlaneInitialized,
Name: "wait-for-cni-and-cpi",
HandlerFunc: lifecycleHandler.WaitForCNIandCPI,
TimeoutSeconds: pointer.Int32(5),
FailurePolicy: toPtr(runtimehooksv1.FailurePolicyFail),
}); err != nil {
setupLog.Error(err, "error adding handler")
os.Exit(1)
} impl: pods, err := clientset.CoreV1().Pods(metav1.NamespaceSystem).List(context.TODO(), metav1.ListOptions{
LabelSelector: labels.SelectorFromSet(labels.Set{"app": "azure-cloud-controller-manager"}).String(),
})
if err != nil {
log.Error(err, "failed to list pods in namespace kube-system")
response.Status = runtimehooksv1.ResponseStatusFailure
response.Message = "failed to list pods in namespace kube-system"
return
}
log.WithName("WaitForCNIandCPI").WithValues("cluster", cluster.Name).Info(fmt.Sprintf("There are %d pods in the cluster", len(pods.Items)))
if len(pods.Items) > 0 {
for _, pod := range pods.Items {
if podutil.IsPodReady(&pod) {
log.WithName("WaitForCNIandCPI").WithValues("cluster", cluster.Name).Info(fmt.Sprintf("%s on %s is up and running", pod.Name, pod.Spec.NodeName))
response.Status = runtimehooksv1.ResponseStatusSuccess
return
}
log.WithName("WaitForCNIandCPI").WithValues("cluster", cluster.Name).Info(fmt.Sprintf("%s on %s is not ready yet", pod.Name, pod.Spec.NodeName))
}
}
response.Status = runtimehooksv1.ResponseStatusFailure
response.Message = fmt.Sprintf("There are %d pods in the cluster", len(pods.Items)) |
@killianmuldoon / @sbueringer comming from the current documentation about blocking hooks and the example implementation i've done above i expect the
Regarding the above comment. |
@bavarianbidi Are you using the I think the broad intention is to deprecate and remove the If you can share - why do you need to check these conditions / status fields when you're using the runtime hook? |
That said - I think it's probably a good idea to move the setting of |
sorry, to long ago that i remember everything from that work 🙈 - thanks for your explanation.
The runtime hook was done during a PoC to check if runtime extension could solve some of our internal issues we have when it came to cluster creation. We thought about creating some additional controllers (reconciling on the cluster CR) which take care of the condition/status field and apply additional (non CPI/CSI/CNI) addons (e.g. monitoring) to a workload cluster. So as you said, |
This sounds like exactly what the hook is intended for! But it's supposed to be non-blocking so the rest of reconciliation - for MachineDeployments etc. can continue while add-ons are initialized. It would be great to get feedback on how the PoC went and what didn't work so we could feed that back into Runtime SDK though 🙂 |
@killianmuldoon I think i have to describe my use-case more detailed. Sorry if my above comments caused some confusion. For our users it's possible to create a
Depending on the list of additional components, this works very smooth. But if the list contains components which want to create a Once all cluster mandatory components (CPI, CNI, CSI + e.g. a valid The problem with the dependency to a storageclass got solved in Kubernetes 1.26 by introducing Retroactive Default StorageClass. So the initial issue we've tried to solve is fixed in k8s but the approach with hooks looks very promising and has a lot of potential for us. With the knowledge we have now and the possible move of From a CAPI core controller point of view it really make sense to have the An alternative approach (but not implemented yet) is to hook into the |
Just to summarize (if I got it all correctly):
|
/triage accepted |
/help |
@fabriziopandini: GuidelinesPlease ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/ /remove-triage accepted |
/priority important-longterm |
Will be addressed via #10897, please take a look at the proposal for more details. Basically we are going to rename the current ControlPlaneReady field to make clear what it stands for. /close |
@sbueringer: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What steps did you take and what happened:
Running CAPD with tilt from HEAD, created a new runtime extension webhoook server and added the function hook for
ControlPlaneInitialized
, I wanted to execute a few steps AFTER the Cluster being ready, first try was to read fromrequest.Cluster.Status.ControlPlaneReady
boolean to check if I can create my pod in the WL cluster. This operations never happened even if the condition of TypeControlPlaneReady
is True eventually.In the last line
request.Cluster.ControlPlaneReady
== false, and all Conditions are true, as noted this does not happen to the InfrastructureReady.Can this be for the fact I'm returning an
ReponseStatusFailure
while waiting for the cp be ready?What did you expect to happen:
To both fields be in sync when the cp is ready.
Environment:
kubectl version
): 1.24/etc/os-release
): Debian (WSL)/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]
The text was updated successfully, but these errors were encountered: