Add available condition to control plane provider contract #3779

ncdc · 2020-10-13T16:21:57Z

User Story

As a developer, I would like to know when the control plane is first reachable for a new cluster, so that I can make decisions about MachineHealthCheck.spec.nodeStartupTimeout for joining control plane machines.

Detailed Description

When trying to determine if a joining control plane machine has exceeded a MachineHealthCheck's nodeStartupTimeout, we need to delay checking for this condition until after the control plane is first reachable. In other words, a joining control plane machine must not be marked unhealthy by the MachineHealthCheck if the apiserver is unreachable / it's still be initialized. After the apiserver is reachable, the nodeStartupTimeout should most likely be measured from when the machine's infrastructure was marked ready.

This proposal is to add a new required status condition to all control plane providers, Available, that indicates the first control plane node has completed bootstrapping and the apiserver is reachable. This is currently present in the KubeadmControlPlane provider, but it's not part of the control plane provider contract.

See also #3026

/kind feature
/area control-plane
/kind api-change

The text was updated successfully, but these errors were encountered:

vincepri · 2020-10-13T16:22:19Z

/milestone v0.4.0

ncdc · 2020-10-13T16:33:47Z

Thinking a bit more, I wonder if what I really want is a Cluster condition for ControlPlaneAvailable, and the ClusterReconciler can set this when it's able to talk to the workload cluster apiserver? I don't know this necessarily or strictly needs to be in the control plane provider?

detiber · 2020-10-13T16:35:55Z

+1 to making this a condition on the Cluster

vincepri · 2020-10-13T16:37:06Z

That's what KubeadmControlPlane actually does today (more or less), we could write a fallback to do this generically by pinging the control plane?

ncdc · 2020-10-13T16:43:41Z

Basically what I think I want/need is to make cluster.status.controlPlaneInitialized a condition (can change the name if needed), and have it be set by the ClusterReconciler. No need to do anything in a control plane provider here. Right?

vincepri · 2020-10-13T16:48:43Z

Initialized was a bit different than availability when we originally discussed it. If our goal is to provide something generic for saying "the control plane is available", consider something like:

Have an ControlPlaneAvailable condition
Have code that checks if the control plane reference conditions has Available set to True
If the control plane reference isn't available, fallback to a simple connection check

What do you think?

ncdc · 2020-10-13T16:55:20Z

Hmm what does "available" mean? 😄 Let me come back later with some more detailed thoughts...

vincepri · 2020-10-13T17:25:10Z

Initialized could probably go away, it's currently set when a control plane is initialized, but doesn't necessarily mean that the control plane is ready to be taking requests or that's reachable.

Available was a way to say the control plane is ready to take requests and it's operational (needs more definition), which comes a bit after initialization step

ncdc · 2020-10-13T17:30:53Z

Going to move some of this discussion over to #3026 and will circle back here later

detiber · 2020-10-13T17:47:49Z

Initialized could probably go away, it's currently set when a control plane is initialized, but doesn't necessarily mean that the control plane is ready to be taking requests or that's reachable.

Yes, it was initially intended mostly as a stop gap for the purposes of allowing migration and continuation of support for v1alpha2 Machine-based control planes to v1alpha3 and KubeadmControlPlane.

Available was a way to say the control plane is ready to take requests and it's operational (needs more definition), which comes a bit after initialization step

My initial thoughts around this were to provide a signal that the ControlPlane (API Server) should be available for requests, so that it could block/unblock operations that require the control plane being available (such as creating worker Machines) in a way that is more accurate than Initialized, but not as strict as Ready. I would expect Available to be a subset of Ready, and should be able to switch between states (unlike Initialized).

I think Available should likely comprise the following:

We have a control plane endpoint (supplied by user or a control plane provider)
The apiserver is available through the endpoint (/healthz reports ok)

Eventually it might also be good to determine if the apiserver is in a state where resources can be created/modified, in which case it might be good to bubble up a signal from a control plane provider (if used). For KCP, this could simply bubble up that etcd has quorum. This could fall back to either assuming things are good or creating/deleting an arbitrary resource in the case that a control plane provider is not used.

fejta-bot · 2021-01-11T18:16:58Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

vincepri · 2021-01-11T18:18:45Z

/lifecycle frozen

vincepri · 2021-10-19T14:17:45Z

/assign @fabriziopandini
/milestone v1.0

fabriziopandini · 2022-05-10T15:57:59Z

/unassign

fabriziopandini · 2022-09-30T19:38:26Z

/triage accepted
/help

k8s-ci-robot · 2022-09-30T19:38:27Z

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2024-01-19T00:58:52Z

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

fabriziopandini · 2024-04-12T14:40:34Z

/priority important-longterm

fabriziopandini · 2024-04-22T12:55:01Z

/assign

I think we made some progress here but we forgot to update the issue.
I will try to re-assess current status ASAP

fabriziopandini · 2024-04-24T11:25:11Z

The original ask

As a developer, I would like to know when the control plane is first reachable for a new cluster, so that I can make decisions about MachineHealthCheck.spec.nodeStartupTimeout for joining control plane machines.

Has been already addressed in

cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go

Lines 149 to 158 in c19ca28

    
           // Use the latest of the 3 times 
        
           comparisonTime := machineCreationTime 
        
           logger.V(3).Info("Determining comparison time", "machineCreationTime", machineCreationTime, "clusterInfraReadyTime", clusterInfraReady, "controlPlaneInitializedTime", controlPlaneInitialized) 
        
           if conditions.IsTrue(t.Cluster, clusterv1.ControlPlaneInitializedCondition) && controlPlaneInitialized != nil && controlPlaneInitialized.Time.After(comparisonTime) { 
        
           	comparisonTime = controlPlaneInitialized.Time 
        
           } 
        
           if conditions.IsTrue(t.Cluster, clusterv1.InfrastructureReadyCondition) && clusterInfraReady != nil && clusterInfraReady.Time.After(comparisonTime) { 
        
           	comparisonTime = clusterInfraReady.Time 
        
           } 
        
           logger.V(3).Info("Using comparison time", "time", comparisonTime)

by taking the latest between the 3 times:

machineCreationTime
Cluster.InfrastructureReady transition to true (comes from infraCluster.status.ready)
Cluster.ControlPlaneInitialized transition to true (comes from ControlPlane.status.initialized, falls back to the first time a CP machine reports nodeRef assigned when using standalone CP machines)

WRT to Initialized vs Available:

As of today Cluster.ControlPlaneInitialized is used everywhere.
Cluster.ControlPlaneInitialized derives from ControlPlane.status.initialized, and it is up to the control plane provider to determine when to set it to true (KCP currently assumes the control plane is initialized when there is a kubeadm-config map in the workload cluster, and this imply that there is a working LB and API server for the workload cluster).
Available condition exists only in KCP, and it has the very same meaning of KCP.status.initialized

Considered that IMO the most practical way forwards is

To close this issue because we don't need to change the contract, ControlPlane.status.initialized is already documented
To open a new issue to drop KCP.Available, or to replace it with KCP.Initialized to get rid of the less used noun

@sbueringer @chrischdi opinions?

sbueringer · 2024-04-25T11:55:54Z

To close this issue because we don't need to change the contract, ControlPlane.status.initialized is already documented

Sounds good

To open a new issue to drop KCP.Available, or to replace it with KCP.Initialized to get rid of the less used noun

Sounds good to me, with the next apiVersion obviously.

The only thing that is not ideal is that this basically goes slightly backwards on our transition from status fields to conditions (the transition doesn't actually happen at the moment though).

fabriziopandini · 2024-04-29T13:27:33Z

/close

The only thing that is not ideal is that this basically goes slightly backwards on our transition from status fields to conditions (the transition doesn't actually happen at the moment though).

Agreed, added a comment on #10532:

My preference goes to introduce a KCP.Initialized condition (that mirrors ControlPlane.status.initialized)

k8s-ci-robot · 2024-04-29T13:27:37Z

@fabriziopandini: Closing this issue.

In response to this:

/close

The only thing that is not ideal is that this basically goes slightly backwards on our transition from status fields to conditions (the transition doesn't actually happen at the moment though).

Agreed, added a comment on #10532:

My preference goes to introduce a KCP.Initialized condition (that mirrors ControlPlane.status.initialized)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/control-plane Issues or PRs related to control-plane lifecycle management kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Oct 13, 2020

k8s-ci-robot added this to the v0.4.0 milestone Oct 13, 2020

This was referenced Oct 13, 2020

MachineHealthCheck issues handling machines right after the control plane becomes accessible #3026

Closed

Add Cluster ControlPlaneInitialized condition #3798

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2021

k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 11, 2021

k8s-ci-robot assigned fabriziopandini Oct 19, 2021

k8s-ci-robot modified the milestones: v0.4, v1.0 Oct 19, 2021

vincepri modified the milestones: v1.0, v1.1 Oct 22, 2021

fabriziopandini modified the milestones: v1.1, v1.2 Feb 3, 2022

k8s-ci-robot unassigned fabriziopandini May 10, 2022

fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022

fabriziopandini removed this from the v1.2 milestone Jul 29, 2022

fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Sep 30, 2022

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024

k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 12, 2024

fabriziopandini removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 16, 2024

k8s-ci-robot assigned fabriziopandini Apr 22, 2024

fabriziopandini mentioned this issue Apr 29, 2024

Deprecate KCP.Available condition #10532

Closed

k8s-ci-robot closed this as completed Apr 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add available condition to control plane provider contract #3779

Add available condition to control plane provider contract #3779

ncdc commented Oct 13, 2020

vincepri commented Oct 13, 2020

ncdc commented Oct 13, 2020

detiber commented Oct 13, 2020

vincepri commented Oct 13, 2020

ncdc commented Oct 13, 2020

vincepri commented Oct 13, 2020

ncdc commented Oct 13, 2020

vincepri commented Oct 13, 2020

ncdc commented Oct 13, 2020

detiber commented Oct 13, 2020

fejta-bot commented Jan 11, 2021

vincepri commented Jan 11, 2021

vincepri commented Oct 19, 2021

fabriziopandini commented May 10, 2022

fabriziopandini commented Sep 30, 2022

k8s-ci-robot commented Sep 30, 2022

k8s-triage-robot commented Jan 19, 2024

fabriziopandini commented Apr 12, 2024

fabriziopandini commented Apr 22, 2024

fabriziopandini commented Apr 24, 2024 •

edited

Loading

sbueringer commented Apr 25, 2024 •

edited

Loading

fabriziopandini commented Apr 29, 2024

k8s-ci-robot commented Apr 29, 2024

Add available condition to control plane provider contract #3779

Add available condition to control plane provider contract #3779

Comments

ncdc commented Oct 13, 2020

vincepri commented Oct 13, 2020

ncdc commented Oct 13, 2020

detiber commented Oct 13, 2020

vincepri commented Oct 13, 2020

ncdc commented Oct 13, 2020

vincepri commented Oct 13, 2020

ncdc commented Oct 13, 2020

vincepri commented Oct 13, 2020

ncdc commented Oct 13, 2020

detiber commented Oct 13, 2020

fejta-bot commented Jan 11, 2021

vincepri commented Jan 11, 2021

vincepri commented Oct 19, 2021

fabriziopandini commented May 10, 2022

fabriziopandini commented Sep 30, 2022

k8s-ci-robot commented Sep 30, 2022

Guidelines

k8s-triage-robot commented Jan 19, 2024

fabriziopandini commented Apr 12, 2024

fabriziopandini commented Apr 22, 2024

fabriziopandini commented Apr 24, 2024 • edited Loading

sbueringer commented Apr 25, 2024 • edited Loading

fabriziopandini commented Apr 29, 2024

k8s-ci-robot commented Apr 29, 2024

fabriziopandini commented Apr 24, 2024 •

edited

Loading

sbueringer commented Apr 25, 2024 •

edited

Loading