Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add available condition to control plane provider contract #3779

Closed
ncdc opened this issue Oct 13, 2020 · 23 comments
Closed

Add available condition to control plane provider contract #3779

ncdc opened this issue Oct 13, 2020 · 23 comments
Assignees
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@ncdc
Copy link
Contributor

ncdc commented Oct 13, 2020

User Story

As a developer, I would like to know when the control plane is first reachable for a new cluster, so that I can make decisions about MachineHealthCheck.spec.nodeStartupTimeout for joining control plane machines.

Detailed Description

When trying to determine if a joining control plane machine has exceeded a MachineHealthCheck's nodeStartupTimeout, we need to delay checking for this condition until after the control plane is first reachable. In other words, a joining control plane machine must not be marked unhealthy by the MachineHealthCheck if the apiserver is unreachable / it's still be initialized. After the apiserver is reachable, the nodeStartupTimeout should most likely be measured from when the machine's infrastructure was marked ready.

This proposal is to add a new required status condition to all control plane providers, Available, that indicates the first control plane node has completed bootstrapping and the apiserver is reachable. This is currently present in the KubeadmControlPlane provider, but it's not part of the control plane provider contract.

See also #3026

/kind feature
/area control-plane
/kind api-change

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/control-plane Issues or PRs related to control-plane lifecycle management kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Oct 13, 2020
@vincepri
Copy link
Member

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Oct 13, 2020
@ncdc
Copy link
Contributor Author

ncdc commented Oct 13, 2020

Thinking a bit more, I wonder if what I really want is a Cluster condition for ControlPlaneAvailable, and the ClusterReconciler can set this when it's able to talk to the workload cluster apiserver? I don't know this necessarily or strictly needs to be in the control plane provider?

@detiber
Copy link
Member

detiber commented Oct 13, 2020

+1 to making this a condition on the Cluster

@vincepri
Copy link
Member

That's what KubeadmControlPlane actually does today (more or less), we could write a fallback to do this generically by pinging the control plane?

@ncdc
Copy link
Contributor Author

ncdc commented Oct 13, 2020

Basically what I think I want/need is to make cluster.status.controlPlaneInitialized a condition (can change the name if needed), and have it be set by the ClusterReconciler. No need to do anything in a control plane provider here. Right?

@vincepri
Copy link
Member

Initialized was a bit different than availability when we originally discussed it. If our goal is to provide something generic for saying "the control plane is available", consider something like:

  • Have an ControlPlaneAvailable condition
  • Have code that checks if the control plane reference conditions has Available set to True
  • If the control plane reference isn't available, fallback to a simple connection check

What do you think?

@ncdc
Copy link
Contributor Author

ncdc commented Oct 13, 2020

Hmm what does "available" mean? 😄 Let me come back later with some more detailed thoughts...

@vincepri
Copy link
Member

Initialized could probably go away, it's currently set when a control plane is initialized, but doesn't necessarily mean that the control plane is ready to be taking requests or that's reachable.

Available was a way to say the control plane is ready to take requests and it's operational (needs more definition), which comes a bit after initialization step

@ncdc
Copy link
Contributor Author

ncdc commented Oct 13, 2020

Going to move some of this discussion over to #3026 and will circle back here later

@detiber
Copy link
Member

detiber commented Oct 13, 2020

Initialized could probably go away, it's currently set when a control plane is initialized, but doesn't necessarily mean that the control plane is ready to be taking requests or that's reachable.

Yes, it was initially intended mostly as a stop gap for the purposes of allowing migration and continuation of support for v1alpha2 Machine-based control planes to v1alpha3 and KubeadmControlPlane.

Available was a way to say the control plane is ready to take requests and it's operational (needs more definition), which comes a bit after initialization step

My initial thoughts around this were to provide a signal that the ControlPlane (API Server) should be available for requests, so that it could block/unblock operations that require the control plane being available (such as creating worker Machines) in a way that is more accurate than Initialized, but not as strict as Ready. I would expect Available to be a subset of Ready, and should be able to switch between states (unlike Initialized).

I think Available should likely comprise the following:

  • We have a control plane endpoint (supplied by user or a control plane provider)
  • The apiserver is available through the endpoint (/healthz reports ok)

Eventually it might also be good to determine if the apiserver is in a state where resources can be created/modified, in which case it might be good to bubble up a signal from a control plane provider (if used). For KCP, this could simply bubble up that etcd has quorum. This could fall back to either assuming things are good or creating/deleting an arbitrary resource in the case that a control plane provider is not used.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2021
@vincepri
Copy link
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 11, 2021
@vincepri
Copy link
Member

/assign @fabriziopandini
/milestone v1.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.4, v1.0 Oct 19, 2021
@vincepri vincepri modified the milestones: v1.0, v1.1 Oct 22, 2021
@fabriziopandini fabriziopandini modified the milestones: v1.1, v1.2 Feb 3, 2022
@fabriziopandini
Copy link
Member

/unassign

@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Copy link
Member

/triage accepted
/help

@k8s-ci-robot
Copy link
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Sep 30, 2022
@k8s-triage-robot
Copy link

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@fabriziopandini
Copy link
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 12, 2024
@fabriziopandini fabriziopandini removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 16, 2024
@fabriziopandini
Copy link
Member

/assign

I think we made some progress here but we forgot to update the issue.
I will try to re-assess current status ASAP

@fabriziopandini
Copy link
Member

fabriziopandini commented Apr 24, 2024

The original ask

As a developer, I would like to know when the control plane is first reachable for a new cluster, so that I can make decisions about MachineHealthCheck.spec.nodeStartupTimeout for joining control plane machines.

Has been already addressed in

// Use the latest of the 3 times
comparisonTime := machineCreationTime
logger.V(3).Info("Determining comparison time", "machineCreationTime", machineCreationTime, "clusterInfraReadyTime", clusterInfraReady, "controlPlaneInitializedTime", controlPlaneInitialized)
if conditions.IsTrue(t.Cluster, clusterv1.ControlPlaneInitializedCondition) && controlPlaneInitialized != nil && controlPlaneInitialized.Time.After(comparisonTime) {
comparisonTime = controlPlaneInitialized.Time
}
if conditions.IsTrue(t.Cluster, clusterv1.InfrastructureReadyCondition) && clusterInfraReady != nil && clusterInfraReady.Time.After(comparisonTime) {
comparisonTime = clusterInfraReady.Time
}
logger.V(3).Info("Using comparison time", "time", comparisonTime)
by taking the latest between the 3 times:

  • machineCreationTime
  • Cluster.InfrastructureReady transition to true (comes from infraCluster.status.ready)
  • Cluster.ControlPlaneInitialized transition to true (comes from ControlPlane.status.initialized, falls back to the first time a CP machine reports nodeRef assigned when using standalone CP machines)

WRT to Initialized vs Available:

  • As of today Cluster.ControlPlaneInitialized is used everywhere.
  • Cluster.ControlPlaneInitialized derives from ControlPlane.status.initialized, and it is up to the control plane provider to determine when to set it to true (KCP currently assumes the control plane is initialized when there is a kubeadm-config map in the workload cluster, and this imply that there is a working LB and API server for the workload cluster).
  • Available condition exists only in KCP, and it has the very same meaning of KCP.status.initialized

Considered that IMO the most practical way forwards is

  • To close this issue because we don't need to change the contract, ControlPlane.status.initialized is already documented
  • To open a new issue to drop KCP.Available, or to replace it with KCP.Initialized to get rid of the less used noun

@sbueringer @chrischdi opinions?

@sbueringer
Copy link
Member

sbueringer commented Apr 25, 2024

To close this issue because we don't need to change the contract, ControlPlane.status.initialized is already documented

Sounds good

To open a new issue to drop KCP.Available, or to replace it with KCP.Initialized to get rid of the less used noun

Sounds good to me, with the next apiVersion obviously.

The only thing that is not ideal is that this basically goes slightly backwards on our transition from status fields to conditions (the transition doesn't actually happen at the moment though).

@fabriziopandini
Copy link
Member

/close

The only thing that is not ideal is that this basically goes slightly backwards on our transition from status fields to conditions (the transition doesn't actually happen at the moment though).

Agreed, added a comment on #10532:

My preference goes to introduce a KCP.Initialized condition (that mirrors ControlPlane.status.initialized)

@k8s-ci-robot
Copy link
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close

The only thing that is not ideal is that this basically goes slightly backwards on our transition from status fields to conditions (the transition doesn't actually happen at the moment though).

Agreed, added a comment on #10532:

My preference goes to introduce a KCP.Initialized condition (that mirrors ControlPlane.status.initialized)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests

8 participants