
⚠️ MHC: require control plane initialized, cluster infrastructure ready for node startup timeout #3752

Conversation

ncdc
Contributor

@ncdc ncdc commented Oct 6, 2020

What this PR does / why we need it:
Add ControlPlaneInitializedCondition.

Remove cluster.status.controlPlaneInitialized.

Change the MachineHealthCheck logic to require the Cluster's "control
plane initialized" and "infrastructure ready" conditions to be true
before proceeding to determine if a target is unhealthy.

We can't look just at a Machine's last updated time when determining if
the Machine has exceeded the node startup timeout. A node can't
bootstrap until after the cluster's infrastructure is ready and the
control plane is initialized, and we need to base node startup timeout
on the latest of machine creation time, control plane initialized time,
and cluster infrastructure ready time.
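
To make the timing rule concrete, here is a minimal, self-contained Go sketch (not the PR's actual implementation; the function and variable names are hypothetical) that measures the node startup timeout from the latest of the three timestamps:

```go
package main

import (
	"fmt"
	"time"
)

// latestOf returns the most recent of the given timestamps.
func latestOf(times ...time.Time) time.Time {
	var latest time.Time
	for _, t := range times {
		if t.After(latest) {
			latest = t
		}
	}
	return latest
}

// nodeStartupTimedOut reports whether the node startup timeout has elapsed,
// measured from the latest of machine creation, control plane initialized,
// and cluster infrastructure ready times.
func nodeStartupTimedOut(machineCreated, controlPlaneInitialized, infraReady time.Time, timeout time.Duration, now time.Time) bool {
	start := latestOf(machineCreated, controlPlaneInitialized, infraReady)
	return now.Sub(start) > timeout
}

func main() {
	now := time.Now()
	timedOut := nodeStartupTimedOut(
		now.Add(-20*time.Minute), // machine created
		now.Add(-10*time.Minute), // control plane initialized
		now.Add(-5*time.Minute),  // cluster infrastructure ready
		10*time.Minute,           // node startup timeout
		now,
	)
	fmt.Println(timedOut) // false: only 5 minutes have passed since the latest event
}
```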

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3026
Fixes #3798

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 6, 2020
@vincepri
Member

vincepri commented Oct 6, 2020

/assign
/milestone v0.3.11

Going to review later today

@k8s-ci-robot k8s-ci-robot added this to the v0.3.11 milestone Oct 6, 2020
@ncdc ncdc force-pushed the mhc-use-conditions-for-node-startup-timeout branch from 9e32b92 to 3790705 on October 6, 2020 19:27
"sigs.k8s.io/controller-runtime/pkg/log"
)

func TestGetTargetsFromMHC(t *testing.T) {
Contributor Author

I've removed TestGetTargetsFromMHC as it's not really feasible to test in isolation with a live client (from envtest) when all the controllers are running (from suite_test). I have tried to move all the test cases from here into TestMachineHealthCheck_Reconcile.

reconciler := &MachineHealthCheckReconciler{
Client: k8sClient,
Contributor Author

I've removed setting the client because healthCheckTargets() doesn't need one.

@ncdc ncdc force-pushed the mhc-use-conditions-for-node-startup-timeout branch from 3790705 to a79312d on October 6, 2020 19:29
// status not updated yet
if t.Machine.Status.LastUpdated == nil {
return false, timeoutForMachineToHaveNode
// TODO change to checking ControlPlaneReadyCondition in v1alpha4, when cluster.spec.controlPlaneRef will be required.
Member

this should probably be the ControlPlaneAvailable condition (when the first kubeadm API server is up and running), assuming we are going to make this part of the contract (right now, AFAIK, it exists only in KCP).

Contributor Author

Yeah, that makes more sense. I'll change it. And we should definitely make it a control plane provider contract requirement in v1alpha4.

Member

Has it been documented/decided anywhere that controlPlaneRef will be a requirement? It seems like that will create a barrier to phased adoption of Cluster API, though I suppose one could create the equivalent of a null controlplane provider

Member

Just to be clear, I'm not suggesting that we keep backward compat support for non-control plane provider managed control plane machines, just questioning the hard requirement on having a control plane provider implementation in place.

Contributor Author

@vincepri can you file an issue proposing controlPlaneRef as a requirement in v1alpha4?

// TODO change to checking ControlPlaneReadyCondition in v1alpha4, when cluster.spec.controlPlaneRef will be required.
// We can't do this yet because ControlPlaneReadyCondition is only set when you're using a control plane provider,
// and that is optional in v1alpha3.
if !conditions.Has(t.Cluster, clusterv1.InfrastructureReadyCondition) || conditions.IsFalse(t.Cluster, clusterv1.InfrastructureReadyCondition) {
Member

If I understand the problem correctly, this is accurate only for the first control plane machine; for all the other machines the actual startup timer starts when status.ControlPlaneInitialized==True (or when the ControlPlaneAvailable condition is True, as suggested above).
However, I'm wondering if this assumption is somehow tied to how kubeadm works and can't be used as a generic rule...

Contributor Author

Are you talking about KCP/MHC remediation?

In this PR's current form, given that MHC only remediates non-control plane Machines, this should be accurate for all non-control plane Machines.

Or am I misunderstanding what you're saying?

Member

How is this going to impact the remediation support for control plane nodes?

Contributor Author

I need to take that into account: when we switched to using conditions for MHC, we got rid of the logic that restricted MHC to Machines owned by MachineSets and excluded control plane machines.

Member

@vincepri vincepri left a comment

LGTM

Comment on lines 24 to 29
. "github.com/onsi/gomega"
capierrors "sigs.k8s.io/cluster-api/errors"
"sigs.k8s.io/cluster-api/util/patch"

corev1 "k8s.io/api/core/v1"
Member

Suggested change
. "github.com/onsi/gomega"
capierrors "sigs.k8s.io/cluster-api/errors"
"sigs.k8s.io/cluster-api/util/patch"
corev1 "k8s.io/api/core/v1"

@@ -175,6 +177,54 @@ func TestMachineHealthCheck_Reconcile(t *testing.T) {
))
})

t.Run("it ignores Machines not matching the label selector", func(t *testing.T) {
g := NewWithT(t)
ctx := context.TODO()
Member

Suggested change
ctx := context.TODO()

There should be a global context already defined; if there isn't, you can definitely ignore this.


@ncdc ncdc marked this pull request as draft October 8, 2020 19:48
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 8, 2020
@vincepri vincepri changed the base branch from master to release-0.3 October 9, 2020 19:01
@vincepri
Member

vincepri commented Oct 9, 2020

Changed the base branch to release-0.3; we'll have to forward-port these changes to v0.4 later.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 9, 2020
@ncdc
Contributor Author

ncdc commented Oct 13, 2020

@vincepri any objections to resetting to main? I can backport to 0.3 after it's merged.

@vincepri
Member

Sure, up to you

@ncdc ncdc changed the base branch from release-0.3 to master October 13, 2020 14:55
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 13, 2020

Comment on lines 129 to 130
infraReadyTime := conditions.GetLastTransitionTime(t.Cluster, clusterv1.InfrastructureReadyCondition)
if infraReadyTime.Add(timeoutForMachineToHaveNode).Before(now) {
Member

How would this affect Machines that were provisioned long after the Cluster infrastructure was initialized? Should this be updated to compare both the time since infraReadyTime and the Machine's last updated time against the unhealthy duration, to account for both cases?

Alternatively, do we need a net-new condition on the Machine to indicate that provisioning has started, which could be checked against accurately rather than the approximations we are currently using?
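
A rough sketch of the comparison being suggested here (a hypothetical helper using plain time.Time values rather than the actual condition utilities), where the timeout is measured from whichever timestamp is more recent:

```go
package health

import "time"

// startupTimedOut reports whether the node startup timeout has elapsed,
// measured from the more recent of the infrastructure-ready transition time
// and the Machine's last updated time, so Machines provisioned long after
// the Cluster infrastructure became ready are not flagged prematurely.
func startupTimedOut(infraReadyTime, machineLastUpdated time.Time, timeout time.Duration, now time.Time) bool {
	reference := infraReadyTime
	if machineLastUpdated.After(reference) {
		reference = machineLastUpdated
	}
	return reference.Add(timeout).Before(now)
}
```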

Contributor Author

Yeah I need to take that into account. This is on hold until I file a new issue about adding a condition for control plane available.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 13, 2020
@vincepri
Member

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.11, v0.4.0 Nov 10, 2020
@vincepri
Member

/retest

@vincepri
Member

/hold

for @fabriziopandini review

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2021
@vincepri
Member

vincepri commented Mar 4, 2021

/retest

@vincepri
Member

vincepri commented Mar 4, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2021
@ncdc ncdc force-pushed the mhc-use-conditions-for-node-startup-timeout branch from 8a85616 to 0ed9949 on March 4, 2021 18:50
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2021
Member

@vincepri vincepri left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2021
@ncdc ncdc force-pushed the mhc-use-conditions-for-node-startup-timeout branch from 0ed9949 to 20fbe52 on March 5, 2021 18:13
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2021
@ncdc ncdc force-pushed the mhc-use-conditions-for-node-startup-timeout branch from 20fbe52 to 44f71b3 on March 5, 2021 20:20
@ncdc
Contributor Author

ncdc commented Mar 8, 2021

@fabriziopandini @vincepri 🤞 this should be ready to go

@ncdc ncdc force-pushed the mhc-use-conditions-for-node-startup-timeout branch from 44f71b3 to ae92f28 on March 9, 2021 22:30
@ncdc ncdc force-pushed the mhc-use-conditions-for-node-startup-timeout branch from ae92f28 to 1ff5a97 on March 11, 2021 15:11
ncdc added 3 commits March 11, 2021 10:45
Change the MachineHealthCheck logic to require the Cluster's "control
plane initialized" and "infrastructure ready" conditions to be true
before proceeding to determine if a target is unhealthy.

We can't look just at a Machine's last updated time when determining if
the Machine has exceeded the node startup timeout.  A node can't
bootstrap until after the cluster's infrastructure is ready and the
control plane is initialized, and we need to base node startup timeout
on the latest of machine creation time, control plane initialized time,
and cluster infrastructure ready time.

Signed-off-by: Andy Goldstein <[email protected]>
Add optional "hub after" and "spoke after" mutation functions to the
conversion tests to support things like removing fields that were added
during the conversion process that will cause the equality check to
fail.

Signed-off-by: Andy Goldstein <[email protected]>
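
To illustrate the idea of the "hub after" and "spoke after" hooks, here is a rough, self-contained Go sketch (not the actual Cluster API conversion test helpers; the types and names are hypothetical) of a round-trip test that strips a field added during conversion before the equality check:

```go
package conversion

import (
	"reflect"
	"testing"
)

// Hypothetical hub and spoke types, for illustration only.
type hub struct{ Name string }

type spoke struct {
	Name       string
	Annotation string // imagine conversion stashes extra data here
}

// roundTrip converts spoke -> hub -> spoke. The hub -> spoke step adds an
// annotation, mimicking conversions that store data they cannot represent.
func roundTrip(in spoke) spoke {
	h := hub{Name: in.Name}
	return spoke{Name: h.Name, Annotation: "conversion-data"}
}

// TestRoundTrip shows how an optional "spoke after" mutation hook can remove
// fields that were added during conversion, so the equality check compares
// only the fields the test cares about.
func TestRoundTrip(t *testing.T) {
	spokeAfter := func(s *spoke) { s.Annotation = "" } // optional mutation hook

	in := spoke{Name: "cluster-1"}
	out := roundTrip(in)
	spokeAfter(&out)

	if !reflect.DeepEqual(in, out) {
		t.Errorf("round trip mismatch: got %+v, want %+v", out, in)
	}
}
```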
@ncdc ncdc force-pushed the mhc-use-conditions-for-node-startup-timeout branch from 1ff5a97 to ff383ec on March 11, 2021 15:46
@k8s-ci-robot
Contributor

k8s-ci-robot commented Mar 11, 2021

@ncdc: The following test failed, say /retest to rerun all failed tests:

Test name: pull-cluster-api-apidiff-main
Commit: ff383ec
Rerun command: /test pull-cluster-api-apidiff-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ncdc
Contributor Author

ncdc commented Mar 11, 2021

/test pull-cluster-api-test-main

@ncdc
Contributor Author

ncdc commented Mar 11, 2021

@vincepri @fabriziopandini alright, looks like the tests are back to passing. PTAL again 😄

Member

@vincepri vincepri left a comment

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1f07103 into kubernetes-sigs:master Mar 11, 2021