⚠️ MHC: require control plane initialized, cluster infrastructure ready for node startup timeout #3752
Conversation
/assign
Going to review later today
Force-pushed from 9e32b92 to 3790705
"sigs.k8s.io/controller-runtime/pkg/log" | ||
) | ||
|
||
func TestGetTargetsFromMHC(t *testing.T) { |
I've removed TestGetTargetsFromMHC as it's not really feasible to test in isolation with a live client (from envtest) when all the controllers are running (from suite_test). I have tried to move all the test cases from here into TestMachineHealthCheck_Reconcile.
reconciler := &MachineHealthCheckReconciler{
	Client: k8sClient,
I've removed setting the client because healthCheckTargets() doesn't need one.
Force-pushed from 3790705 to a79312d
// status not updated yet
if t.Machine.Status.LastUpdated == nil {
	return false, timeoutForMachineToHaveNode
// TODO change to checking ControlPlaneReadyCondition in v1alpha4, when cluster.spec.controlPlaneRef will be required.
This should probably be the ControlPlaneAvailable condition (set when the first kubeadm API server is up and running), assuming we are going to make this part of the contract (right now, AFAIK, it exists only in KCP).
Yeah, that makes more sense. I'll change it. And we should definitely make it a control plane provider contract requirement in v1alpha4.
Has it been documented/decided anywhere that controlPlaneRef will be a requirement? It seems like that would create a barrier to phased adoption of Cluster API, though I suppose one could create the equivalent of a null control plane provider.
Just to be clear, I'm not suggesting that we keep backward-compatibility support for control plane machines that aren't managed by a control plane provider; I'm just questioning the hard requirement on having a control plane provider implementation in place.
@vincepri can you file an issue proposing controlPlaneRef as a requirement in v1alpha4?
// TODO change to checking ControlPlaneReadyCondition in v1alpha4, when cluster.spec.controlPlaneRef will be required.
// We can't do this yet because ControlPlaneReadyCondition is only set when you're using a control plane provider,
// and that is optional in v1alpha3.
if !conditions.Has(t.Cluster, clusterv1.InfrastructureReadyCondition) || conditions.IsFalse(t.Cluster, clusterv1.InfrastructureReadyCondition) {
If I got the problem right, this is accurate only for the first control plane machine; for all the other machines the startup clock actually starts when status.ControlPlaneInitialized == true (or when the ControlPlaneAvailable condition is true, as suggested above).
However, I'm wondering if this assumption is somehow tied to how kubeadm works and can't be used as a generic rule...
Are you talking about KCP/MHC remediation?
In this PR's current form, given that MHC only remediates non-control plane Machines, this should be accurate for all non-control plane Machines.
Or am I misunderstanding what you're saying?
How is this going to impact the remediation support for control plane nodes?
I need to take that into account: when we switched MHC to using conditions, we got rid of the logic that restricted MHC to Machines owned by MachineSets and excluded control plane machines.
LGTM
. "github.com/onsi/gomega"
capierrors "sigs.k8s.io/cluster-api/errors"
"sigs.k8s.io/cluster-api/util/patch"

corev1 "k8s.io/api/core/v1"
Suggested change:
. "github.com/onsi/gomega"
capierrors "sigs.k8s.io/cluster-api/errors"
"sigs.k8s.io/cluster-api/util/patch"
corev1 "k8s.io/api/core/v1"
@@ -175,6 +177,54 @@ func TestMachineHealthCheck_Reconcile(t *testing.T) {
	))
})

t.Run("it ignores Machines not matching the label selector", func(t *testing.T) {
	g := NewWithT(t)
	ctx := context.TODO()
Suggested change: remove
ctx := context.TODO()
There should be a global context already defined; if there isn't, you can definitely ignore this.
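A minimal sketch of the suggested pattern, assuming the test suite defines a single shared context at package level (the package and test names below are illustrative, not the repository's actual test scaffolding):

package controllers_test

import (
	"context"
	"testing"

	. "github.com/onsi/gomega"
)

// A package-level context shared by every test in the package, typically
// defined once in suite_test.go.
var ctx = context.TODO()

func TestUsesSharedContext(t *testing.T) {
	t.Run("uses the shared context", func(t *testing.T) {
		g := NewWithT(t)
		// Use ctx for client calls instead of creating a local context.TODO().
		g.Expect(ctx).ToNot(BeNil())
	})
}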
Changed the base branch to be release-0.3, we'll have to forward-port these changes to v0.4 later
@vincepri any objections to resetting this to main? I can backport to 0.3 after it's merged.
Sure, up to you
infraReadyTime := conditions.GetLastTransitionTime(t.Cluster, clusterv1.InfrastructureReadyCondition)
if infraReadyTime.Add(timeoutForMachineToHaveNode).Before(now) {
How would this affect Machines that were provisioned long after the Cluster infrastructure was initialized? Should this be updated to compare both the time since infraReadyTime and the Machine's last updated time against the unhealthy duration, to account for both cases?
Alternatively, do we need a net-new condition on the Machine indicating that provisioning has started, which could be checked accurately instead of the approximations we are currently using?
Yeah I need to take that into account. This is on hold until I file a new issue about adding a condition for control plane available.
/milestone v0.4.0
/retest
/hold for @fabriziopandini review
/retest
/hold cancel
Force-pushed from 8a85616 to 0ed9949
/lgtm
Force-pushed from 0ed9949 to 20fbe52
Force-pushed from 20fbe52 to 44f71b3
@fabriziopandini @vincepri 🤞 this should be ready to go
Force-pushed from 44f71b3 to ae92f28
Signed-off-by: Andy Goldstein <[email protected]>
Force-pushed from ae92f28 to 1ff5a97
Signed-off-by: Andy Goldstein <[email protected]>
Change the MachineHealthCheck logic to require the Cluster's "control plane initialized" and "infrastructure ready" conditions to be true before proceeding to determine if a target is unhealthy. We can't look just at a Machine's last updated time when determining if the Machine has exceeded the node startup timeout. A node can't bootstrap until after the cluster's infrastructure is ready and the control plane is initialized, so we need to base the node startup timeout on the latest of machine creation time, control plane initialized time, and cluster infrastructure ready time. Signed-off-by: Andy Goldstein <[email protected]>
Add optional "hub after" and "spoke after" mutation functions to the conversion tests to support things like removing fields that were added during the conversion process that will cause the equality check to fail. Signed-off-by: Andy Goldstein <[email protected]>
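As a rough illustration of the idea in that commit message (a standalone sketch with made-up types, not the PR's actual conversion test helpers): a spoke to hub to spoke round trip can fail the final equality check when the conversion itself injects data, such as an annotation used to preserve fields the older API version can't represent, and an optional "spoke after" mutation hook gives the test a place to strip that data before comparing.

package main

import (
	"fmt"
	"reflect"
)

// SpokeObj and HubObj stand in for an older and a newer API version.
type SpokeObj struct {
	Name        string
	Annotations map[string]string
}

type HubObj struct {
	Name        string
	Annotations map[string]string
}

func spokeToHub(s SpokeObj) HubObj {
	h := HubObj{Name: s.Name, Annotations: map[string]string{}}
	for k, v := range s.Annotations {
		h.Annotations[k] = v
	}
	// The conversion records lossy data in an annotation (an assumption made
	// for this sketch).
	h.Annotations["conversion-data"] = "preserved"
	return h
}

func hubToSpoke(h HubObj) SpokeObj {
	s := SpokeObj{Name: h.Name, Annotations: map[string]string{}}
	for k, v := range h.Annotations {
		s.Annotations[k] = v
	}
	return s
}

// roundTripEqual runs spoke -> hub -> spoke and applies the optional
// "spoke after" hook before comparing with the original.
func roundTripEqual(in SpokeObj, spokeAfter func(*SpokeObj)) bool {
	out := hubToSpoke(spokeToHub(in))
	if spokeAfter != nil {
		spokeAfter(&out)
	}
	return reflect.DeepEqual(in, out)
}

func main() {
	orig := SpokeObj{Name: "mhc", Annotations: map[string]string{}}
	// Without the hook, the annotation added during conversion breaks equality.
	fmt.Println("no hook:", roundTripEqual(orig, nil)) // false
	// With the hook, the conversion-added annotation is removed first.
	fmt.Println("with hook:", roundTripEqual(orig, func(s *SpokeObj) {
		delete(s.Annotations, "conversion-data")
	})) // true
}

A matching "hub after" hook works the same way for the hub to spoke to hub direction.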
Force-pushed from 1ff5a97 to ff383ec
@ncdc: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/test pull-cluster-api-test-main
@vincepri @fabriziopandini alright, looks like the tests are back to passing. PTAL again 😄
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: vincepri
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
Add ControlPlaneInitializedCondition.
Remove cluster.status.controlPlaneInitialized.
Change the MachineHealthCheck logic to require the Cluster's "control
plane initialized" and "infrastructure ready" conditions to be true
before proceeding to determine if a target is unhealthy.
We can't look just at a Machine's last updated time when determining if
the Machine has exceeded the node startup timeout. A node can't
bootstrap until after the cluster's infrastructure is ready and the
control plane is initialized, and we need to base node startup timeout
on the latest of machine creation time, control plane initialized time,
and cluster infrastructure ready time.
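A minimal sketch of that logic, using hypothetical types and helper names rather than the PR's actual implementation: the target is not considered failed until both conditions are true, and the node-startup clock starts at the latest of the three timestamps.

package main

import (
	"fmt"
	"time"
)

// Hypothetical inputs for the sketch; the real code reads these from the
// Cluster's conditions and from the Machine object.
type clusterTimes struct {
	infrastructureReady     *time.Time // last transition of the "infrastructure ready" condition
	controlPlaneInitialized *time.Time // last transition of the "control plane initialized" condition
}

type machineInfo struct {
	creationTime time.Time
	hasNode      bool
}

// nodeStartupTimedOut reports whether the Machine has exceeded the node
// startup timeout and, if not, how long until it should be checked again.
func nodeStartupTimedOut(c clusterTimes, m machineInfo, timeout time.Duration, now time.Time) (bool, time.Duration) {
	if m.hasNode {
		return false, 0
	}
	// A node can't bootstrap before the cluster infrastructure is ready and
	// the control plane is initialized, so don't start the clock until both
	// conditions are true.
	if c.infrastructureReady == nil || c.controlPlaneInitialized == nil {
		return false, timeout
	}
	// Start the clock at the latest of machine creation, control plane
	// initialized, and cluster infrastructure ready.
	start := m.creationTime
	if c.controlPlaneInitialized.After(start) {
		start = *c.controlPlaneInitialized
	}
	if c.infrastructureReady.After(start) {
		start = *c.infrastructureReady
	}
	deadline := start.Add(timeout)
	if now.After(deadline) {
		return true, 0
	}
	return false, deadline.Sub(now)
}

func main() {
	now := time.Now()
	infraReady := now.Add(-5 * time.Minute)
	cpInit := now.Add(-3 * time.Minute)
	// A Machine created long before the infrastructure was ready: the clock
	// starts at control plane initialization, not at machine creation.
	m := machineInfo{creationTime: now.Add(-30 * time.Minute)}
	timedOut, recheck := nodeStartupTimedOut(
		clusterTimes{infrastructureReady: &infraReady, controlPlaneInitialized: &cpInit},
		m, 10*time.Minute, now)
	fmt.Println(timedOut, recheck) // false 7m0s
}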
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3026
Fixes #3798