🏃[KCP] combine health checks of scale up and down #2849

sedefsavas · 2020-04-02T15:17:34Z

What this PR does / why we need it:
This PR moves the control plane and etcd health checks from scale up/down into reconcile.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Related to #2818 and #2753

/kind cleanup
/area control-plane

k8s-ci-robot · 2020-04-02T15:17:59Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sedefsavas
To complete the pull request process, please assign davidewatson
You can assign the PR to them by writing /assign @davidewatson in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

controlplane/kubeadm/controllers/controller.go

detiber · 2020-04-02T17:55:54Z

controlplane/kubeadm/controllers/scale.go

-		r.recorder.Eventf(kcp, corev1.EventTypeWarning, "ControlPlaneUnhealthy",
-			"Waiting for control plane to pass control plane health check before removing a control plane machine: %v", err)
-		return ctrl.Result{}, &capierrors.RequeueAfterError{RequeueAfter: healthCheckFailedRequeueAfter}
-	}


These checks (in both scale up and scale down) were also gating the upgrade workflow, whereas the new general check is currently triggered only during normal scale up/scale down operations.

Good catch! So my new assumption is that as long as the control plane is initialized, we want to run these health checks. I moved the check before the upgrade.
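The ordering described in this reply can be sketched roughly as follows. This is a simplified model, not the controller's code: `reconcileOrder`, the `step` type, and `errUnhealthy` are hypothetical names introduced for illustration, and the boolean inputs stand in for the real health checks and upgrade detection.

```go
package main

import "errors"

// step names a phase of a hypothetical, simplified reconcile loop.
type step string

// errUnhealthy stands in for a failed control plane or etcd health check.
var errUnhealthy = errors.New("control plane unhealthy")

// reconcileOrder returns the phases a simplified reconcile would run, in order.
// machines is the number of owned control plane machines, healthy is the
// outcome of the combined health check, and needsUpgrade selects the upgrade
// path over the scale path. All three inputs are illustrative assumptions.
func reconcileOrder(machines int, healthy, needsUpgrade bool) ([]step, error) {
	var ran []step
	// Once the control plane is initialized (at least one machine),
	// the health check gates everything that follows, including upgrade.
	if machines > 0 {
		ran = append(ran, "healthCheck")
		if !healthy {
			return ran, errUnhealthy
		}
	}
	if needsUpgrade {
		ran = append(ran, "upgrade")
		return ran, nil
	}
	ran = append(ran, "scale")
	return ran, nil
}
```

With three machines and a failing check, only the health check phase runs, so the upgrade is gated as it was before the move. With zero machines the check is skipped entirely and the scale phase runs.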

vincepri · 2020-04-02T18:12:45Z

controlplane/kubeadm/controllers/controller.go

@@ -320,3 +329,24 @@ func (r *KubeadmControlPlaneReconciler) ClusterToKubeadmControlPlane(o handler.M

 	return nil
 }
+
+func (r *KubeadmControlPlaneReconciler) generalHealthCheck(ctx context.Context, cluster *clusterv1.Cluster, kcp *controlplanev1.KubeadmControlPlane, controlPlane *internal.ControlPlane) (ctrl.Result, error) {


Suggested change

-func (r *KubeadmControlPlaneReconciler) generalHealthCheck(ctx context.Context, cluster *clusterv1.Cluster, kcp *controlplanev1.KubeadmControlPlane, controlPlane *internal.ControlPlane) (ctrl.Result, error) {
+func (r *KubeadmControlPlaneReconciler) checkHealth(ctx context.Context, cluster *clusterv1.Cluster, kcp *controlplanev1.KubeadmControlPlane, controlPlane *internal.ControlPlane) (ctrl.Result, error) {

vincepri · 2020-04-02T18:13:52Z

controlplane/kubeadm/controllers/controller.go

+	numMachines := len(ownedMachines)
+	// If the control plane is initialized, wait for health checks to pass to continue.
+	if numMachines > 0 {
+		result, err := r.generalHealthCheck(ctx, cluster, kcp, controlPlane)
+		if err != nil {
+			return result, err
+		}
+	}


How would this work when we can do remediation? Let's say a Machine isn't responding, if the health check fails, we won't create a new one?

sedefsavas · 2020-04-02T19:15:24Z

Closing this PR due to the concerns raised about returning early before a possible remediation.
Instead, the control plane and etcd health checks will be combined into one function that is called from both the scale up and scale down functions, so that each can handle errors differently. See #2841.

[KCP] combine health checks of scale up and down

2ef3bc7

k8s-ci-robot added labels Apr 2, 2020: cncf-cla: yes (the PR's author has signed the CNCF CLA), size/L (a PR that changes 100-499 lines, ignoring generated files), kind/cleanup (an issue or PR related to cleaning up code, process, or technical debt)

k8s-ci-robot requested review from justinsb and ncdc April 2, 2020 15:17

sedefsavas mentioned this pull request Apr 2, 2020

🏃[KCP] Recover from a manual machine deletion #2841

Merged

k8s-ci-robot added the area/control-plane label (issues or PRs related to control-plane lifecycle management) Apr 2, 2020

vincepri reviewed Apr 2, 2020

detiber reviewed Apr 2, 2020

Move health check before upgrade

0682a9d

sedefsavas force-pushed the kcphealthcheck branch from 1319186 to 0682a9d Compare April 2, 2020 18:09

vincepri reviewed Apr 2, 2020

sedefsavas closed this Apr 2, 2020
