
CAPA control plane machine health checks #1978

Closed
alex-dabija opened this issue Feb 8, 2023 · 8 comments
Labels
area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service kind/story provider/cluster-api-aws Cluster API based running on AWS topic/capi

Comments

@alex-dabija

alex-dabija commented Feb 8, 2023

Story

As a cluster admin, I want the control plane nodes to be recreated if basic machine health checks fail, in order to improve the stability of CAPA clusters.

Towards epic.

Background

CAPA clusters with control plane machines in invalid states fail to recover because there are no health checks configured.

CAPI supports machine health checks for control plane nodes and they need to be configured for CAPA clusters.
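
For orientation, a MachineHealthCheck targeting the control plane machines could look roughly like the sketch below; the name, namespace, cluster name, and timeouts are placeholders rather than the values we will actually ship:

cat <<'EOF' | kubectl apply -f -
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-control-plane-unhealthy-5m    # hypothetical name
  namespace: org-example                      # hypothetical namespace
spec:
  clusterName: example                        # hypothetical cluster name
  maxUnhealthy: 40%
  nodeStartupTimeout: 20m
  selector:
    matchLabels:
      # CAPI puts this label on control plane machines
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
EOF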

Links

@alex-dabija alex-dabija added area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service team/hydra topic/capi provider/cluster-api-aws Cluster API based running on AWS kind/story labels Feb 9, 2023
@alex-dabija alex-dabija moved this to Ready Soon (<4 weeks) in Roadmap Feb 9, 2023
@bavarianbidi

My 2 cents:

Contrary to my initial concerns about having MachineHealthChecks for KCP, the current CAPI implementation strives not to break an existing etcd cluster (see the canSafelyRemoveEtcdMember func in the KCP remediation code).

The only "critical" path I still see: if an etcd cluster is "broken" and a human is trying to fix it with manual steps, the MachineHealthCheck could kick in and rotate a machine you are still relying on.

Kudos to Erkan for creating this great list of Scenarios:

Example Scenarios (tested in Add MachineHealthCheck for all nodes in Openstack):
Setup: 1 Control Plane + 3 Infra
Case: The control plane node is unhealthy
Behavior: No remediation because of etcd quorum loss.

Setup: 1 Control Plane + 3 Infra
Case: One infra node is unhealthy
Behavior: Infra node will be remediated

Setup: 1 Control Plane + 3 Infra
Case: 2 infra nodes are unhealthy
Behavior: No remediation because of maxUnhealthy limit (50% > 40%)

Setup: 3 Control Plane + 3 Infra
Case: One control plane node is unhealthy
Behavior: Control plane node will be remediated

Setup: 3 Control Plane + 3 Infra
Case: One control plane, one infra node are unhealthy
Behavior: Both will be remediated (33% < 40%)

Setup: 3 Control Plane + 3 Infra
Case: Two infra nodes are unhealthy
Behavior: Both will be remediated (33% < 40%)

Setup: 3 Control Plane + 3 Infra
Case: Two control plane nodes are unhealthy
Behavior: No remediation because of etcd quorum loss.

Open issues for other providers:

@erkanerol

The only "critical" path i still see is the fact that if an etcd cluster is "broken" and a human tries to fix the etcd cluster by doing some manual steps, the MachineHealthCheck could kick in and rotate a machine you're still relying on.

To solve this issue, we were disabling/enabling machine health checks while fixing the cluster manually.

# to disable MachineHealthCheck temporarily
function annotate_machines_to_disable_remediation(){
    local namespace="$1"
    local cluster="$2"
    kubectl -n "$namespace" annotate machine -l cluster.x-k8s.io/cluster-name="$cluster" "cluster.x-k8s.io/skip-remediation"=""
}

# to re-enable remediation by removing the skip-remediation annotation again
function deannotate_machines_to_enable_remediation(){
    local namespace="$1"
    local cluster="$2"
    kubectl -n "$namespace" annotate machine -l cluster.x-k8s.io/cluster-name="$cluster" "cluster.x-k8s.io/skip-remediation"-
}
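
A hypothetical invocation (namespace and cluster name are placeholders):

# disable remediation, do the manual etcd surgery, then re-enable it
annotate_machines_to_disable_remediation "org-example" "example"
# ... manual fixes ...
deannotate_machines_to_enable_remediation "org-example" "example"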

See https://intranet.giantswarm.io/docs/support-and-ops/ops-recipes/remove-errors-from-capi-capo-crs/


The biggest issue we had in CAPO was a bug in the CAPO controller that added transient errors to the infrastructure CRs, which were then propagated to the CAPI CRs as well. Because of the maxUnhealthy limit, we had to delete those errors manually to fix the clusters. See https://github.com/giantswarm/giantswarm/issues/22443
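
Purely as an illustration of that manual cleanup (not the exact steps from the ops recipe), removing the propagated error fields from a Machine boils down to something like the following, assuming kubectl >= 1.24 for --subresource and a made-up machine name:

# clear the stale failure fields on the Machine status (hypothetical object names)
kubectl -n org-example patch machine example-control-plane-abcde \
    --subresource=status --type=merge \
    -p '{"status":{"failureReason":null,"failureMessage":null}}'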

@fiunchinho

@calvix @erkanerol @bavarianbidi do you think having two different MachineHealthChecks (one for the control plane and another one for the workers) is needed / better? Any issues re-using the same MachineHealthCheck for both the CP and the workers?

@erkanerol

It depends on whether you want to define different unhealthyConditions for the CP nodes. Note that maxUnhealthy is applied across all machines targeted by the MHC, so a single MHC means a single budget shared by the CP and the workers.

For CAPVCD, I added one for the worker nodes only. I want to observe the experience for a while.

@bavarianbidi

As control plane nodes are a bit "different" in several respects, I would propose creating two separate MachineHealthChecks.
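
If we split them, the worker MachineHealthCheck mainly needs a different selector, e.g. matching every machine that does not carry the control plane label. Again just a sketch with placeholder names, mirroring the control plane example above:

cat <<'EOF' | kubectl apply -f -
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-workers-unhealthy-5m    # hypothetical name
  namespace: org-example                # hypothetical namespace
spec:
  clusterName: example                  # hypothetical cluster name
  maxUnhealthy: 40%
  selector:
    matchExpressions:
      # everything that is not a control plane machine
      - key: cluster.x-k8s.io/control-plane
        operator: DoesNotExist
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
EOF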

@calvix

calvix commented Feb 28, 2023

So in the end, this is applied only to the control plane nodes, as stated in the issue title, since support for machine pools indeed does not exist, and I added a filter to avoid killing the bastion node.

@calvix calvix closed this as completed Feb 28, 2023
@github-project-automation github-project-automation bot moved this from Ready Soon (<4 weeks) to Released 🎉 in Roadmap Feb 28, 2023
@alex-dabija alex-dabija reopened this Mar 1, 2023
@alex-dabija

alex-dabija commented Mar 1, 2023

We'll update the following MCs:

  • golem;
  • goat;
  • grizzly.

@calvix

calvix commented Mar 1, 2023

done
