
CAPA control plane machine health checks #1978

Closed
alex-dabija opened this issue Feb 8, 2023 · 8 comments
Labels
area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service kind/story provider/cluster-api-aws Cluster API based running on AWS topic/capi

Comments

@alex-dabija

alex-dabija commented Feb 8, 2023

Story

As a cluster admin, I want the control plane nodes to be recreated if basic machine health checks fail, in order to improve the stability of CAPA clusters.

Towards epic.

Background

CAPA clusters with control plane machines in invalid states fail to recover because there are no health checks configured.

CAPI supports machine health checks for control plane nodes and they need to be configured for CAPA clusters.
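
For orientation, a MachineHealthCheck targeting the control plane machines could look roughly like the sketch below; the name, namespace, cluster name, and timeouts are placeholders rather than the values we will actually ship:

cat <<'EOF' | kubectl apply -f -
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-control-plane-unhealthy-5m    # hypothetical name
  namespace: org-example                      # hypothetical namespace
spec:
  clusterName: example                        # hypothetical cluster name
  maxUnhealthy: 40%
  nodeStartupTimeout: 20m
  selector:
    matchLabels:
      # CAPI puts this label on control plane machines
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
EOF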

Links

@alex-dabija alex-dabija added area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service team/hydra topic/capi provider/cluster-api-aws Cluster API based running on AWS kind/story labels Feb 9, 2023
@alex-dabija alex-dabija moved this to Ready Soon (<4 weeks) in Roadmap Feb 9, 2023
@bavarianbidi

My 2 cents:

Contrary to my initial concerns about having MachineHealthChecks for KCP, the current CAPI implementation strives not to break an existing etcd cluster (see the canSafelyRemoveEtcdMember func in the KCP remediation code).

The only "critical" path I still see: if an etcd cluster is "broken" and a human is trying to fix it with manual steps, the MachineHealthCheck could kick in and rotate a machine you are still relying on.

Kudos to Erkan for creating this great list of Scenarios:

Example Scenarios (tested in Add MachineHealthCheck for all nodes in Openstack):
Setup: 1 Control Plane + 3 Infra
Case: The control plane node is unhealthy
Behavior: No remediation because of etcd quorum loss.

Setup: 1 Control Plane + 3 Infra
Case: One infra node is unhealthy
Behavior: Infra node will be remediated

Setup: 1 Control Plane + 3 Infra
Case: 2 infra nodes are unhealthy
Behavior: No remediation because of maxUnhealthy limit (50% > 40%)

Setup: 3 Control Plane + 3 Infra
Case: One control plane node is unhealthy
Behavior: Control plane node will be remediated

Setup: 3 Control Plane + 3 Infra
Case: One control plane, one infra node are unhealthy
Behavior: Both will be remediated (33% < 40%)

Setup: 3 Control Plane + 3 Infra
Case: Two infra nodes are unhealthy
Behavior: Both will be remediated (33% < 40%)

Setup: 3 Control Plane + 3 Infra
Case: Two control plane nodes are unhealthy
Behavior: No remediation because of etcd quorum loss.

Open issues for other providers:

@erkanerol

The only "critical" path i still see is the fact that if an etcd cluster is "broken" and a human tries to fix the etcd cluster by doing some manual steps, the MachineHealthCheck could kick in and rotate a machine you're still relying on.

To solve this issue, we were disabling/enabling machine health checks while fixing the cluster manually.

# to disable MachineHealthCheck temporarily
function annotate_machines_to_disable_remediation(){
    local namespace="$1"
    local cluster="$2"
    kubectl -n "$namespace" annotate machine -l cluster.x-k8s.io/cluster-name="$cluster" "cluster.x-k8s.io/skip-remediation"=""
}

# to re-enable remediation by removing the skip-remediation annotation again
function deannotate_machines_to_enable_remediation(){
    local namespace="$1"
    local cluster="$2"
    kubectl -n "$namespace" annotate machine -l cluster.x-k8s.io/cluster-name="$cluster" "cluster.x-k8s.io/skip-remediation"-
}
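
A hypothetical invocation (namespace and cluster name are placeholders):

# disable remediation, do the manual etcd surgery, then re-enable it
annotate_machines_to_disable_remediation "org-example" "example"
# ... manual fixes ...
deannotate_machines_to_enable_remediation "org-example" "example"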

See https://intranet.giantswarm.io/docs/support-and-ops/ops-recipes/remove-errors-from-capi-capo-crs/


The biggest issue we had in CAPO was a bug in the CAPO controller that added transient errors to the infrastructure CRs, which were then propagated to the CAPI CRs as well. Because of the maxUnhealthy limit, we had to delete those errors manually to fix the clusters. See https://github.com/giantswarm/giantswarm/issues/22443
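
Purely as an illustration of that manual cleanup (not the exact steps from the ops recipe), removing the propagated error fields from a Machine boils down to something like the following, assuming kubectl >= 1.24 for --subresource and a made-up machine name:

# clear the stale failure fields on the Machine status (hypothetical object names)
kubectl -n org-example patch machine example-control-plane-abcde \
    --subresource=status --type=merge \
    -p '{"status":{"failureReason":null,"failureMessage":null}}'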

@fiunchinho

@calvix @erkanerol @bavarianbidi do you think having two different MachineHealthChecks (one for the control plane and another one for the workers) is needed / better? Any issues re-using the same MachineHealthCheck for both the CP and the workers?

@erkanerol

It depends on whether you want to define different unhealthyConditions for the CP nodes. Note that maxUnhealthy is applied across all machines targeted by the MHC, so a single MHC means a single budget shared by the CP and the workers.

For CAPVCD, I added one for the worker nodes only. I want to observe the experience for a while.

@bavarianbidi

As control plane nodes are a bit "different" in several respects, I would propose creating two separate MachineHealthChecks.
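
If we split them, the worker MachineHealthCheck mainly needs a different selector, e.g. matching every machine that does not carry the control plane label. Again just a sketch with placeholder names, mirroring the control plane example above:

cat <<'EOF' | kubectl apply -f -
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-workers-unhealthy-5m    # hypothetical name
  namespace: org-example                # hypothetical namespace
spec:
  clusterName: example                  # hypothetical cluster name
  maxUnhealthy: 40%
  selector:
    matchExpressions:
      # everything that is not a control plane machine
      - key: cluster.x-k8s.io/control-plane
        operator: DoesNotExist
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
EOF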

@calvix

calvix commented Feb 28, 2023

So in the end, this is applied only to the control plane nodes, as stated in the issue title, since support for machine pools indeed does not exist, and I added a filter to avoid killing the bastion node.

@calvix calvix closed this as completed Feb 28, 2023
@github-project-automation github-project-automation bot moved this from Ready Soon (<4 weeks) to Released 🎉 in Roadmap Feb 28, 2023
@alex-dabija alex-dabija reopened this Mar 1, 2023
@alex-dabija

alex-dabija commented Mar 1, 2023

We'll update the following MCs:

  • golem;
  • goat;
  • grizzly.

@calvix

calvix commented Mar 1, 2023

done
