Add documentation for MachineHealthChecks
JoelSpeed committed Mar 6, 2020
1 parent ea7a9f4 commit 011ed38
Showing 6 changed files with 153 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/book/src/SUMMARY.md
@@ -10,6 +10,7 @@
- [Using Custom Certificates](./tasks/certs/using-custom-certificates.md)
- [Generating a Kubeconfig](./tasks/certs/generate-kubeconfig.md)
- [Upgrade](./tasks/upgrade.md)
- [Configure a MachineHealthCheck](./tasks/healthcheck.md)
- [clusterctl CLI](./clusterctl/overview.md)
- [clusterctl Commands](clusterctl/commands/commands.md)
- [init](clusterctl/commands/init.md)
@@ -30,6 +31,7 @@
- [Machine](./developer/architecture/controllers/machine.md)
- [MachineSet](./developer/architecture/controllers/machine-set.md)
- [MachineDeployment](./developer/architecture/controllers/machine-deployment.md)
- [MachineHealthCheck](./developer/architecture/controllers/machine-health-check.md)
- [Control Plane](./developer/architecture/controllers/control-plane.md)
- [Provider Implementers](./developer/providers/implementers.md)
- [v1alpha1 to v1alpha2](./developer/providers/v1alpha1-to-v1alpha2.md)
@@ -0,0 +1,12 @@
# MachineHealthCheck

A MachineHealthCheck is responsible for remediating unhealthy [Machines](./machine.md).

Its main responsibilities are:
* Checking the health of Nodes in [target clusters] against a list of unhealthy conditions
* Remediating Machines for Nodes determined to be unhealthy

![](../../../images/machinehealthcheck-controller.png)

<!-- links -->
[target clusters]: ../../../reference/glossary.md#target-cluster
35 changes: 35 additions & 0 deletions docs/book/src/images/machinehealthcheck-controller.plantuml
@@ -0,0 +1,35 @@

@startuml machinehealthcheck-controller

start;
:Machine Health Check controller;
repeat
  repeat
    :MachineHealthCheck controller enqueues a Reconcile call;
    if (Nodes being watched in remote cluster) then (no)
      :Watch nodes in remote cluster;
    else (yes)
    endif
    :Find targets: Machines matched by selector plus respective Nodes;
    :Health check targets: Determine which Machines require remediation;
  repeat while (Remediations are allowed (current unhealthy <= max unhealthy)) is (no)
  -> yes;
  repeat
    if (Target requires remediation) then (yes)
      if (Machine is owned by a MachineSet) then (yes)
        if (Machine is a Control Plane Machine) then (no)
          #LightBlue:Delete Machine;
        else (yes)
        endif
      else (no)
      endif
    else (no)
    endif
  repeat while (more Targets) is (yes)
  -> no;
repeat while (Targets likely to go unhealthy) is (yes: requeue with minimum
time before timeout as delay)
-> no;
stop;

@enduml
95 changes: 95 additions & 0 deletions docs/book/src/tasks/healthcheck.md
@@ -0,0 +1,95 @@
# Configure a MachineHealthCheck

## Prerequisites

Before attempting to configure a MachineHealthCheck, you should have a working [management cluster] with at least one MachineDeployment or MachineSet deployed.

## What is a MachineHealthCheck?

A MachineHealthCheck is a resource within the Cluster API which allows users to define conditions under which Machines within a Cluster should be considered unhealthy.

When defining a MachineHealthCheck, users specify a timeout for each condition to check on the Machine's Node.
If any of these conditions is met for the duration of its timeout, the Machine will be remediated.
Remediating a Machine should trigger the creation of a new Machine to replace the failed one.
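
The timeout check described above can be sketched roughly as follows. This is an illustration only, not the controller's code: the `NodeCondition` and `UnhealthyCondition` classes and the `needs_remediation` function are hypothetical names, and the sketch assumes that the duration of a match is measured from the Node condition's last transition time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class NodeCondition:
    """Hypothetical stand-in for a condition reported on a Node."""
    type: str
    status: str
    last_transition_time: datetime


@dataclass
class UnhealthyCondition:
    """Hypothetical stand-in for one entry of unhealthyConditions."""
    type: str
    status: str
    timeout: timedelta


def needs_remediation(node_conditions, unhealthy_conditions, now):
    """Return True if any unhealthy condition has matched for at least its timeout."""
    for unhealthy in unhealthy_conditions:
        for cond in node_conditions:
            if cond.type != unhealthy.type or cond.status != unhealthy.status:
                continue
            # Assumption: the match duration is counted from the last transition.
            if now - cond.last_transition_time >= unhealthy.timeout:
                return True
    return False


now = datetime(2020, 3, 6, 12, 0, tzinfo=timezone.utc)
checks = [
    UnhealthyCondition("Ready", "Unknown", timedelta(seconds=300)),
    UnhealthyCondition("Ready", "False", timedelta(seconds=300)),
]
not_ready_6m = [NodeCondition("Ready", "False", now - timedelta(minutes=6))]
not_ready_1m = [NodeCondition("Ready", "False", now - timedelta(minutes=1))]

print(needs_remediation(not_ready_6m, checks, now))  # True
print(needs_remediation(not_ready_1m, checks, now))  # False
```

A Node that has been `Ready=False` for six minutes exceeds the 300 second timeout, while one that changed state a minute ago does not.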

## Creating a MachineHealthCheck

Use the following example as a basis for creating a MachineHealthCheck:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: capi-quickstart-node-unhealthy-5m
spec:
  # clusterName is required to associate this MachineHealthCheck with a particular cluster
  clusterName: capi-quickstart
  # (Optional) maxUnhealthy prevents further remediation if the cluster is already partially unhealthy
  maxUnhealthy: 40%
  # (Optional) nodeStartupTimeout determines how long a MachineHealthCheck should wait for
  # a Node to join the cluster, before considering a Machine unhealthy
  nodeStartupTimeout: 10m
  # selector is used to determine which Machines should be health checked
  selector:
    matchLabels:
      nodepool: nodepool-0
  # Conditions to check on Nodes for matched Machines, if any condition is matched for the duration of its timeout, the Machine is considered unhealthy
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```

## Remediation short-circuiting
To ensure that MachineHealthChecks only remediate Machines when the cluster is otherwise healthy,
the `maxUnhealthy` field within the MachineHealthCheck spec short-circuits remediation once too many Machines are unhealthy.

If the user defines a value for the `maxUnhealthy` field (either an absolute number or a percentage of the total Machines checked by this MachineHealthCheck),
before remediating any Machines, the MachineHealthCheck will compare the value of `maxUnhealthy` with the number of Machines it has determined to be unhealthy.
If the number of unhealthy Machines exceeds the limit set by `maxUnhealthy`, remediation will **not** be performed.

<aside class="note warning">

<h1> Warning </h1>

The default value for `maxUnhealthy` is `100%`.
This means the short-circuiting mechanism is **disabled by default** and Machines will be remediated no matter the state of the cluster.

</aside>

#### With an absolute value

If `maxUnhealthy` is set to `2`:
- If 2 or fewer nodes are unhealthy, remediation will be performed
- If 3 or more nodes are unhealthy, remediation will not be performed

These values are independent of how many Machines are being checked by the MachineHealthCheck.

#### With percentages

If `maxUnhealthy` is set to `40%` and there are 25 Machines being checked:
- If 10 or fewer nodes are unhealthy, remediation will be performed
- If 11 or more nodes are unhealthy, remediation will not be performed

If `maxUnhealthy` is set to `40%` and there are 6 Machines being checked:
- If 2 or fewer nodes are unhealthy, remediation will be performed
- If 3 or more nodes are unhealthy, remediation will not be performed

Note, when the percentage is not a whole number, the allowed number is rounded down.
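
The arithmetic in the examples above can be reproduced with a short sketch. The function names `max_unhealthy_count` and `remediation_allowed` are hypothetical and only mirror the rules stated in this section: an absolute value is used as-is, a percentage is applied to the number of Machines being checked and rounded down, and remediation proceeds only while the unhealthy count does not exceed the limit.

```python
def max_unhealthy_count(max_unhealthy, total_machines):
    """Resolve maxUnhealthy (an int, or a string such as "40%") to a Machine count."""
    if isinstance(max_unhealthy, str) and max_unhealthy.endswith("%"):
        percent = int(max_unhealthy[:-1])
        # Integer division rounds a fractional result down.
        return percent * total_machines // 100
    return int(max_unhealthy)


def remediation_allowed(max_unhealthy, total_machines, unhealthy_machines):
    """Return True if the unhealthy count does not exceed the resolved limit."""
    return unhealthy_machines <= max_unhealthy_count(max_unhealthy, total_machines)


print(max_unhealthy_count("40%", 25))      # 10
print(max_unhealthy_count("40%", 6))       # 2 (2.4 rounded down)
print(remediation_allowed("40%", 6, 3))    # False
print(remediation_allowed(2, 100, 2))      # True
```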

## Limitations and Caveats of a MachineHealthCheck

Before deploying a MachineHealthCheck, please familiarise yourself with the following limitations and caveats:

- Only Machines owned by a MachineSet will be remediated by a MachineHealthCheck
- Control Plane Machines are currently not supported and will **not** be remediated if they are unhealthy
- If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
- If no Node joins the cluster for a Machine within the `nodeStartupTimeout`, the Machine will be remediated
- If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately
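
As a rough summary, the caveats above can be read as a single decision function. The `Machine` class, its fields, and `should_remediate` are hypothetical names chosen for this sketch and do not correspond to the controller's actual types. The order of the checks is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    """Hypothetical, simplified view of a Machine for illustration."""
    owned_by_machine_set: bool
    is_control_plane: bool
    failure_reason: Optional[str] = None
    node_was_seen: bool = False   # a Node has joined for this Machine at some point
    node_exists: bool = False     # that Node is still present in the cluster
    seconds_since_creation: float = 0.0


def should_remediate(machine, node_startup_timeout_s, conditions_timed_out=False):
    """Apply the caveats listed above to a single Machine."""
    # Only Machines owned by a MachineSet are remediated; control plane Machines are not.
    if not machine.owned_by_machine_set or machine.is_control_plane:
        return False
    # A Machine with a failure reason set is remediated immediately.
    if machine.failure_reason is not None:
        return True
    # A Node that joined and was later removed triggers immediate remediation.
    if machine.node_was_seen and not machine.node_exists:
        return True
    # No Node has joined yet: remediate once the startup timeout has passed.
    if not machine.node_was_seen:
        return machine.seconds_since_creation > node_startup_timeout_s
    # Otherwise fall back to the unhealthy-condition timeouts.
    return conditions_timed_out


worker = Machine(owned_by_machine_set=True, is_control_plane=False,
                 seconds_since_creation=900)
print(should_remediate(worker, node_startup_timeout_s=600))  # True
```

Here the worker Machine has waited 900 seconds without a Node joining, which exceeds the 600 second (`10m`) startup timeout used in the earlier example.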

<!-- links -->
[management cluster]: ../reference/glossary.md#management-cluster
9 changes: 9 additions & 0 deletions docs/book/src/user/concepts.md
@@ -60,6 +60,15 @@ MachineSets work similar to regular POD [ReplicaSets](https://kubernetes.io/docs

<!--TODO-->

### MachineHealthCheck

A "MachineHealthCheck" defines a set of conditions for Nodes which allow the user to specify when a Node should be considered unhealthy.
If the Node matches the unhealthy conditions for a given user-configured time, the MachineHealthCheck initiates remediation of the Node.

Remediation of Nodes is performed by deleting the Machine that created the Node.
MachineHealthChecks will only remediate Nodes if they are owned by a MachineSet.
This ensures that the Kubernetes cluster does not lose capacity, as the MachineSet will create a new Machine to replace the failed Machine.

### BootstrapData

BootstrapData contains the machine or node role specific initialization data (usually cloud-init) used by the infrastructure provider to bootstrap a machine into a node.