diff --git a/docs/book/src/SUMMARY.md b/docs/book/src/SUMMARY.md index 845250d1b4de..dbd725f6db70 100644 --- a/docs/book/src/SUMMARY.md +++ b/docs/book/src/SUMMARY.md @@ -10,6 +10,7 @@ - [Using Custom Certificates](./tasks/certs/using-custom-certificates.md) - [Generating a Kubeconfig](./tasks/certs/generate-kubeconfig.md) - [Upgrade](./tasks/upgrade.md) + - [Configure a MachineHealthCheck](./tasks/healthcheck.md) - [clusterctl CLI](./clusterctl/overview.md) - [clusterctl Commands](clusterctl/commands/commands.md) - [init](clusterctl/commands/init.md) @@ -30,6 +31,7 @@ - [Machine](./developer/architecture/controllers/machine.md) - [MachineSet](./developer/architecture/controllers/machine-set.md) - [MachineDeployment](./developer/architecture/controllers/machine-deployment.md) + - [MachineHealthCheck](./developer/architecture/controllers/machine-health-check.md) - [Control Plane](./developer/architecture/controllers/control-plane.md) - [Provider Implementers](./developer/providers/implementers.md) - [v1alpha1 to v1alpha2](./developer/providers/v1alpha1-to-v1alpha2.md) diff --git a/docs/book/src/developer/architecture/controllers/machine-health-check.md b/docs/book/src/developer/architecture/controllers/machine-health-check.md new file mode 100644 index 000000000000..827bf68f06f2 --- /dev/null +++ b/docs/book/src/developer/architecture/controllers/machine-health-check.md @@ -0,0 +1,12 @@ +# MachineHealthCheck + +A MachineHealthCheck is responsible for remediating unhealthy [Machines](./machine.md). 
+ +Its main responsibilities are: +* Checking the health of Nodes in [target clusters] against a list of unhealthy conditions +* Remediating Machines for Nodes determined to be unhealthy + +![](../../../images/machinehealthcheck-controller.png) + + +[target clusters]: ../../../reference/glossary.md#target-cluster diff --git a/docs/book/src/images/machinehealthcheck-controller.plantuml b/docs/book/src/images/machinehealthcheck-controller.plantuml new file mode 100644 index 000000000000..2481d302e7ff --- /dev/null +++ b/docs/book/src/images/machinehealthcheck-controller.plantuml @@ -0,0 +1,35 @@ + +@startuml machinehealthcheck-controller + +start; +:Machine Health Check controller; +repeat + repeat + :MachineHealthCheck controller enqueues a Reconcile call; + if (Nodes being watched in remote cluster) then (no) + :Watch nodes in remote cluster; + else (yes) + endif + :Find targets: Machines matched by selector plus respective Nodes; + :Health check targets: Determine which Machines require remediation; + repeat while (Remediations are allowed (current unhealthy <= max unhealthy)) is (no) + -> yes; + repeat + if (Target requires remediation) then (yes) + if (Machine is owned by a MachineSet) then (yes) + if (Machine is a Control Plane Machine) then (no) + #LightBlue:Delete Machine; + else (yes) + endif + else (no) + endif + else (no) + endif + repeat while (more Targets) is (yes) + -> no; +repeat while (Targets likely to go unhealthy) is (yes: requeue with minimum + time before timeout as delay) +-> no; +stop; + +@enduml diff --git a/docs/book/src/images/machinehealthcheck-controller.png b/docs/book/src/images/machinehealthcheck-controller.png new file mode 100644 index 000000000000..1c9a60d4478f Binary files /dev/null and b/docs/book/src/images/machinehealthcheck-controller.png differ diff --git a/docs/book/src/tasks/healthcheck.md b/docs/book/src/tasks/healthcheck.md new file mode 100644 index 000000000000..bc82857babd1 --- /dev/null +++ 
b/docs/book/src/tasks/healthcheck.md @@ -0,0 +1,95 @@ +# Configure a MachineHealthCheck + +## Prerequisites + +Before attempting to configure a MachineHealthCheck, you should have a working [management cluster] with at least one MachineDeployment or MachineSet deployed. + +## What is a MachineHealthCheck? + +A MachineHealthCheck is a resource within the Cluster API which allows users to define conditions under which Machines within a Cluster should be considered unhealthy. + +When defining a MachineHealthCheck, users specify a timeout for each of the conditions that they define to check on the Machine's Node. +If any of these conditions is met for the duration of the timeout, the Machine will be remediated. +The action of remediating a Machine should trigger a new Machine to be created, to replace the failed one. + +## Creating a MachineHealthCheck + +Use the following example as a basis for creating a MachineHealthCheck: + +```yaml +apiVersion: cluster.x-k8s.io/v1alpha3 +kind: MachineHealthCheck +metadata: + name: capi-quickstart-node-unhealthy-5m +spec: + # clusterName is required to associate this MachineHealthCheck with a particular cluster + clusterName: capi-quickstart + # (Optional) maxUnhealthy prevents further remediation if the cluster is already partially unhealthy + maxUnhealthy: 40% + # (Optional) nodeStartupTimeout determines how long a MachineHealthCheck should wait for + # a Node to join the cluster, before considering a Machine unhealthy + nodeStartupTimeout: 10m + # selector is used to determine which Machines should be health checked + selector: + matchLabels: + nodepool: nodepool-0 + # Conditions to check on Nodes for matched Machines. If any condition is matched for the duration of its timeout, the Machine is considered unhealthy + unhealthyConditions: + - type: Ready + status: Unknown + timeout: 300s + - type: Ready + status: "False" + timeout: 300s +``` + +## Remediation short-circuiting + +To ensure that MachineHealthChecks only remediate 
Machines when the cluster is healthy, +short-circuiting is implemented to prevent further remediation via the `maxUnhealthy` field within the MachineHealthCheck spec. + +If the user defines a value for the `maxUnhealthy` field (either an absolute number or a percentage of the total Machines checked by this MachineHealthCheck), +before remediating any Machines, the MachineHealthCheck will compare the value of `maxUnhealthy` with the number of Machines it has determined to be unhealthy. +If the number of unhealthy Machines exceeds the limit set by `maxUnhealthy`, remediation will **not** be performed. + + + +#### With an absolute value + +If `maxUnhealthy` is set to `2`: +- If 2 or fewer nodes are unhealthy, remediation will be performed +- If 3 or more nodes are unhealthy, remediation will not be performed + +These values are independent of how many Machines are being checked by the MachineHealthCheck. + +#### With percentages + +If `maxUnhealthy` is set to `40%` and there are 25 Machines being checked: +- If 10 or fewer nodes are unhealthy, remediation will be performed +- If 11 or more nodes are unhealthy, remediation will not be performed + +If `maxUnhealthy` is set to `40%` and there are 6 Machines being checked: +- If 2 or fewer nodes are unhealthy, remediation will be performed +- If 3 or more nodes are unhealthy, remediation will not be performed + +Note: when the percentage of the total does not yield a whole number (for example, 40% of 6 is 2.4), the allowed number is rounded down. 
+ +## Limitations and Caveats of a MachineHealthCheck + +Before deploying a MachineHealthCheck, please familiarise yourself with the following limitations and caveats: + +- Only Machines owned by a MachineSet will be remediated by a MachineHealthCheck +- Control Plane Machines are currently not supported and will **not** be remediated if they are unhealthy +- If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately +- If no Node joins the cluster for a Machine after the `NodeStartupTimeout`, the Machine will be remediated +- If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately + + +[management cluster]: ../reference/glossary.md#management-cluster diff --git a/docs/book/src/user/concepts.md b/docs/book/src/user/concepts.md index b2bcccf7ae3e..728daa549464 100644 --- a/docs/book/src/user/concepts.md +++ b/docs/book/src/user/concepts.md @@ -60,6 +60,15 @@ MachineSets work similar to regular POD [ReplicaSets](https://kubernetes.io/docs +### MachineHealthCheck + +A "MachineHealthCheck" defines a set of conditions for Nodes which allow the user to specify when a Node should be considered unhealthy. +If the Node matches the unhealthy conditions for a given user-configured time, the MachineHealthCheck initiates remediation of the Node. + +Remediation of Nodes is performed by deleting the Machine that created the Node. +MachineHealthChecks will only remediate Nodes if they are owned by a MachineSet. +This ensures that the Kubernetes cluster does not lose capacity, as the MachineSet will create a new Machine to replace the failed Machine. + ### BootstrapData BootstrapData contains the machine or node role specific initialization data (usually cloud-init) used by the infrastructure provider to bootstrap a machine into a node.