kubernetes-sigs · k8s-ci-robot · Mar 6, 2020 · Mar 6, 2020
diff --git a/docs/book/src/SUMMARY.md b/docs/book/src/SUMMARY.md
@@ -10,6 +10,7 @@
         - [Using Custom Certificates](./tasks/certs/using-custom-certificates.md)
         - [Generating a Kubeconfig](./tasks/certs/generate-kubeconfig.md)
     - [Upgrade](./tasks/upgrade.md)
+    - [Configure a MachineHealthCheck](./tasks/healthcheck.md)
 - [clusterctl CLI](./clusterctl/overview.md)
     - [clusterctl Commands](clusterctl/commands/commands.md)
         - [init](clusterctl/commands/init.md)
@@ -30,6 +31,7 @@
         - [Machine](./developer/architecture/controllers/machine.md)
         - [MachineSet](./developer/architecture/controllers/machine-set.md)
         - [MachineDeployment](./developer/architecture/controllers/machine-deployment.md)
+        - [MachineHealthCheck](./developer/architecture/controllers/machine-health-check.md)
         - [Control Plane](./developer/architecture/controllers/control-plane.md)
     - [Provider Implementers](./developer/providers/implementers.md)
         - [v1alpha1 to v1alpha2](./developer/providers/v1alpha1-to-v1alpha2.md)

diff --git a/docs/book/src/developer/architecture/controllers/machine-health-check.md b/docs/book/src/developer/architecture/controllers/machine-health-check.md
@@ -0,0 +1,12 @@
+# MachineHealthCheck
+
+A MachineHealthCheck is responsible for remediating unhealthy [Machines](./machine.md).
+
+Its main responsibilities are:
+* Checking the health of Nodes in [target clusters] against a list of unhealthy conditions
+* Remediating Machines for Nodes determined to be unhealthy
+
+![](../../../images/machinehealthcheck-controller.png)
+
+<!-- links -->
+[target clusters]: ../../../reference/glossary.md#target-cluster
diff --git a/docs/book/src/images/machinehealthcheck-controller.plantuml b/docs/book/src/images/machinehealthcheck-controller.plantuml
@@ -0,0 +1,35 @@
+
+@startuml machinehealthcheck-controller
+
+start;
+:MachineHealthCheck controller;
+repeat
+  repeat
+    :MachineHealthCheck controller enqueues a Reconcile call;
+    if (Nodes being watched in remote cluster) then (no)
+      :Watch nodes in remote cluster;
+    else (yes)
+    endif
+    :Find targets: Machines matched by selector plus respective Nodes;
+    :Health check targets: Determine which Machines require remediation;
+  repeat while (Remediations are allowed (current unhealthy <= max unhealthy)) is (no)
+  -> yes;
+  repeat
+    if (Target requires remediation) then (yes)
+      if (Machine is owned by a MachineSet) then (yes)
+        if (Machine is a Control Plane Machine) then (no)
+          #LightBlue:Delete Machine;
+        else (yes)
+        endif
+      else (no)
+      endif
+    else (no)
+    endif
+  repeat while (more Targets) is (yes)
+  -> no;
+repeat while (Targets likely to go unhealthy) is (yes: requeue with minimum
+  time before timeout as delay)
+-> no;
+stop;
+
+@enduml
diff --git a/docs/book/src/images/machinehealthcheck-controller.png b/docs/book/src/images/machinehealthcheck-controller.png
diff --git a/docs/book/src/tasks/healthcheck.md b/docs/book/src/tasks/healthcheck.md
@@ -0,0 +1,95 @@
+# Configure a MachineHealthCheck
+
+## Prerequisites
+
+Before attempting to configure a MachineHealthCheck, you should have a working [management cluster] with at least one MachineDeployment or MachineSet deployed.
+
+## What is a MachineHealthCheck?
+
+A MachineHealthCheck is a resource within the Cluster API that allows users to define conditions under which Machines within a Cluster should be considered unhealthy.
+
+When defining a MachineHealthCheck, users specify a timeout for each of the conditions that they define to check on the Machine's Node.
+If any of these conditions is met for the duration of its timeout, the Machine will be remediated.
+The action of remediating a Machine should trigger a new Machine to be created, to replace the failed one.
+
+## Creating a MachineHealthCheck
+
+Use the following example as a basis for creating a MachineHealthCheck:
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1alpha3
+kind: MachineHealthCheck
+metadata:
+  name: capi-quickstart-node-unhealthy-5m
+spec:
+  # clusterName is required to associate this MachineHealthCheck with a particular cluster
+  clusterName: capi-quickstart
+  # (Optional) maxUnhealthy prevents further remediation if the cluster is already partially unhealthy
+  maxUnhealthy: 40%
+  # (Optional) nodeStartupTimeout determines how long a MachineHealthCheck should wait for
+  # a Node to join the cluster, before considering a Machine unhealthy
+  nodeStartupTimeout: 10m
+  # selector is used to determine which Machines should be health checked
+  selector:
+    matchLabels:
+      nodepool: nodepool-0
+  # Conditions to check on Nodes for matched Machines. If any condition is matched for the duration of its timeout, the Machine is considered unhealthy
+  unhealthyConditions:
+  - type: Ready
+    status: Unknown
+    timeout: 300s
+  - type: Ready
+    status: "False"
+    timeout: 300s
+```
+
+## Remediation short-circuiting
+
+To ensure that MachineHealthChecks only remediate Machines when the cluster is healthy,
+short-circuiting is implemented via the `maxUnhealthy` field within the MachineHealthCheck spec to prevent further remediation.
+
+If the user defines a value for the `maxUnhealthy` field (either an absolute number or a percentage of the total Machines checked by this MachineHealthCheck),
+before remediating any Machines, the MachineHealthCheck will compare the value of `maxUnhealthy` with the number of Machines it has determined to be unhealthy.
+If the number of unhealthy Machines exceeds the limit set by `maxUnhealthy`, remediation will **not** be performed.
+
+<aside class="note warning">
+
+<h1> Warning </h1>
+
+The default value for `maxUnhealthy` is `100%`.
+This means the short-circuiting mechanism is **disabled by default** and Machines will be remediated no matter the state of the cluster.
+
+</aside>
+
+#### With an absolute value
+
+If `maxUnhealthy` is set to `2`:
+- If 2 or fewer Machines are unhealthy, remediation will be performed
+- If 3 or more Machines are unhealthy, remediation will not be performed
+
+These values are independent of how many Machines are being checked by the MachineHealthCheck.
+
+#### With percentages
+
+If `maxUnhealthy` is set to `40%` and there are 25 Machines being checked:
+- If 10 or fewer Machines are unhealthy, remediation will be performed
+- If 11 or more Machines are unhealthy, remediation will not be performed
+
+If `maxUnhealthy` is set to `40%` and there are 6 Machines being checked:
+- If 2 or fewer Machines are unhealthy, remediation will be performed
+- If 3 or more Machines are unhealthy, remediation will not be performed
+
+Note that when the percentage does not yield a whole number of Machines (for example, 40% of 6 is 2.4), the allowed number is rounded down (here, to 2).
+
+## Limitations and Caveats of a MachineHealthCheck
+
+Before deploying a MachineHealthCheck, please familiarise yourself with the following limitations and caveats:
+
+- Only Machines owned by a MachineSet will be remediated by a MachineHealthCheck
+- Control Plane Machines are currently not supported and will **not** be remediated if they are unhealthy
+- If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
+- If no Node joins the cluster for a Machine after the `nodeStartupTimeout`, the Machine will be remediated
+- If a Machine fails for any reason (that is, if its `FailureReason` is set), the Machine will be remediated immediately
+
+<!-- links -->
+[management cluster]: ../reference/glossary.md#management-cluster
diff --git a/docs/book/src/user/concepts.md b/docs/book/src/user/concepts.md
@@ -60,6 +60,15 @@ MachineSets work similar to regular POD [ReplicaSets](https://kubernetes.io/docs
 
 <!--TODO-->
 
+### MachineHealthCheck
+
+A "MachineHealthCheck" defines a set of conditions for Nodes which allow the user to specify when a Node should be considered unhealthy.
+If the Node matches the unhealthy conditions for a given user-configured time, the MachineHealthCheck initiates remediation of the Node.
+
+Remediation of Nodes is performed by deleting the Machine that created the Node.
+MachineHealthChecks will only remediate Nodes whose Machines are owned by a MachineSet.
+This ensures that the Kubernetes cluster does not lose capacity, as the MachineSet will create a new Machine to replace the failed Machine.
+
 ### BootstrapData
 
 BootstrapData contains the machine or node role specific initialization data (usually cloud-init) used by the infrastructure provider to bootstrap a machine into a node.