on-cluster health checks - [Epic] #141

Open · 5 of 19 tasks

philbrookes (Collaborator) opened this issue May 28, 2024 · 0 comments

Prior Art

https://github.com/Kuadrant/multicluster-gateway-controller/tree/60f13a1f7ad8f2b82e3f344a425285f69fb91223/pkg/dns/health

Terminology

  • Leaf Record: Either a CNAME or IP address taken from the status of a gateway

Tasks

Executing health checks

Consulting health checks

E2E Test cases

done under #282

  • An unhealthy target is removed from a leaf record when other healthy targets are present on the same leaf
  • An unhealthy leaf is preserved when there is no path from the root domain to a healthy leaf record
  • Unhealthy endpoints are not published (not removed if already published)
  • Unhealthy workload is noted correctly in health check probe CR and DNSRecord CR
  • Healthy workload is noted correctly in health check probe CR and DNSRecord CR
  • HealthCheckProbe CR is updated correctly when an unhealthy endpoint becomes healthy

done under x

  • Metrics are emitted when an unhealthy workload is detected
  • An unhealthy leaf record and its dead branch are removed when the provider has a path from the root domain to a healthy leaf record
  • Multiple controllers removing records at the same time will not leave the provider with no leaf records

Black box testing

  • Add black box tests (test from the user's perspective)

Load Testing

  • Add a test for a gateway with 64 listeners and 2 CNAMEs resolving to 2 IPs (i.e. 128 probes against 2 IPs)

Documenting Health Checks

Current State

  • We only have an implementation for AWS DNS health checks, and they only function if the endpoint is an A record, not a CNAME.
  • We do not have a known way of implementing GCP health checks if the clusters are not in Google Cloud.
  • We have no current plan for implementing Azure health checks; they work quite differently from AWS health checks.

Use cases we want to solve

  • As a cluster admin, I want to ensure that NXDomain responses are avoided when all endpoints are unhealthy.
  • As a cluster admin, I want to ensure that unhealthy endpoints are not included in DNS lookup responses.
  • As a cluster admin, I want to be able to set health checks against CNAME records as well as A records.
  • As a cluster admin, I want to be able to create health checks regardless of my DNS Provider.

Proposed approach

We will implement local health checks: a probe running on the cluster requests the workload through the external gateway, simulating real internet traffic.
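As a minimal Go sketch of such a probe, assuming hypothetical names (gatewayAddr, hostname, path) and a plain HTTPS GET as the check:

```go
package health

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// probeOnce sends a single health check to hostname+path, but dials the
// gateway's external address directly, so the request traverses the
// gateway the way real internet traffic would, without depending on the
// DNS records the probe is meant to validate.
func probeOnce(gatewayAddr, hostname, path string) error {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			// Pin every connection to the gateway's address while the URL
			// keeps the workload hostname, preserving the Host header and
			// SNI so the gateway routes to the correct listener.
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				_, port, err := net.SplitHostPort(addr)
				if err != nil {
					return nil, err
				}
				return dialer.DialContext(ctx, network, net.JoinHostPort(gatewayAddr, port))
			},
		},
	}
	resp, err := client.Get(fmt.Sprintf("https://%s%s", hostname, path))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return fmt.Errorf("probe failed: status %d", resp.StatusCode)
	}
	return nil
}
```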

This will not require any changes to our API; we can reuse the existing health check specification in the DNS Policy exactly as is.

The results of the probe will be stored on a CR locally (one per probe), and also emitted as metrics.
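For the metrics side, a minimal sketch using prometheus/client_golang; the metric name and labels here are assumptions, not settled API:

```go
package health

import "github.com/prometheus/client_golang/prometheus"

// probeHealthy reports 1 if the most recent probe of a target succeeded
// and 0 otherwise. Metric name and labels are illustrative only.
var probeHealthy = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "dns_health_check_healthy",
		Help: "1 if the most recent probe of the target succeeded, 0 otherwise.",
	},
	[]string{"hostname", "target"},
)

func init() {
	prometheus.MustRegister(probeHealthy)
}

// recordResult would be called after every probe execution, alongside
// the probe CR update.
func recordResult(hostname, target string, healthy bool) {
	v := 0.0
	if healthy {
		v = 1.0
	}
	probeHealthy.WithLabelValues(hostname, target).Set(v)
}
```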

When is a probe unhealthy

A probe will write a few pieces of information to its probe CR:

  • When it last checked
  • How many consecutive failures have occurred
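As a rough Go sketch (field names are assumptions, not the final API), the probe CR status could carry those two fields like this:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// HealthCheckProbeStatus holds the two fields listed above; the names
// here are illustrative, not the operator's actual API.
type HealthCheckProbeStatus struct {
	// LastCheckedAt is when the probe last executed, whatever the result.
	LastCheckedAt metav1.Time `json:"lastCheckedAt,omitempty"`
	// ConsecutiveFailures counts failed checks since the last success;
	// a successful check resets it to zero.
	ConsecutiveFailures int `json:"consecutiveFailures,omitempty"`
}
```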

When is a record unhealthy

The DNS Policy will specify a fault tolerance; if the consecutive failures on the relevant probe CR exceed that number, the corresponding record is considered unhealthy, unless the last-checked time is too old (i.e. the probe has stopped updating the probe CR), in which case the record is treated as healthy.
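A sketch of that decision in Go, with illustrative parameter names; the fault tolerance comes from the DNS Policy, and the staleness window is an assumption:

```go
package health

import "time"

// recordUnhealthy applies the rule above: a record is unhealthy only if
// its probe is still actively reporting AND the consecutive failures
// exceed the policy's fault tolerance.
func recordUnhealthy(lastChecked time.Time, consecutiveFailures, failureThreshold int, staleAfter time.Duration) bool {
	if time.Since(lastChecked) > staleAfter {
		// The probe has stopped updating its CR; don't act on stale
		// data, so treat the record as healthy.
		return false
	}
	return consecutiveFailures > failureThreshold
}
```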

When are unhealthy records removed from the zone

A record is removed from the zone if:

  • there are other records left in the zone that can respond (i.e. within the same GEO, if defined),
  • AND the probe is unhealthy.

Before removing a record, the zone is consulted first. After a record is removed, the owner of that record enters a validation loop to ensure at least one record will still be returned (for its GEO, or globally); if no records are found, it re-publishes its own record, regardless of health.
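A sketch of that remove-then-validate flow, against a hypothetical zone-client abstraction (only the control flow matters here):

```go
package health

// Record and ZoneClient are hypothetical stand-ins for the operator's
// DNS provider abstraction.
type Record struct {
	Name  string
	Scope string // GEO code, or "" for global
}

type ZoneClient interface {
	ListRecords(scope string) ([]Record, error)
	DeleteRecord(Record) error
	PublishRecord(Record) error
}

// removeIfSafe removes an unhealthy record only when other records in
// the same scope can still answer, then re-validates: if a concurrent
// removal by another controller emptied the scope, it re-publishes its
// own record (regardless of health) so the zone never returns NXDomain.
func removeIfSafe(zone ZoneClient, rec Record) error {
	others, err := zone.ListRecords(rec.Scope)
	if err != nil {
		return err
	}
	if len(others) <= 1 {
		return nil // last record standing: keep it, even if unhealthy
	}
	if err := zone.DeleteRecord(rec); err != nil {
		return err
	}
	remaining, err := zone.ListRecords(rec.Scope)
	if err != nil {
		return err
	}
	if len(remaining) == 0 {
		return zone.PublishRecord(rec)
	}
	return nil
}
```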

Update our tests to include coverage of the health check probes.

Tradeoffs

  • We will not be able to report on the health of the workload from other geographical areas.
  • If the cluster goes away, the controller dies, or the controller is denied access to the zone, the unhealthy records will stay in the DNS response until manual intervention.
  • If all clusters but one are unhealthy, and the last healthy cluster is gracefully deleted, there will temporarily (until the next time an unhealthy cluster reconciles) be an empty zone.
  • If the controller's network configuration lets it reach the workload when the internet cannot, or vice versa, the health check probe will be inaccurate.
  • If the probes are failing to execute, or failing to update the probe CR, then all endpoints will be considered healthy.

Related Information

initial thoughts on health checks, and potential for cross-cluster health checks in the future: here

@philbrookes philbrookes changed the title on-cluster health checks Feature: on-cluster health checks May 28, 2024
@philbrookes philbrookes self-assigned this May 28, 2024
@maleck13 maleck13 added this to the kuadrant-v1 milestone May 31, 2024
@maleck13 maleck13 changed the title Feature: on-cluster health checks on-cluster health checks May 31, 2024
@philbrookes philbrookes added next and removed next labels Jun 13, 2024
@maleck13 maleck13 changed the title on-cluster health checks on-cluster health checks - [Epic] Jul 25, 2024
@philbrookes philbrookes removed their assignment Aug 1, 2024
@maleck13 maleck13 added the kind/epic Epic label Aug 22, 2024