Current State
We only have an implementation for AWS DNS health checks, and they only function if the endpoint is an A record, not a CNAME.
We do not know of a way to implement GCP health checks if the clusters are not in Google Cloud.
We currently have no plan for implementing Azure health checks, which are implemented quite differently from AWS health checks.
Use cases we want to solve
As a cluster admin, I want to ensure that NXDomain responses are avoided when all endpoints are unhealthy.
As a cluster admin, I want to ensure that unhealthy endpoints will not be included in the DNS lookup response.
As a cluster admin, I want to be able to set health checks against CNAME records as well as A records.
As a cluster admin, I want to be able to create health checks regardless of my DNS Provider.
Proposed approach
We will implement local health checks: a probe running on the cluster sends requests to the workload on that cluster through the external gateway, to simulate real internet traffic.
This will not require any changes to our API; we can reuse the existing health check specification in the DNS Policy exactly as is.
The results of the probe will be stored on a CR locally (one per probe), and also emitted as metrics.
When is a probe unhealthy
A probe will write to a probe CR a few pieces of information:
When it last checked
How many consecutive failures have occurred
When is a record unhealthy
The DNS Policy will specify a fault tolerance. If the consecutive failures on the relevant probe CR exceed that number, the corresponding record is considered unhealthy, unless the last checked time is too old (i.e. the probe has stopped updating the probe CR).
When are unhealthy records removed from the zone
A record is removed from the zone if:
There are more records left in the zone that can respond (i.e. within the same GEO, if defined)
AND the probe is unhealthy
Before removing a record, the zone will be consulted. After a record is removed, the owner of that record will go into a validation loop to ensure that at least one record will be returned (for its GEO or globally). If no records are found, it will re-publish its own record, regardless of health.
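The removal condition and the follow-up validation step might look like the sketch below. All names here (`Record`, `ShouldRemove`, `ShouldRepublish`) are hypothetical, and the sketch makes two assumptions: records are identified by their owner, and "for its GEO or globally" means the owner's GEO when one is set and the whole zone otherwise.

```go
package main

import "fmt"

// Record is a hypothetical, simplified view of one published DNS record.
type Record struct {
	Owner string // cluster or controller that published the record
	Geo   string // empty when no GEO is defined
}

// othersInScope counts records from other owners that could answer in place
// of self: same GEO when self has one, otherwise any record in the zone.
func othersInScope(zone []Record, self Record) int {
	n := 0
	for _, r := range zone {
		if r.Owner == self.Owner {
			continue
		}
		if self.Geo != "" && r.Geo != self.Geo {
			continue
		}
		n++
	}
	return n
}

// ShouldRemove applies both conditions: the probe is unhealthy AND another
// record remains that can respond.
func ShouldRemove(zone []Record, self Record, unhealthy bool) bool {
	return unhealthy && othersInScope(zone, self) > 0
}

// ShouldRepublish is the validation step run after removal: the owner puts
// its record back, regardless of health, when nothing else would be returned.
func ShouldRepublish(zone []Record, self Record) bool {
	return othersInScope(zone, self) == 0
}

func main() {
	a := Record{Owner: "cluster-a", Geo: "EU"}
	b := Record{Owner: "cluster-b", Geo: "EU"}
	c := Record{Owner: "cluster-c", Geo: "US"}

	fmt.Println("remove a, peer in EU:", ShouldRemove([]Record{a, b, c}, a, true))
	fmt.Println("remove a, no peer in EU:", ShouldRemove([]Record{a, c}, a, true))
	fmt.Println("republish a, EU empty:", ShouldRepublish([]Record{c}, a))
}
```

Because the check and the removal are not atomic across clusters, two unhealthy owners could each see the other's record and both remove their own; the validation loop is what recovers from that case.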
Update our tests to include tests of the health check probes.
Tradeoffs
We will not be able to report on the health of the workload from other geographical areas.
If the cluster goes away, or the controller dies or is denied access to the zone, the unhealthy records will stay in the DNS response until manual intervention.
If all clusters but one are unhealthy, and the last healthy cluster is gracefully deleted, there will temporarily (until the next time an unhealthy cluster reconciles) be an empty zone.
If the controller is acting with a networking configuration that allows it to access itself when the internet cannot, or vice versa, the health check probe will be inaccurate.
If the probes are failing to execute, or failing to update the probe CR, then all endpoints will be considered healthy.
Related Information
initial thoughts on health checks, and potential for cross-cluster health checks in the future: here
Prior Art
https://github.com/Kuadrant/multicluster-gateway-controller/tree/60f13a1f7ad8f2b82e3f344a425285f69fb91223/pkg/dns/health
Terminology
Tasks
Executing health checks
Consulting health checks
E2E Test cases
done under #282
done under x
Black box testing
Load Testing
Documenting Health Checks