on-cluster health checks - [Epic] #141

Open · 5 of 19 tasks

philbrookes (Collaborator) opened this issue May 28, 2024 · 0 comments

Prior Art

https://github.com/Kuadrant/multicluster-gateway-controller/tree/60f13a1f7ad8f2b82e3f344a425285f69fb91223/pkg/dns/health

Terminology

  • Leaf Record: Either a CNAME or IP address taken from the status of a gateway

Tasks

Executing health checks

Consulting health checks

E2E Test cases

done under #282

  • An unhealthy target is removed from a leaf record when other healthy targets are present on the same leaf
  • An unhealthy leaf is preserved when there is no path from the root domain to a healthy leaf record
  • Unhealthy endpoints are not published (not removed if already published)
  • Unhealthy workload is noted correctly in health check probe CR and DNSRecord CR
  • Healthy workload is noted correctly in health check probe CR and DNSRecord CR
  • HealthCheckProbe CR is updated correctly when an unhealthy endpoint becomes healthy

done under x

  • Metrics are emitted when an unhealthy workload is detected
  • An unhealthy leaf record and its dead branch are removed when the provider has a path from the root domain to a healthy leaf record
  • Multiple controllers removing records at the same time will not leave the provider with no leaf records

Black box testing

  • Add black box tests (test from the user's perspective)

Load Testing

  • Add a test for a gateway with 64 listeners and 2 CNAMEs resolving to 2 IPs (i.e. 128 probes against 2 IPs)

Documenting Health Checks

Current State

  • We only have an implementation for AWS DNS health checks, and they only function if the endpoint is an A record, not a CNAME.
  • We do not have a known way of implementing GCP health checks if the clusters are not in Google Cloud.
  • We have no current plan for implementing Azure health checks; they work quite differently from AWS health checks.

Use cases we want to solve

  • As a cluster admin, I want to ensure that NXDomain responses are avoided when all endpoints are unhealthy.
  • As a cluster admin, I want to ensure that unhealthy endpoints are not included in DNS lookup responses.
  • As a cluster admin, I want to be able to set health checks against CNAME records as well as A records.
  • As a cluster admin, I want to be able to create health checks regardless of my DNS Provider.

Proposed approach

We will implement local health checks: a probe running on the cluster requests the workload through the external gateway, simulating real internet traffic.
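As a minimal Go sketch of such a probe, assuming hypothetical names (gatewayAddr, hostname, path) and a plain HTTPS GET as the check:

```go
package health

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// probeOnce sends a single health check to hostname+path, but dials the
// gateway's external address directly, so the request traverses the
// gateway the way real internet traffic would, without depending on the
// DNS records the probe is meant to validate.
func probeOnce(gatewayAddr, hostname, path string) error {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			// Pin every connection to the gateway's address while the URL
			// keeps the workload hostname, preserving the Host header and
			// SNI so the gateway routes to the correct listener.
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				_, port, err := net.SplitHostPort(addr)
				if err != nil {
					return nil, err
				}
				return dialer.DialContext(ctx, network, net.JoinHostPort(gatewayAddr, port))
			},
		},
	}
	resp, err := client.Get(fmt.Sprintf("https://%s%s", hostname, path))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return fmt.Errorf("probe failed: status %d", resp.StatusCode)
	}
	return nil
}
```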

This will not require any changes to our API; we can reuse the existing health check specification in the DNS Policy exactly as is.

The results of the probe will be stored on a CR locally (one per probe), and also emitted as metrics.
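For the metrics side, a minimal sketch using prometheus/client_golang; the metric name and labels here are assumptions, not settled API:

```go
package health

import "github.com/prometheus/client_golang/prometheus"

// probeHealthy reports 1 if the most recent probe of a target succeeded
// and 0 otherwise. Metric name and labels are illustrative only.
var probeHealthy = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "dns_health_check_healthy",
		Help: "1 if the most recent probe of the target succeeded, 0 otherwise.",
	},
	[]string{"hostname", "target"},
)

func init() {
	prometheus.MustRegister(probeHealthy)
}

// recordResult would be called after every probe execution, alongside
// the probe CR update.
func recordResult(hostname, target string, healthy bool) {
	v := 0.0
	if healthy {
		v = 1.0
	}
	probeHealthy.WithLabelValues(hostname, target).Set(v)
}
```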

When is a probe unhealthy

A probe will write a few pieces of information to its probe CR:

  • When it last checked
  • How many consecutive failures have occurred
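As a rough Go sketch (field names are assumptions, not the final API), the probe CR status could carry those two fields like this:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// HealthCheckProbeStatus holds the two fields listed above; the names
// here are illustrative, not the operator's actual API.
type HealthCheckProbeStatus struct {
	// LastCheckedAt is when the probe last executed, whatever the result.
	LastCheckedAt metav1.Time `json:"lastCheckedAt,omitempty"`
	// ConsecutiveFailures counts failed checks since the last success;
	// a successful check resets it to zero.
	ConsecutiveFailures int `json:"consecutiveFailures,omitempty"`
}
```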

When is a record unhealthy

The DNS Policy will specify a fault tolerance; if the consecutive failures on the relevant probe CR exceed that number, the corresponding record is considered unhealthy, unless the last-checked time is too old (i.e. the probe has stopped updating the probe CR), in which case the record is treated as healthy.
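A sketch of that decision in Go, with illustrative parameter names; the fault tolerance comes from the DNS Policy, and the staleness window is an assumption:

```go
package health

import "time"

// recordUnhealthy applies the rule above: a record is unhealthy only if
// its probe is still actively reporting AND the consecutive failures
// exceed the policy's fault tolerance.
func recordUnhealthy(lastChecked time.Time, consecutiveFailures, failureThreshold int, staleAfter time.Duration) bool {
	if time.Since(lastChecked) > staleAfter {
		// The probe has stopped updating its CR; don't act on stale
		// data, so treat the record as healthy.
		return false
	}
	return consecutiveFailures > failureThreshold
}
```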

When are unhealthy records removed from the zone

A record is removed from the zone if:

  • there are other records left in the zone that can respond (i.e. within the same GEO, if defined),
  • AND the probe is unhealthy.

Before removing a record, the zone is consulted first. After a record is removed, the owner of that record enters a validation loop to ensure at least one record will still be returned (for its GEO, or globally); if no records are found, it re-publishes its own record, regardless of health.
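A sketch of that remove-then-validate flow, against a hypothetical zone-client abstraction (only the control flow matters here):

```go
package health

// Record and ZoneClient are hypothetical stand-ins for the operator's
// DNS provider abstraction.
type Record struct {
	Name  string
	Scope string // GEO code, or "" for global
}

type ZoneClient interface {
	ListRecords(scope string) ([]Record, error)
	DeleteRecord(Record) error
	PublishRecord(Record) error
}

// removeIfSafe removes an unhealthy record only when other records in
// the same scope can still answer, then re-validates: if a concurrent
// removal by another controller emptied the scope, it re-publishes its
// own record (regardless of health) so the zone never returns NXDomain.
func removeIfSafe(zone ZoneClient, rec Record) error {
	others, err := zone.ListRecords(rec.Scope)
	if err != nil {
		return err
	}
	if len(others) <= 1 {
		return nil // last record standing: keep it, even if unhealthy
	}
	if err := zone.DeleteRecord(rec); err != nil {
		return err
	}
	remaining, err := zone.ListRecords(rec.Scope)
	if err != nil {
		return err
	}
	if len(remaining) == 0 {
		return zone.PublishRecord(rec)
	}
	return nil
}
```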

Update our tests to include coverage of the health check probes.

Tradeoffs

  • We will not be able to report on the health of the workload from other geographical areas.
  • If the cluster goes away, the controller dies, or the controller is denied access to the zone, the unhealthy records will stay in the DNS response until manual intervention.
  • If all clusters but one are unhealthy, and the last healthy cluster is gracefully deleted, there will temporarily (until the next time an unhealthy cluster reconciles) be an empty zone.
  • If the controller's network configuration lets it reach the workload when the internet cannot, or vice versa, the health check probe will be inaccurate.
  • If the probes are failing to execute, or failing to update the probe CR, then all endpoints will be considered healthy.

Related Information

initial thoughts on health checks, and potential for cross-cluster health checks in the future: here

@philbrookes philbrookes changed the title on-cluster health checks Feature: on-cluster health checks May 28, 2024
@philbrookes philbrookes self-assigned this May 28, 2024
@maleck13 maleck13 added this to the kuadrant-v1 milestone May 31, 2024
@maleck13 maleck13 changed the title Feature: on-cluster health checks on-cluster health checks May 31, 2024
@philbrookes philbrookes added next and removed next labels Jun 13, 2024
@maleck13 maleck13 changed the title on-cluster health checks on-cluster health checks - [Epic] Jul 25, 2024
@philbrookes philbrookes removed their assignment Aug 1, 2024
@maleck13 maleck13 added the kind/epic Epic label Aug 22, 2024