
GSLB does not consider gateway or Ingress controller failure in multi-cluster setup #1754

Open
altieresfreitas opened this issue Oct 16, 2024 · 3 comments

Comments

@altieresfreitas
Contributor

In a multi-cluster setup, k8gb currently determines whether an IP should be added to DNS based on the health of application pods tied to a specified service. However, this does not account for the health of centralized routing components like gateways (e.g., Istio Gateway) or Ingress controllers.

During local testing, I observed that k8gb did not remove the IP for a data center where the gateway or Ingress controller was scaled down to zero. It appears that k8gb doesn’t currently consider the health of these routing components in its failover decisions.

Would it be possible to enhance k8gb to include health checks for both the gateway or Ingress controller and the application pods? This could improve the accuracy of failover responses in scenarios where the availability of these routing components directly impacts traffic distribution across clusters.
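
For illustration only, here is a minimal sketch of the kind of check described above, assuming a controller-runtime client; this is not k8gb's actual code, and the helper name appHasReadyEndpoints is hypothetical. A check like this looks only at the application's own endpoints, so an ingress controller or gateway scaled to zero goes unnoticed:

package health

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// appHasReadyEndpoints reports whether the application's service has at least one
// ready endpoint address; the health of the routing layer (Istio Gateway, Ingress
// controller) never enters this decision.
func appHasReadyEndpoints(ctx context.Context, c client.Client, svc types.NamespacedName) (bool, error) {
	var ep corev1.Endpoints
	if err := c.Get(ctx, svc, &ep); err != nil {
		return false, err
	}
	for _, subset := range ep.Subsets {
		if len(subset.Addresses) > 0 {
			return true, nil
		}
	}
	return false, nil
}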

@abaguas
Collaborator

abaguas commented Oct 16, 2024

Thank you for creating an issue, @altieresfreitas.
Indeed, the k8gb controller does not check the health of the ingress controller.
We discussed this in the past; surprisingly, I couldn't find an issue tracking it, but this is a feature we want to provide.

@altieresfreitas
Contributor Author

Thank you for the confirmation, @abaguas. I’m glad to hear that this feature aligns with previous discussions. I’d be interested in contributing to its development; could I be assigned to this issue? I’d be happy to start working on a PR to implement it.

@abaguas
Collaborator

abaguas commented Oct 17, 2024

That is great, we appreciate your initiative! Here is my implementation proposal:

  1. Create a new feature flag in Helm: k8gb.ingressController.healthCheck.enabled=false. You can then pass it to the controller; the logic is similar to clusterGeoTag. As it is a new feature, we would like users to have the option to toggle it.
  2. Extend the interface GslbReferenceResolver with a function to fetch the load balancer service:
type GslbReferenceResolver interface {
	// GetServers retrieves the GSLB server configuration
	GetServers() ([]*k8gbv1beta1.Server, error)
	// GetGslbExposedIPs retrieves the load balancer IP addresses of the GSLB
	GetGslbExposedIPs(utils.DNSList) ([]string, error)
	// GetLbService retrieves the service that exposes the ingress controller
	GetLbService() NamespacedName
}
  3. In the main loop of the controller, in gslb_controller_reconciliation.go, at the end of the // == Reference resolution == block, fetch the lbService from the refResolver and store it in gslb.Status.LoadBalancer.Service.
  4. In the same gslb_controller_reconciliation.go, at the start of the // == external-dns dnsendpoints CRs == block, call a new function that checks the health of the load balancer and stores it in gslb.Status.LoadBalancer.Status. Don't forget to guard it with the feature flag. The logic should be rather similar to what we have in getServiceHealthStatus.
  5. Finally, in the function getServiceHealthStatus, add the logic to determine whether the app is healthy, considering also the health of the ingress controller (see the sketch after this list).
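
To make steps 4 and 5 more concrete, here is a minimal sketch of the decision they describe. It is only an illustration: the function name clusterHealthy and the boolean inputs are assumptions, and the real code would work with k8gb's health status types rather than plain booleans.

// clusterHealthy sketches the combined check: a cluster's IP is only published
// when the application pods are healthy and, if the new feature flag is enabled,
// the service exposing the ingress controller is healthy as well.
func clusterHealthy(appHealthy, lbHealthy, lbCheckEnabled bool) bool {
	if !lbCheckEnabled {
		// Feature flag disabled: preserve today's behaviour, app health only.
		return appHealthy
	}
	return appHealthy && lbHealthy
}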

Overall the structure of the code is:

  • the refResolver is in charge of fetching data from Kubernetes resources, so that the rest of the code can stay generic and not care about the type of Ingress used. The refResolver knows nothing about business logic.
  • data fetched by the refResolver is stored in the status of the GSLB CRD (a rough sketch of this status follows below). This allows sharing of information between functions and gives visibility to the user.
  • the rest of the code is responsible for the business logic. At the end it writes the result back to the status of the GSLB CRD, again to share the results with the user and facilitate debugging.
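
As a purely illustrative sketch of the second bullet, the new piece of status could be shaped roughly like this; the type and field names are assumptions based on the gslb.Status.LoadBalancer.Service and gslb.Status.LoadBalancer.Status paths mentioned above, not the final CRD schema:

package status

import "k8s.io/apimachinery/pkg/types"

// LoadBalancerStatus holds both the data fetched by the refResolver (Service)
// and the result computed by the business logic (Status), so either can be
// inspected on the Gslb resource when debugging.
type LoadBalancerStatus struct {
	// Service identifies the service exposing the ingress controller,
	// filled in during reference resolution.
	Service types.NamespacedName `json:"service"`
	// Status records whether that service is considered healthy,
	// filled in by the health check guarded by the feature flag.
	Status string `json:"status"`
}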

If you have any questions or suggestions, let me know. Looking forward to your PR!
