
GSLB does not consider gateway or Ingress controller failure in multi-cluster setup #1754

Open
altieresfreitas opened this issue Oct 16, 2024 · 3 comments

Comments

@altieresfreitas
Contributor

In a multi-cluster setup, k8gb currently determines whether an IP should be added to DNS based on the health of application pods tied to a specified service. However, this does not account for the health of centralized routing components like gateways (e.g., Istio Gateway) or Ingress controllers.

During local testing, I observed that k8gb did not remove the IP for a data center where the gateway or Ingress controller was scaled down to zero. It appears that k8gb doesn’t currently consider the health of these routing components in its failover decisions.

Would it be possible to enhance k8gb to include health checks for both the gateway or Ingress controller and the application pods? This could improve the accuracy of failover responses in scenarios where the availability of these routing components directly impacts traffic distribution across clusters.
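
For illustration only, here is a minimal sketch of the kind of check described above, assuming a controller-runtime client; this is not k8gb's actual code, and the helper name appHasReadyEndpoints is hypothetical. A check like this looks only at the application's own endpoints, so an ingress controller or gateway scaled to zero goes unnoticed:

package health

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// appHasReadyEndpoints reports whether the application's service has at least one
// ready endpoint address; the health of the routing layer (Istio Gateway, Ingress
// controller) never enters this decision.
func appHasReadyEndpoints(ctx context.Context, c client.Client, svc types.NamespacedName) (bool, error) {
	var ep corev1.Endpoints
	if err := c.Get(ctx, svc, &ep); err != nil {
		return false, err
	}
	for _, subset := range ep.Subsets {
		if len(subset.Addresses) > 0 {
			return true, nil
		}
	}
	return false, nil
}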

@abaguas
Collaborator

abaguas commented Oct 16, 2024

Thank you for creating an issue, @altieresfreitas.
Indeed, the k8gb controller does not check the health of the ingress controller.
We discussed this in the past; surprisingly, I couldn't find an issue tracking it, but this is a feature we want to provide.

@altieresfreitas
Contributor Author

Thank you for the confirmation, @abaguas. I’m glad to hear that this feature aligns with previous discussions. I’d be interested in contributing to its development; could I be assigned to this issue? I’d be happy to start working on a PR to implement it.

@abaguas
Collaborator

abaguas commented Oct 17, 2024

That is great, we appreciate your initiative! Here is my implementation proposal:

  1. Create a new feature flag in Helm: k8gb.ingressController.healthCheck.enabled=false. You can then pass it to the controller; the logic is similar to clusterGeoTag. As it is a new feature, we would like users to have the option to toggle it.
  2. Extend the interface GslbReferenceResolver with a function to fetch the load balancer service:
type GslbReferenceResolver interface {
	// GetServers retrieves the GSLB server configuration
	GetServers() ([]*k8gbv1beta1.Server, error)
	// GetGslbExposedIPs retrieves the load balancer IP addresses of the GSLB
	GetGslbExposedIPs(utils.DNSList) ([]string, error)
	// GetLbService retrieves the service that exposes the ingress controller
	GetLbService() NamespacedName
}
  3. In the main loop of the controller, in gslb_controller_reconciliation.go, at the end of the // == Reference resolution == block, fetch the lbService from the refResolver and store it in gslb.Status.LoadBalancer.Service.
  4. In the same gslb_controller_reconciliation.go, at the start of the // == external-dns dnsendpoints CRs == block, call a new function that checks the health of the load balancer and stores it in gslb.Status.LoadBalancer.Status. Don't forget to guard it with the feature flag. The logic should be rather similar to what we have in getServiceHealthStatus.
  5. Finally, in the function getServiceHealthStatus, add the logic to determine whether the app is healthy, considering also the health of the ingress controller (see the sketch after this list).
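
To make steps 4 and 5 more concrete, here is a minimal sketch of the decision they describe. It is only an illustration: the function name clusterHealthy and the boolean inputs are assumptions, and the real code would work with k8gb's health status types rather than plain booleans.

// clusterHealthy sketches the combined check: a cluster's IP is only published
// when the application pods are healthy and, if the new feature flag is enabled,
// the service exposing the ingress controller is healthy as well.
func clusterHealthy(appHealthy, lbHealthy, lbCheckEnabled bool) bool {
	if !lbCheckEnabled {
		// Feature flag disabled: preserve today's behaviour, app health only.
		return appHealthy
	}
	return appHealthy && lbHealthy
}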

Overall the structure of the code is:

  • the refResolver is in charge of fetching data from Kubernetes resources, so that the rest of the code can stay generic and not care about the type of Ingress used. The refResolver knows nothing about business logic.
  • data fetched by the refResolver is stored in the status of the GSLB CRD (a rough sketch of this status follows below). This allows sharing of information between functions and gives visibility to the user.
  • the rest of the code is responsible for the business logic. At the end it writes the result back to the status of the GSLB CRD, again to share the results with the user and facilitate debugging.
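
As a purely illustrative sketch of the second bullet, the new piece of status could be shaped roughly like this; the type and field names are assumptions based on the gslb.Status.LoadBalancer.Service and gslb.Status.LoadBalancer.Status paths mentioned above, not the final CRD schema:

package status

import "k8s.io/apimachinery/pkg/types"

// LoadBalancerStatus holds both the data fetched by the refResolver (Service)
// and the result computed by the business logic (Status), so either can be
// inspected on the Gslb resource when debugging.
type LoadBalancerStatus struct {
	// Service identifies the service exposing the ingress controller,
	// filled in during reference resolution.
	Service types.NamespacedName `json:"service"`
	// Status records whether that service is considered healthy,
	// filled in by the health check guarded by the feature flag.
	Status string `json:"status"`
}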

If you have any questions or suggestions, let me know. Looking forward to your PR!
