HA: With "cluster to cluster heartbeats" to identify the cluster down and update the lighthouse DNS cache as soon as possible #821
Comments
Thanks for reporting the issue @ming-ddtechcg
I've added some more labels; I think this should also be considered a bug.
* HealthCheck Ip is added in EndpointSpec
* Latency is added in GatewayStatus
Fixes: submariner-io#821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
@aswinsuryan before we move this to done, what's the test coverage for this? Is it part of our E2E? Also, what about docs?
Oh, it was auto-closed when the API patch merged. The implementation, UT, and E2E are yet to be added. I will raise two separate issues for docs and E2E.
* Populate the healthchecker Ip
* Ping the healthchecker IP to check the remote gateway status
* Update the gateway status if the ping fails
Fixes: submariner-io#821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
* Add health checker implementation
* Populate the healthchecker Ip
* Ping the healthchecker IP to check the remote gateway status
* Update the gateway status if the ping fails
Fixes: #821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
Co-authored-by: Thomas Pantelis <[email protected]>
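The three steps listed in this commit message (take each remote endpoint's health check IP, ping it, and update the gateway status when the ping fails) can be sketched roughly as follows. This is only an illustration in Python, not the Go code from the actual patch: the function `check_gateways`, the `PingFn` type, and the status strings and dictionary keys are all hypothetical names chosen for the example, and the ping itself is injected so that no real ICMP traffic is needed.

```python
from typing import Callable, Dict, Optional

# Hypothetical ping signature: returns the round-trip time in milliseconds,
# or None when the ping fails. A real checker would send ICMP echo requests.
PingFn = Callable[[str], Optional[float]]


def check_gateways(health_check_ips: Dict[str, str], ping: PingFn) -> Dict[str, dict]:
    """Ping each remote cluster's health check IP and derive a status entry.

    `health_check_ips` maps a cluster ID to the IP to ping. The returned
    status strings ("connected" / "error") are illustrative only.
    """
    statuses = {}
    for cluster_id, ip in health_check_ips.items():
        rtt = ping(ip)
        if rtt is None:
            # Ping failed: mark the remote gateway as unreachable.
            statuses[cluster_id] = {"status": "error", "latency_ms": None}
        else:
            # Ping succeeded: record the measured latency.
            statuses[cluster_id] = {"status": "connected", "latency_ms": rtt}
    return statuses


# Example with a fake ping: cluster-a is down, cluster-b answers in 1.5 ms.
def fake_ping(ip: str) -> Optional[float]:
    return None if ip == "10.0.0.1" else 1.5


result = check_gateways({"cluster-a": "10.0.0.1", "cluster-b": "10.0.0.2"}, fake_ping)
```

Injecting the ping function keeps the status logic testable on its own; how the real health checker schedules pings and stores the result is not shown in this thread.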
@aswinsuryan Isn't this ready in 0.8-rc1 already?
Ah, just waiting for verification?
Yes, we want @manosnoam to review and verify.
@nyechiel @aswinsuryan
* HealthCheck Ip is added in EndpointSpec
* Latency is added in GatewayStatus
Fixes: submariner-io/submariner#821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
* Add health checker implementation
* Populate the healthchecker Ip
* Ping the healthchecker IP to check the remote gateway status
* Update the gateway status if the ping fails
Fixes: submariner-io/submariner#821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
Co-authored-by: Thomas Pantelis <[email protected]>
I am running v0.6.1 on two OpenShift 4.4.20 clusters, "cluster A" and "cluster B", which are joined together with "--cable-driver libreswan".
I deployed the nginx-unprivileged test case from "KIND (Local Environment)" to both clusters and exposed the services with kubectl and subctl. Running nslookup on the service FQDN from the "tmp-shell" pod returns the two service IP addresses in round-robin order.
The test scenario is to shut down cluster A (I brought down all of its worker nodes) and expect the service to be routed to cluster B only:
In this test, it took approximately 25 minutes before only the cluster B service IP was returned (the timing varied somewhat between runs).
This is the cluster B Gateway describe output, taken 10 minutes after cluster A went down:
Recommendation:
Ideally, the Lighthouse DNS would remove the service IP address of a down cluster within 5 to 10 seconds (assuming a good networking environment).
Or, we leave it to users to configure:
Define parameters (--ha-heartbeat-interval x --ha-heartbeat-failure-count y) on subctl join that determine when a heartbeat ping failure between clusters is declared. For example:
--ha-heartbeat-interval
How often, every x seconds, a healthy cluster sends a heartbeat to the other joined clusters. The range is 1 second <= x <= 600 seconds (default is 30 seconds).
--ha-heartbeat-failure-count
The number of consecutive heartbeat failures, y, after which the connection to the specified cluster is declared lost. The range is 3 <= y <= 50 (default is 5).
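To make the interaction between the two proposed parameters concrete, here is a minimal Python sketch. It assumes the flags behave as described above; `HeartbeatConfig` and `HeartbeatTracker` are hypothetical names invented for this illustration and are not part of subctl or Submariner. The sketch validates the proposed ranges, estimates how long it takes to declare a cluster down (roughly interval times failure count), and counts consecutive failures per cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatConfig:
    """Hypothetical settings mirroring the two proposed flags."""
    interval_seconds: int = 30  # proposed --ha-heartbeat-interval default
    failure_count: int = 5      # proposed --ha-heartbeat-failure-count default

    def __post_init__(self):
        # Enforce the ranges suggested in the proposal.
        if not 1 <= self.interval_seconds <= 600:
            raise ValueError("interval must be between 1 and 600 seconds")
        if not 3 <= self.failure_count <= 50:
            raise ValueError("failure count must be between 3 and 50")

    def detection_seconds(self) -> int:
        """Approximate time until a cluster is declared down: a full run of
        `failure_count` consecutive heartbeats, one per interval, must fail."""
        return self.interval_seconds * self.failure_count


class HeartbeatTracker:
    """Counts consecutive heartbeat failures per remote cluster."""

    def __init__(self, config: HeartbeatConfig):
        self.config = config
        self._failures = {}

    def record(self, cluster_id: str, success: bool) -> bool:
        """Record one heartbeat result; return True while the cluster is
        still considered up, False once the failure threshold is reached."""
        if success:
            self._failures[cluster_id] = 0  # any success resets the run
        else:
            self._failures[cluster_id] = self._failures.get(cluster_id, 0) + 1
        return self._failures[cluster_id] < self.config.failure_count
```

Under these assumptions the proposed defaults (30 seconds, 5 failures) would declare a cluster down after roughly 150 seconds. Reaching the 5 to 10 second target would need something like a 1 or 2 second interval with a failure count of 5.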