HA: With "cluster to cluster heartbeats" to identify the cluster down and update the lighthouse DNS cache as soon as possible #821

Closed
ming-ddtechcg opened this issue Sep 18, 2020 · 8 comments · Fixed by #902, #938 or #955
Labels: bug, datapath, enhancement, QE, service-discovery

@ming-ddtechcg

I am running v0.6.1 on two OpenShift 4.4.20 clusters that are joined together with "--cable-driver libreswan". Call them "cluster A" and "cluster B".

I deployed the nginx-unprivileged test case from "KIND (Local Environment)" to both clusters and exposed the services with kubectl and subctl. Running nslookup on the service FQDN from the "tmp-shell" pod returns the two service IP addresses in round-robin fashion.

The test scenario is to shut down cluster A (I bring down all worker nodes) so that the service is routed to cluster B only.

In testing, it took approximately 25 minutes before only the cluster B service IP was returned (the timing varied somewhat between runs).
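
For anyone wanting to reproduce the timing, here is a minimal sketch of how the failover delay could be measured from a client pod. It is not part of Submariner or Lighthouse; `resolve_ips` and `wait_until_ip_removed` are hypothetical helper names. Because answers may rotate round-robin, the sketch assumes that a single lookup without the stale IP proves nothing, so it waits for several consecutive clean lookups.

```python
import socket
import time


def resolve_ips(fqdn):
    """Return the set of IPv4 addresses the local resolver gives for fqdn."""
    _, _, addresses = socket.gethostbyname_ex(fqdn)
    return set(addresses)


def wait_until_ip_removed(fqdn, stale_ip, resolve=resolve_ips,
                          clean_lookups=10, poll_seconds=1.0,
                          timeout_seconds=3600.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll DNS until stale_ip has been absent for clean_lookups lookups in a row.

    Returns the elapsed seconds, or None if the timeout expires first.
    """
    start = clock()
    consecutive_clean = 0
    while clock() - start < timeout_seconds:
        if stale_ip in resolve(fqdn):
            # The down cluster's IP is still being served; start counting again.
            consecutive_clean = 0
        else:
            consecutive_clean += 1
            if consecutive_clean >= clean_lookups:
                return clock() - start
        sleep(poll_seconds)
    return None
```

Start it right after shutting down cluster A, passing the same service FQDN used with nslookup and the cluster A service IP as `stale_ip`.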

Here is the output of describing the cluster B Gateway, taken 10 minutes after cluster A went down:

$ kubectl describe Gateway -n submariner-operator  
Name:         clusterb-62kfm-worker-0-2pvhw
Namespace:    submariner-operator
Labels:       <none>
Annotations:  update-timestamp: 1600352166
API Version:  submariner.io/v1
Kind:         Gateway
Metadata:
  Creation Timestamp:  2020-09-17T13:06:49Z
  Generation:          2
  Resource Version:    5746231
  Self Link:           /apis/submariner.io/v1/namespaces/submariner-operator/gateways/clusterb-62kfm-worker-0-2pvhw
  UID:                 0886e928-8706-4103-a3cb-5cf1a8390e51
Status:
  Connections:
    Endpoint:
      Backend:      libreswan
      cable_name:   submariner-cable-cluster-clustera-10-5-33-224
      cluster_id:   cluster-a
      Hostname:     worker0.clustera.test.com
      nat_enabled:  false
      private_ip:   10.5.33.224
      public_ip:    
      Subnets:
        172.30.0.0/16
        10.128.0.0/16
    Status:          connected
    Status Message:  
  Ha Status:         active
  Local Endpoint:
    Backend:      libreswan
    cable_name:   submariner-cable-cluster-b-10-5-71-225
    cluster_id:   cluster-c16
    Hostname:     clusterb-62kfm-worker-0-2pvhw
    nat_enabled:  false
    private_ip:   10.5.71.225
    public_ip:    
    Subnets:
      172.31.0.0/16
      10.132.0.0/14
  Status Failure:  
  Version:         v0.6.0
Events:            <none>

Recommendation:

Ideally, Lighthouse DNS would remove the service IP address of the down cluster within 5 to 10 seconds (assuming a good networking environment).

Alternatively, leave it to users to configure:

Add parameters to subctl join (--ha-heartbeat-interval x --ha-heartbeat-failure-count y) that define when the heartbeat ping between clusters is declared failed. For example:

--ha-heartbeat-interval
How often, in seconds, a healthy cluster sends a heartbeat to the other joined clusters. Range: 1 <= x <= 600 (default: 30).

--ha-heartbeat-failure-count
The number of consecutive heartbeat failures after which the connection to the given cluster is declared lost. Range: 3 <= y <= 50 (default: 5).
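
To make the proposal concrete, here is a minimal sketch of how the two suggested parameters could interact. It is only an illustration of the idea in this issue, not Submariner's implementation: the class `HeartbeatMonitor` and its methods are hypothetical, and the flags themselves are just a suggestion. A peer is treated as down after `failure_count` consecutive missed heartbeats, so detection takes roughly the interval multiplied by the failure count.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical model of the proposed --ha-heartbeat-* parameters."""

    interval_seconds: int = 30  # proposed default for --ha-heartbeat-interval
    failure_count: int = 5      # proposed default for --ha-heartbeat-failure-count
    _failures: dict = field(default_factory=dict, init=False, repr=False)

    def __post_init__(self):
        # Ranges proposed in this issue: 1 <= x <= 600 and 3 <= y <= 50.
        if not 1 <= self.interval_seconds <= 600:
            raise ValueError("interval must be between 1 and 600 seconds")
        if not 3 <= self.failure_count <= 50:
            raise ValueError("failure count must be between 3 and 50")

    def record(self, cluster_id: str, heartbeat_ok: bool) -> bool:
        """Record one heartbeat result; return True if the cluster counts as down."""
        if heartbeat_ok:
            # Any successful heartbeat resets the consecutive-failure counter.
            self._failures[cluster_id] = 0
        else:
            self._failures[cluster_id] = self._failures.get(cluster_id, 0) + 1
        return self._failures[cluster_id] >= self.failure_count

    def detection_time_seconds(self) -> int:
        """Approximate time to declare a peer down once it stops answering."""
        return self.interval_seconds * self.failure_count
```

With the proposed defaults (30 seconds, 5 failures) detection would take about 150 seconds. Reaching the 5 to 10 second target would need something like a 2 second interval with 3 failures, which gives 6 seconds.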

ming-ddtechcg added the enhancement label Sep 18, 2020
@sridhargaddam
Member

Thanks for reporting the issue @ming-ddtechcg

mangelajo added the bug, datapath and service-discovery labels Sep 25, 2020
@mangelajo
Contributor

I've added some more labels; I think this should also be considered a bug.

aswinsuryan added a commit to aswinsuryan/submariner that referenced this issue Oct 27, 2020
*HealthCheck Ip is added in EndpointSpec
*Latency is added in GatewayStatus

Fixes: submariner-io#821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
aswinsuryan added a commit to aswinsuryan/submariner that referenced this issue Nov 6, 2020
*HealthCheck Ip is added in EndpointSpec
*Latency is added in GatewayStatus

Fixes: submariner-io#821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
tpantelis pushed a commit that referenced this issue Nov 10, 2020
*HealthCheck Ip is added in EndpointSpec
*Latency is added in GatewayStatus

Fixes: #821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
@nyechiel
Member

nyechiel commented Nov 10, 2020

@aswinsuryan before we move this to done, what's the test coverage for this? Is it part of our E2E? Also, what about docs?

@aswinsuryan
Contributor

> @aswinsuryan before we move this to done, what's the test coverage for this? Is it part of our E2E? Also, what about docs?

Oh, it was auto-closed when the API patch was merged. The implementation, unit tests and E2E tests are yet to be added. I will raise two separate issues for docs and E2E.

aswinsuryan added a commit to aswinsuryan/submariner that referenced this issue Nov 11, 2020
*Populate the healthchecker Ip
*Ping the healthchecker IP to check the remote
gateway status
*Update the gateway status if the ping fails

Fixes: submariner-io#821

Signed-off-by: Aswin Surayanarayanan <[email protected]>
aswinsuryan added a commit to aswinsuryan/submariner that referenced this issue Nov 12, 2020
*Populate the healthchecker Ip
*Ping the healthchecker IP to check the remote
gateway status
*Update the gateway status if the ping fails

Fixes: submariner-io#821

Signed-off-by: Aswin Surayanarayanan <[email protected]>
aswinsuryan reopened this Nov 19, 2020
aswinsuryan added a commit to aswinsuryan/submariner that referenced this issue Nov 23, 2020
*Populate the healthchecker Ip
*Ping the healthchecker IP to check the remote
gateway status
*Update the gateway status if the ping fails

Fixes: submariner-io#821

Signed-off-by: Aswin Surayanarayanan <[email protected]>
aswinsuryan added a commit to aswinsuryan/submariner that referenced this issue Nov 24, 2020
*Populate the healthchecker Ip
*Ping the healthchecker IP to check the remote
gateway status
*Update the gateway status if the ping fails

Fixes: submariner-io#821

Signed-off-by: Aswin Surayanarayanan <[email protected]>
aswinsuryan added a commit that referenced this issue Nov 26, 2020
* Add health checker implementation

*Populate the healthchecker Ip
*Ping the healthchecker IP to check the remote
gateway status
*Update the gateway status if the ping fails

Fixes: #821

Signed-off-by: Aswin Surayanarayanan <[email protected]>
Co-authored-by: Thomas Pantelis <[email protected]>
aswinsuryan reopened this Nov 26, 2020
aswinsuryan added the QE label Nov 26, 2020
aswinsuryan linked a pull request Nov 26, 2020 that will close this issue
@mangelajo
Contributor

@aswinsuryan Isn't this ready in 0.8-rc1 already?

mangelajo added this to the 0.8.0 milestone Dec 16, 2020
@mangelajo
Contributor

ah, just waiting for verification?

@nyechiel
Member

> ah, just waiting for verification?

Yes, we want @manosnoam to review and verify.

@manosnoam
Contributor

@nyechiel @aswinsuryan
I've opened an E2E test request for this feature: #1041

novad03 pushed a commit to novad03/k8s-submariner that referenced this issue Nov 25, 2023
*HealthCheck Ip is added in EndpointSpec
*Latency is added in GatewayStatus

Fixes: submariner-io/submariner#821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
novad03 pushed a commit to novad03/k8s-submariner that referenced this issue Nov 25, 2023
* Add health checker implementation

*Populate the healthchecker Ip
*Ping the healthchecker IP to check the remote
gateway status
*Update the gateway status if the ping fails

Fixes: submariner-io/submariner#821

Signed-off-by: Aswin Surayanarayanan <[email protected]>
Co-authored-by: Thomas Pantelis <[email protected]>