HA: With "cluster to cluster heartbeats" to identify the cluster down and update the lighthouse DNS cache as soon as possible #821

ming-ddtechcg · 2020-09-18T17:59:49Z

I am running v0.6.1 on two OpenShift 4.4.20 clusters, "cluster A" and "cluster B", which were joined with "--cable-driver libreswan".

I deployed the nginx-unprivileged test case from "KIND (Local Environment)" to both clusters and exposed the services with kubectl and subctl. Running nslookup on the service FQDN from the "tmp-shell" pod returns the two service IP addresses in round-robin order.

The test scenario is to shut down cluster A (I brought down all of its worker nodes) so that the service is routed to cluster B only.

In testing, it took approximately 25 minutes before only the cluster B service IP was returned (the timing varied somewhat between runs).

This is the cluster B Gateway description 10 minutes after cluster A went down:

$ kubectl describe Gateway -n submariner-operator  
Name:         clusterb-62kfm-worker-0-2pvhw
Namespace:    submariner-operator
Labels:       <none>
Annotations:  update-timestamp: 1600352166
API Version:  submariner.io/v1
Kind:         Gateway
Metadata:
  Creation Timestamp:  2020-09-17T13:06:49Z
  Generation:          2
  Resource Version:    5746231
  Self Link:           /apis/submariner.io/v1/namespaces/submariner-operator/gateways/clusterb-62kfm-worker-0-2pvhw
  UID:                 0886e928-8706-4103-a3cb-5cf1a8390e51
Status:
  Connections:
    Endpoint:
      Backend:      libreswan
      cable_name:   submariner-cable-cluster-clustera-10-5-33-224
      cluster_id:   cluster-a
      Hostname:     worker0.clustera.test.com
      nat_enabled:  false
      private_ip:   10.5.33.224
      public_ip:    
      Subnets:
        172.30.0.0/16
        10.128.0.0/16
    Status:          connected
    Status Message:  
  Ha Status:         active
  Local Endpoint:
    Backend:      libreswan
    cable_name:   submariner-cable-cluster-b-10-5-71-225
    cluster_id:   cluster-c16
    Hostname:     clusterb-62kfm-worker-0-2pvhw
    nat_enabled:  false
    private_ip:   10.5.71.225
    public_ip:    
    Subnets:
      172.31.0.0/16
      10.132.0.0/14
  Status Failure:  
  Version:         v0.6.0
Events:            <none>
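
Note that the output above still reports the cluster-a connection as "connected". A minimal sketch of the behavior being asked for, assuming a DNS responder that can see a per-cluster connection status: it answers only with service IPs from the local cluster or from clusters currently reported as connected. The function names, the "error" status value, and the service IPs are hypothetical and only illustrate the idea; this is not the Lighthouse implementation.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def answerable_ips(service_ips: Dict[str, str],
                   connection_status: Dict[str, str],
                   local_cluster: str) -> List[str]:
    """Return service IPs whose cluster is local or reported as connected."""
    return [
        ip for cluster, ip in service_ips.items()
        if cluster == local_cluster
        or connection_status.get(cluster) == "connected"
    ]


def round_robin(ips: List[str]) -> Iterator[str]:
    """Cycle through the remaining IPs, mimicking round-robin DNS answers."""
    return cycle(ips)


# Made-up service IPs, one per cluster (hypothetical values).
service_ips = {"cluster-a": "172.30.10.5", "cluster-b": "172.31.10.5"}

# Status as seen from cluster B once cluster A has been detected as down.
status = {"cluster-a": "error"}

ips = answerable_ips(service_ips, status, local_cluster="cluster-b")
print(ips)  # ['172.31.10.5']

answers = round_robin(ips)
print(next(answers), next(answers))  # 172.31.10.5 172.31.10.5
```

The filtering is only as fast as the status update feeding it, which is why the detection time discussed below matters.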

Recommendation:

Ideally, the Lighthouse DNS would remove the service IP address of the down cluster within 5 to 10 seconds (assuming a good networking environment).

Alternatively, we could leave it to users to configure:

Define parameters (--ha-heartbeat-interval x --ha-heartbeat-failure-count y) on subctl join that determine when a heartbeat ping failure between clusters is declared. For example:

--ha-heartbeat-interval
How often, in seconds, a healthy cluster sends a heartbeat to the other joined clusters. The range is 1 second <= x <= 600 seconds (default is 30 seconds).

--ha-heartbeat-failure-count
The number of consecutive heartbeat failures after which the connection to the specified cluster is declared lost. The range is 3 <= y <= 50 (default is 5).
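
To make the proposal concrete, here is a minimal sketch of the two proposed settings with the ranges and defaults given above. The flags are only a proposal in this issue, not existing subctl options, and the `HeartbeatConfig` class and its `detection_seconds` method are hypothetical names. The detection time is approximated as interval times failure count, ignoring probe timeouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatConfig:
    """Hypothetical container for the two proposed settings."""
    interval_seconds: int = 30  # proposed default for --ha-heartbeat-interval
    failure_count: int = 5      # proposed default for --ha-heartbeat-failure-count

    def __post_init__(self) -> None:
        # The ranges are the ones proposed in the issue text.
        if not 1 <= self.interval_seconds <= 600:
            raise ValueError("interval must be between 1 and 600 seconds")
        if not 3 <= self.failure_count <= 50:
            raise ValueError("failure count must be between 3 and 50")

    def detection_seconds(self) -> int:
        """Approximate time to declare a cluster down:
        failure_count consecutive misses, interval_seconds apart."""
        return self.interval_seconds * self.failure_count


defaults = HeartbeatConfig()
fastest = HeartbeatConfig(interval_seconds=1, failure_count=3)
print(defaults.detection_seconds())  # 150
print(fastest.detection_seconds())   # 3
```

With the proposed defaults, detection would take about 150 seconds, so reaching the 5 to 10 second target would need something like a 2 second interval with a failure count of 3 to 5.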


sridhargaddam · 2020-09-19T15:23:02Z

Thanks for reporting the issue @ming-ddtechcg

mangelajo · 2020-09-25T10:07:32Z

I've added some more labels; I think this should also be considered a bug.

* HealthCheck Ip is added in EndpointSpec
* Latency is added in GatewayStatus

Fixes: submariner-io#821
Signed-off-by: Aswin Surayanarayanan <[email protected]>


nyechiel · 2020-11-10T13:05:27Z

@aswinsuryan before we move this to done, what's the test coverage for this? Is it part of our E2E? Also, what about docs?

aswinsuryan · 2020-11-10T13:10:00Z

> @aswinsuryan before we move this to done, what's the test coverage for this? Is it part of our E2E? Also, what about docs?

Oh, it was auto-closed when the API patch was merged. The implementation, UT, and E2E are yet to be added. I will raise two separate issues for docs and E2E.


* Add health checker implementation
* Populate the healthchecker IP
* Ping the healthchecker IP to check the remote gateway status
* Update the gateway status if the ping fails

Fixes: #821
Signed-off-by: Aswin Surayanarayanan <[email protected]>
Co-authored-by: Thomas Pantelis <[email protected]>
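
The steps in this commit message can be sketched roughly as follows. This is a simplified illustration, not the code merged in #938: the class and field names, the "error" status value, and the consecutive-failure threshold are assumptions. The probe is passed in as a function so the example runs without sending real ICMP packets.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RemoteGatewayHealth:
    """Hypothetical per-endpoint state; field names are illustrative."""
    health_check_ip: str
    status: str = "connected"
    consecutive_failures: int = 0
    last_rtt_ms: Optional[float] = None


class HealthChecker:
    def __init__(self, ping: Callable[[str], Optional[float]],
                 max_failures: int = 5) -> None:
        # ping(ip) returns the round-trip time in ms, or None if the probe fails.
        self._ping = ping
        # Assumed threshold; the commit message does not state one.
        self._max_failures = max_failures

    def check(self, gw: RemoteGatewayHealth) -> RemoteGatewayHealth:
        """Probe the remote health check IP once and update the status."""
        rtt = self._ping(gw.health_check_ip)
        if rtt is None:
            gw.consecutive_failures += 1
            if gw.consecutive_failures >= self._max_failures:
                gw.status = "error"
        else:
            gw.consecutive_failures = 0
            gw.last_rtt_ms = rtt
            gw.status = "connected"
        return gw


# A fake probe that always fails stands in for a real ping to a dead cluster.
checker = HealthChecker(ping=lambda ip: None, max_failures=3)
gateway = RemoteGatewayHealth(health_check_ip="10.5.33.224")  # placeholder IP
for _ in range(3):
    checker.check(gateway)
print(gateway.status)  # error
```

Requiring several consecutive failures before changing the status avoids flapping on a single lost packet, at the cost of slower detection.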

mangelajo · 2020-12-16T16:38:05Z

@aswinsuryan Isn't this ready in 0.8-rc1 already?

mangelajo · 2020-12-16T16:38:26Z

ah, just waiting for verification?

nyechiel · 2020-12-16T20:01:55Z

> ah, just waiting for verification?

Yes, we want @manosnoam to review and verify.

manosnoam · 2020-12-17T10:48:09Z

@nyechiel @aswinsuryan
I've opened an E2E test request for this feature: #1041



ming-ddtechcg added the enhancement label Sep 18, 2020

mangelajo added the bug, datapath, and service-discovery labels Sep 25, 2020

nyechiel assigned aswinsuryan Oct 14, 2020

aswinsuryan added a commit to aswinsuryan/submariner that referenced this issue Oct 27, 2020

Add fields for HealthCheck

1d2bf0c

*HealthCheck Ip is added in EndpointSpec *Latency is added in GatewayStatus Fixes: submariner-io#821 Signed-off-by: Aswin Surayanarayanan <[email protected]>

aswinsuryan mentioned this issue Oct 27, 2020

Add fields for HealthCheck #902

Merged

aswinsuryan added a commit to aswinsuryan/submariner that referenced this issue Nov 6, 2020

Add fields for HealthCheck

664ee69

*HealthCheck Ip is added in EndpointSpec *Latency is added in GatewayStatus Fixes: submariner-io#821 Signed-off-by: Aswin Surayanarayanan <[email protected]>

tpantelis closed this as completed in #902 Nov 10, 2020

tpantelis pushed a commit that referenced this issue Nov 10, 2020

Add fields for HealthCheck

365f238

*HealthCheck Ip is added in EndpointSpec *Latency is added in GatewayStatus Fixes: #821 Signed-off-by: Aswin Surayanarayanan <[email protected]>

This was referenced Nov 11, 2020

E2e : HA: With "cluster to cluster heartbeats" to identify the cluster down #937

Closed

Doc update to include the health checker details submariner-io/submariner-website#348

Closed

aswinsuryan mentioned this issue Nov 11, 2020

Add health checker implementation #938

Merged

aswinsuryan reopened this Nov 19, 2020

aswinsuryan closed this as completed in #938 Nov 26, 2020

aswinsuryan reopened this Nov 26, 2020

aswinsuryan added the QE label Nov 26, 2020

aswinsuryan linked a pull request Nov 26, 2020 that will close this issue

Change the statistics type from float64 to uint64 #955

Merged

mangelajo added this to the 0.8.0 milestone Dec 16, 2020

manosnoam mentioned this issue Dec 17, 2020

[E2E] Test "cluster to cluster heartbeats" updates the lighthouse DNS when cluster is down #1041

Closed

manosnoam closed this as completed Dec 17, 2020

novad03 pushed a commit to novad03/k8s-submariner that referenced this issue Nov 25, 2023

Add fields for HealthCheck

6a81439

*HealthCheck Ip is added in EndpointSpec *Latency is added in GatewayStatus Fixes: submariner-io/submariner#821 Signed-off-by: Aswin Surayanarayanan <[email protected]>
