Refactor Common Area and remove leader election #3964

Merged: 1 commit, Jul 18, 2022

Contributor

@luolanzone luolanzone commented Jul 4, 2022

  1. Antrea Multi-cluster will support only one leader cluster. There is
    no need to set up a CommonArea manager and do leader election, so clean up
    the remote CommonArea manager related code and refactor the member ClusterSet
    reconciler to handle connection/disconnection cases between member and
    leader clusters.
  2. Remove the unused ClusterSet webhook.
  3. Add schema to validate the number of leader clusters in the ClusterSet CR.

Signed-off-by: Lan Luo [email protected]
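
Item 3 of the description can be illustrated with a minimal sketch. It assumes the limit is expressed with kubebuilder validation markers on the leader list; the thread does not show the actual schema change, so the type and field names below are simplified and hypothetical, and `validateLeaders` only restates the same rule in plain Go.

```go
package main

import "fmt"

// MemberCluster is a simplified, hypothetical stand-in for a ClusterSet entry.
type MemberCluster struct {
	ClusterID string
	Secret    string
}

// ClusterSetSpec is a hypothetical, trimmed-down spec, not the real Antrea type.
type ClusterSetSpec struct {
	// With controller-gen, these markers become minItems/maxItems in the
	// generated CRD schema, so the API server rejects zero or several leaders.
	// +kubebuilder:validation:MinItems=1
	// +kubebuilder:validation:MaxItems=1
	Leaders []MemberCluster `json:"leaders"`
	Members []MemberCluster `json:"members,omitempty"`
}

// validateLeaders restates the schema rule in plain Go, for illustration only.
func validateLeaders(spec ClusterSetSpec) error {
	if n := len(spec.Leaders); n != 1 {
		return fmt.Errorf("ClusterSet must have exactly one leader cluster, got %d", n)
	}
	return nil
}

func main() {
	ok := ClusterSetSpec{Leaders: []MemberCluster{{ClusterID: "leader-1", Secret: "leader-token"}}}
	bad := ClusterSetSpec{Leaders: []MemberCluster{{ClusterID: "a"}, {ClusterID: "b"}}}
	fmt.Println(validateLeaders(ok))
	fmt.Println(validateLeaders(bad))
}
```

Putting the limit in the CRD schema means an invalid ClusterSet is rejected at admission time, which fits with removing the unused webhook.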

@luolanzone
Contributor Author

/test-multicluster-e2e

@luolanzone luolanzone requested a review from jianjuns July 4, 2022 06:33
@luolanzone luolanzone added the area/multi-cluster Issues or PRs related to multi cluster. label Jul 4, 2022
@codecov-commenter

codecov-commenter commented Jul 4, 2022

Codecov Report

Merging #3964 (9a0f51a) into main (1b4d55e) will increase coverage by 1.56%.
The diff coverage is 50.93%.

❗ Current head 9a0f51a differs from pull request most recent head 9ac61cf. Consider uploading reports for the commit 9ac61cf to get more accurate results

@@            Coverage Diff             @@
##             main    #3964      +/-   ##
==========================================
+ Coverage   64.34%   65.90%   +1.56%     
==========================================
  Files         294      308      +14     
  Lines       43634    44090     +456     
==========================================
+ Hits        28076    29058     +982     
+ Misses      13282    12705     -577     
- Partials     2276     2327      +51     
Flag Coverage Δ *Carryforward flag
e2e-tests 60.67% <69.79%> (?)
kind-e2e-tests 51.23% <38.03%> (+0.70%) ⬆️ Carriedforward from 23ef6f7
unit-tests 44.28% <15.80%> (-0.06%) ⬇️ Carriedforward from 23ef6f7

*This pull request uses carry forward flags.

Impacted Files Coverage Δ
...ter/apis/multicluster/v1alpha1/clusterset_types.go 100.00% <ø> (ø)
...icluster/cmd/multicluster-controller/controller.go 55.17% <ø> (+46.83%) ⬆️
pkg/agent/agent_linux.go 5.36% <0.00%> (+0.22%) ⬆️
pkg/agent/cniserver/pod_configuration.go 54.59% <0.00%> (ø)
pkg/agent/proxy/endpoints.go 78.57% <ø> (ø)
pkg/agent/util/net.go 48.99% <0.00%> (-3.39%) ⬇️
pkg/agent/util/net_linux.go 29.12% <0.00%> (-5.08%) ⬇️
pkg/features/antrea_features.go 11.11% <ø> (ø)
pkg/ovs/ovsconfig/ovs_client.go 46.77% <25.92%> (-1.03%) ⬇️
pkg/agent/route/route_linux.go 47.11% <36.84%> (-2.56%) ⬇️
... and 78 more

@luolanzone luolanzone force-pushed the remove-leader-election branch from f2b291e to 5336ee7 Compare July 4, 2022 07:08
@luolanzone
Contributor Author

/test-multicluster-e2e
/test-multicluster-integration

@luolanzone
Contributor Author

/test-multicluster-integration

memberAnnounce.ClusterID)
data.leaderStatus.reason = ReasonNotElectedLeader
data.leaderStatus.reason = ReasonNotConnectedLeader
Contributor

This is confusing. Is the cluster here a leader that is not connected, or is it not a leader at all? We should revise either the message or the reason, depending on which one is true.

Contributor Author

It's a temporary status indicating that the cluster has not yet been confirmed as the leader, because validation of the leader has not passed yet. It should be overwritten soon, as long as the validation code that follows passes. I renamed it to ReasonNotLeader.
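
The "temporary status that is overwritten once validation passes" pattern described in this reply can be sketched as follows. Only the name `ReasonNotLeader` comes from the thread; its string value, `ReasonConnectedLeader`, the `leaderStatus` fields, and `checkLeader` are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	ReasonNotLeader       = "NotLeader"       // name from this thread; string value assumed
	ReasonConnectedLeader = "ConnectedLeader" // hypothetical
)

// errValidation is a hypothetical validation failure used in the demo below.
var errValidation = errors.New("leader validation failed")

// leaderStatus is a simplified, hypothetical status record.
type leaderStatus struct {
	connected bool
	reason    string
	message   string
}

// checkLeader starts from the pessimistic default and overwrites it only
// after the supplied validation succeeds.
func checkLeader(clusterID string, validate func() error) leaderStatus {
	status := leaderStatus{reason: ReasonNotLeader, message: "leader " + clusterID + " is not validated yet"}
	if err := validate(); err != nil {
		status.message = err.Error()
		return status
	}
	status.connected = true
	status.reason = ReasonConnectedLeader
	status.message = "connected to leader " + clusterID
	return status
}

func main() {
	fmt.Printf("%+v\n", checkLeader("leader-1", func() error { return nil }))
	fmt.Printf("%+v\n", checkLeader("leader-1", func() error { return errValidation }))
}
```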

@luolanzone luolanzone force-pushed the remove-leader-election branch 2 times, most recently from 3abf519 to a31e7f3 Compare July 6, 2022 08:35
@luolanzone
Contributor Author

/test-multicluster-e2e

@luolanzone luolanzone force-pushed the remove-leader-election branch from a31e7f3 to 9bbea2f Compare July 7, 2022 06:48
@luolanzone
Contributor Author

/test-multicluster-e2e

rc.Stop()
}
delete(r.remoteCommonAreas, clusterID)
}
case <-ticker.C:
r.RunLeaderElection()
// Check the leader connection status periodically.
Contributor

I feel we should revisit the relation between remoteCommonAreaManager and RemoteCommonArea. If we have just a single leader cluster, shouldn't we have just one RemoteCommonArea? Why do we still need the complex logic of multiple RemoteCommonAreas, and the async updates between remoteCommonAreaManager and RemoteCommonArea?

Probably we should think about the implementation from scratch, rather than trying to minimize the changes based on the existing code that wants to cover multiple leaders.

Contributor Author

Sure, I will check how to refine this part.
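
The direction suggested in this thread, a reconciler that owns a single RemoteCommonArea instead of a manager with a map of them, could look roughly like the sketch below. The interface is trimmed to methods named in this conversation; `memberReconciler`, `setLeader`, `leaderConnected`, and `fakeArea` are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// RemoteCommonArea is a trimmed-down, hypothetical version of the interface.
type RemoteCommonArea interface {
	GetClusterID() string
	IsConnected() bool
	Stop()
}

// memberReconciler holds at most one remote common area: the single leader.
type memberReconciler struct {
	commonAreaLock   sync.RWMutex
	remoteCommonArea RemoteCommonArea
}

// setLeader stops the previous common area, if any, and installs the new one.
func (r *memberReconciler) setLeader(next RemoteCommonArea) {
	r.commonAreaLock.Lock()
	defer r.commonAreaLock.Unlock()
	if r.remoteCommonArea != nil {
		r.remoteCommonArea.Stop()
	}
	r.remoteCommonArea = next
}

// leaderConnected is what a periodic check could call instead of an election.
func (r *memberReconciler) leaderConnected() bool {
	r.commonAreaLock.RLock()
	defer r.commonAreaLock.RUnlock()
	return r.remoteCommonArea != nil && r.remoteCommonArea.IsConnected()
}

// fakeArea is an in-memory stand-in used only for this example.
type fakeArea struct {
	id        string
	connected bool
	stopped   bool
}

func (f *fakeArea) GetClusterID() string { return f.id }
func (f *fakeArea) IsConnected() bool    { return f.connected }
func (f *fakeArea) Stop()                { f.stopped, f.connected = true, false }

func main() {
	r := &memberReconciler{}
	fmt.Println(r.leaderConnected())
	r.setLeader(&fakeArea{id: "leader-1", connected: true})
	fmt.Println(r.leaderConnected())
}
```

With one leader there is nothing to elect, so the periodic work reduces to checking whether the one connection is still alive.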

@luolanzone luolanzone force-pushed the remove-leader-election branch from 9bbea2f to 4e6bb04 Compare July 11, 2022 09:08
@luolanzone luolanzone changed the title from "Refactor leader election related codes as a checker" to "Refactor common area and remove leader election" Jul 11, 2022
@luolanzone luolanzone force-pushed the remove-leader-election branch from 4e6bb04 to 8837ba8 Compare July 11, 2022 09:16
@luolanzone
Contributor Author

/test-multicluster-e2e

// Client grants read/write to the Namespace of the cluster that is backing this CommonArea.
client.Client

// GetClusterID returns the clusterID of the cluster accessed by this CommonArea.
Contributor

Do you mean "cluster ID of the leader cluster"?

Contributor Author

Yes, the comment is updated.

}
}

func (r *MemberClusterSetReconciler) updateActiveLeader(cluster commonarea.RemoteCommonArea) {
Contributor

I feel we should remove this func and put the logic into checkLeaderConnection() directly, as the two cases (cluster == nil and cluster != nil) should share no common code.

Contributor Author

It's removed now.

r.remoteCommonArea = cluster
if cluster != nil {
if err := cluster.StartWatching(); err != nil {
klog.ErrorS(err, "Failed to start watching events")
Contributor

What will happen? Should we retry StartWatching()? Should we set r.installedLeader.connected = true?

Contributor Author

This part has been refined and moved to remote common area now.

remoteCommonAreas := r.remoteCommonAreaManager.GetRemoteCommonAreas()
if len(remoteCommonAreas) <= 0 {
return nil, "", errors.New("ClusterSet has not been set up properly, no available Common Area")
if r.remoteCommonArea.IsConnected() {
Contributor

Here it is possible that StartWatching() has not been called yet. Is that OK?

Contributor Author

I think it's OK. StartWatching() is for the importer part. GetRemoteCommonAreaAndLocalID is only used by the exporter, and the exporter can work as long as the client can talk to the leader cluster.
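
As a rough sketch of the exporter-side lookup discussed here: the function name, the `IsConnected()` check, and the first error message appear in the diff fragment above, while the struct layout, the second error message, and `fakeArea` are assumptions. Locking is left out to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// RemoteCommonArea is a trimmed-down, hypothetical version of the interface.
type RemoteCommonArea interface {
	GetClusterID() string
	IsConnected() bool
}

// memberReconciler is a hypothetical holder of the single leader connection.
type memberReconciler struct {
	localClusterID   string
	remoteCommonArea RemoteCommonArea
}

// GetRemoteCommonAreaAndLocalID only requires that the client can reach the
// leader; it does not depend on any watch having been started.
func (r *memberReconciler) GetRemoteCommonAreaAndLocalID() (RemoteCommonArea, string, error) {
	if r.remoteCommonArea == nil {
		return nil, "", errors.New("ClusterSet has not been set up properly, no available Common Area")
	}
	if !r.remoteCommonArea.IsConnected() {
		return nil, "", errors.New("no connected remote common area") // message assumed
	}
	return r.remoteCommonArea, r.localClusterID, nil
}

// fakeArea is an in-memory stand-in used only for this example.
type fakeArea struct {
	id        string
	connected bool
}

func (f *fakeArea) GetClusterID() string { return f.id }
func (f *fakeArea) IsConnected() bool    { return f.connected }

func main() {
	r := &memberReconciler{localClusterID: "member-1"}
	_, _, err := r.GetRemoteCommonAreaAndLocalID()
	fmt.Println(err)
	r.remoteCommonArea = &fakeArea{id: "leader-1", connected: true}
	area, localID, err := r.GetRemoteCommonAreaAndLocalID()
	fmt.Println(area.GetClusterID(), localID, err)
}
```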

@luolanzone luolanzone force-pushed the remove-leader-election branch from 8837ba8 to a3c2844 Compare July 12, 2022 04:01
@luolanzone
Contributor Author

/test-multicluster-e2e

// Client grants read/write to the Namespace of the cluster that is backing this CommonArea.
client.Client

// GetClusterID returns the clusterID of the leader cluster accessed by this CommonArea.
Contributor

Remove "accessed by this CommonArea" (do not understand what it means).

string(r.remoteCommonAreaManager.GetLocalClusterID()),
r.remoteCommonAreaManager.GetNamespace(),
string(r.GetLocalClusterID()),
r.localNamespace,
Contributor

Was this a bug?

Contributor Author

@luolanzone, Jul 13, 2022

No. RemoteCommonArea didn't have the local cluster ID/Namespace information; we set it in RemoteCommonAreaManager before. Since I deleted RemoteCommonAreaManager, the localNamespace is now passed to RemoteCommonArea.

Contributor

Ok.
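
The change explained in this thread, handing the local cluster's identity to the RemoteCommonArea once the manager no longer exists, might be sketched like this. The constructor name, its parameters, and the struct fields are hypothetical; only `GetClusterID`, `GetLocalClusterID`, and `localNamespace` are names that appear in the conversation.

```go
package main

import "fmt"

// remoteCommonArea is a hypothetical, simplified struct. Without a manager,
// it carries both the leader's identity and the local cluster's identity.
type remoteCommonArea struct {
	clusterID      string // leader cluster ID
	namespace      string // Namespace in the leader cluster
	localClusterID string // ID of this member cluster
	localNamespace string // Namespace in this member cluster
}

// newRemoteCommonArea receives the local values directly from its creator.
func newRemoteCommonArea(leaderID, leaderNamespace, localClusterID, localNamespace string) *remoteCommonArea {
	return &remoteCommonArea{
		clusterID:      leaderID,
		namespace:      leaderNamespace,
		localClusterID: localClusterID,
		localNamespace: localNamespace,
	}
}

func (r *remoteCommonArea) GetClusterID() string      { return r.clusterID }
func (r *remoteCommonArea) GetLocalClusterID() string { return r.localClusterID }

func main() {
	area := newRemoteCommonArea("leader-1", "leader-ns", "member-1", "member-ns")
	fmt.Println(area.GetClusterID(), area.GetLocalClusterID(), area.localNamespace)
}
```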

@luolanzone luolanzone force-pushed the remove-leader-election branch from a3c2844 to 1e65ddf Compare July 13, 2022 03:55
@luolanzone luolanzone added this to the Antrea v1.8 release milestone Jul 13, 2022
// MemberClusterSetReconciler reconciles a ClusterSet object in the member cluster deployment.
type MemberClusterSetReconciler struct {
client.Client
Scheme *runtime.Scheme
Namespace string
mutex sync.RWMutex
// commonAreaLock protects the access to RemoteCommonArea which presents the leader cluster.
Contributor

Remove "which presents the leader cluster"

Contributor Author

Done

@@ -250,12 +240,20 @@ func (r *MemberClusterSetReconciler) updateStatus() {
status := multiclusterv1alpha1.ClusterSetStatus{}
status.TotalClusters = int32(len(r.clusterSetConfig.Spec.Members))
status.ObservedGeneration = r.clusterSetConfig.Generation
status.ClusterStatuses = r.remoteCommonAreaManager.GetMemberClusterStatues()
status.ClusterStatuses = []multiclusterv1alpha1.ClusterStatus{}
if r.remoteCommonArea != nil {
Contributor

We just need to lock from this line to line 251?

Contributor Author

You are right, changed the scope.
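
Narrowing the lock scope as agreed here could look like the sketch below: the read lock covers only the access to the shared leader state, and the slower status write runs after the lock is released. All type and function names are hypothetical stand-ins; the `write` callback plays the role of the API update.

```go
package main

import (
	"fmt"
	"sync"
)

// ClusterStatus is a simplified, hypothetical status entry.
type ClusterStatus struct {
	ClusterID string
	Connected bool
}

// leader is a hypothetical snapshot of the leader connection.
type leader struct {
	id        string
	connected bool
}

type reconciler struct {
	commonAreaLock sync.RWMutex
	leader         *leader
}

// collectStatuses holds the read lock only while it reads r.leader.
func (r *reconciler) collectStatuses() []ClusterStatus {
	statuses := []ClusterStatus{}
	r.commonAreaLock.RLock()
	if r.leader != nil {
		statuses = append(statuses, ClusterStatus{ClusterID: r.leader.id, Connected: r.leader.connected})
	}
	r.commonAreaLock.RUnlock()
	return statuses
}

// updateStatus performs the potentially slow write without holding the lock.
func (r *reconciler) updateStatus(write func([]ClusterStatus) error) error {
	statuses := r.collectStatuses()
	return write(statuses)
}

func main() {
	r := &reconciler{leader: &leader{id: "leader-1", connected: true}}
	_ = r.updateStatus(func(s []ClusterStatus) error {
		fmt.Printf("%+v\n", s)
		return nil
	})
}
```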

secretName := addedLeader.Secret
klog.InfoS("ClusterSet update", "old", currentCommonArea.GetClusterID(), "new", newLeader)
klog.InfoS("Stopping old RemoteCommonArea", "cluster", clusterID)
r.remoteCommonArea.Stop()
Contributor

Should we check the error? I saw you checked the Stop() error at line 95. Could you check whether Stop() can really fail?

Contributor

Nit: consider currentCommonArea.Stop()

Contributor Author

Done

klog.InfoS("Creating RemoteCommonArea", "cluster", clusterID)
// Read secret to access the leader cluster. Assume secret is present in the same Namespace as the ClusterSet.
secret, err := r.getSecretForLeader(secretName, clusterSet.GetNamespace())
if err == nil {
Contributor

I suggest you change the block to:

if err != nil {
    klog.ErrorS(err, "Failed to get Secret to create RemoteCommonArea", "secret", secretName, "cluster", clusterID)
    return err
}
...
return nil

Contributor Author

Done

@luolanzone luolanzone force-pushed the remove-leader-election branch from 1e65ddf to 750a42f Compare July 14, 2022 02:16
@luolanzone
Contributor Author

/test-multicluster-e2e

jianjuns previously approved these changes Jul 15, 2022
Contributor

@jianjuns left a comment

LGTM

@jianjuns
Contributor

/test-all

@jianjuns jianjuns changed the title from "Refactor common area and remove leader election" to "Refactor Common Area and remove leader election" Jul 15, 2022
1. Antrea Multi-cluster will support only one leader cluster. There is
no need to set up a common area manager and do leader election, so clean up
the remote common area manager related code and refactor the member clusterset
reconciler to handle connection/disconnection cases between member and
leader clusters.
2. Remove the unused clusterset webhook.
3. Add schema to validate the number of leader clusters in the ClusterSet CR.

Signed-off-by: Lan Luo <[email protected]>
@luolanzone
Contributor Author

/test-multicluster-e2e
/test-all

@jianjuns I fixed a go.mod issue and did another force push. Could you double check whether this PR is OK to merge?
I created another PR, #4026, based on this one. It would be better if you could take a look at both; I will rebase it once this one is merged. Thanks.
