feat(EG K8S Provider): Enable leader election for EG controller #2694

Merged
merged 26 commits into from
Apr 4, 2024
Conversation

alexwo
Contributor

@alexwo alexwo commented Feb 25, 2024

What type of PR is this?

feat(controller): implement leader election for EG controller

What this PR does / why we need it:

  1. This PR introduces leader election for the Envoy Gateway (EG) controller to ensure that only one instance of the controller can perform write operations at any given time.
  2. Ensure consistent availability of xDS (Envoy's discovery service) across the system.

The implementation follows best practices for distributed systems, enhancing the reliability and stability of the EG controller in multi-instance deployments.

Which issue(s) this PR fixes:

This enhancement facilitates the scaling of xDS services and enables leader election support.

High Level Changes

  1. We don't block controller startup, as we set the NeedLeaderElection flag to false by default; this applies to the GatewayAPIController.

  2. The status handler controller postpones its startup until it wins the leader election. After being elected, the EG GatewayAPIController initiates a reconciliation loop to process all pending statuses, ensuring that the status handler controller can pick up all the resources and become up-to-date after assuming leadership.

  3. The infra k8s provider is started after a pod is elected as leader; it will process all infra resources directly, so no reconcile is required in this case.

  4. It's possible to configure leader election settings to gain more control over how the leader election process happens, and how quick and resilient it is. If not provided, these default to the k8s base values.

leaseDuration : 18s
renewDeadline: 10s
retryPeriod: 2s
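
As a hedged sketch, these settings would sit under the Kubernetes provider section of the EnvoyGateway config, per the review suggestion to move them out of Helm values; the exact field names and placement may differ from the merged API:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyGateway
provider:
  type: Kubernetes
  kubernetes:
    leaderElection:
      leaseDuration: 18s   # non-leaders wait this long before contending
      renewDeadline: 10s   # leader must renew its lease within this window
      retryPeriod: 2s      # interval between acquire/renew attempts
```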

These use cases were manually validated:

Blocking the gRPC port of the leader (with an iptables rule)
Result: the Envoy proxy is able to discover the xDS rules from at least one of the remaining EGs.

Blocking the gRPC port of the leader + all secondary pods except one
Result: despite the widespread blockage, the Envoy proxy manages to obtain xDS rules from the remaining unblocked EG, maintaining operational resilience.

Apply the k8s quickstart (confirm only the leader updates status / infra)
Result: status updates and infrastructure are exclusively managed by the leader controller.

The k8s control plane API is not available for the leader instance
Result: the leader restarts; other instances continue to serve xDS and take over leadership when possible.
#1953


codecov bot commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 66.66667%, with 32 lines in your changes missing coverage. Please review.

Project coverage is 66.46%. Comparing base (ec0f31b) to head (29da944).

Files Patch % Lines
internal/infrastructure/runner/runner.go 0.00% 17 Missing ⚠️
internal/provider/kubernetes/kubernetes.go 73.52% 6 Missing and 3 partials ⚠️
api/v1alpha1/envoygateway_helpers.go 85.71% 3 Missing ⚠️
internal/provider/kubernetes/controller.go 62.50% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2694   +/-   ##
=======================================
  Coverage   66.45%   66.46%           
=======================================
  Files         160      161    +1     
  Lines       22553    22632   +79     
=======================================
+ Hits        14988    15042   +54     
- Misses       6696     6717   +21     
- Partials      869      873    +4     


@guydc
Contributor

guydc commented Feb 29, 2024

Should this PR also include some work in the infrastructure controller?

@alexwo
Contributor Author

alexwo commented Feb 29, 2024

Should this PR also include some work in the infrastructure controller?
Hi @guydc, we can attempt to ensure that there is only a single controller that can create/update e.g. infra-related resources at a time.

Yes, the side effect at this point is that infra updates will result in concurrent updates and extra revision conflicts / retries.

@alexwo
Contributor Author

alexwo commented Mar 1, 2024

Should this PR also include some work in the infrastructure controller?
Hi @guydc, we can attempt to ensure that there is only a single controller that can create/update e.g. infra-related resources at a time.

The side effect at this point is that infra updates will result in concurrent updates and extra revision conflicts / retries.

Added changes to ensure that infra will be created only by the elected EG instance.

@alexwo alexwo changed the title feat(EG K8S Provider): implement leader election for EG controller feat(EG K8S Provider): Implement leader election for EG controller Mar 3, 2024
@alexwo alexwo changed the title feat(EG K8S Provider): Implement leader election for EG controller feat(EG K8S Provider): Enable leader election for EG controller Mar 3, 2024
@alexwo alexwo marked this pull request as ready for review March 5, 2024 15:53
@alexwo alexwo requested a review from a team as a code owner March 5, 2024 15:53
return
case <-r.Elected:
initInfra()
r.Logger.Info("started")
Contributor

this log can also be moved to initInfra func

}, nil
}

func parseEnvDuration(envVar string, duration **time.Duration) error {
Contributor

can we also make duration a return value instead of using a secondary pointer? (more like a gopher this way)

@@ -73,6 +79,13 @@ func New(cfg *rest.Config, svr *config.Server, resources *message.ProviderResour
return nil, fmt.Errorf("unable to set up ready check: %w", err)
}

// Emit elected & continue with deployment of infra resources
go func() {
if <-mgr.Elected(); true {
Contributor

Suggested change
if <-mgr.Elected(); true {
if <-mgr.Elected() {

Contributor Author

Switched to using <-mgr.Elected(), given that it's a channel of type struct{}.

@alexwo
Contributor Author

alexwo commented Mar 6, 2024

/retest

1 similar comment
@alexwo
Contributor Author

alexwo commented Mar 27, 2024

/retest

@@ -1,5 +1,10 @@
deployment:
envoyGateway:
leaderElection:
enabled: true
Contributor

this doesn't look like the right way to set these fields; suggest creating a field in the Kubernetes section of the EnvoyGateway config

type EnvoyGatewayKubernetesProvider struct {

also we don't need an explicit enabled field; we can use the instantiation of a leaderElection ptr type to decide whether to enable it or not

This will then need to be passed to the provider via

type Config struct {

Contributor Author

updated

Contributor

this config needs to be removed

Contributor

and moved under config.envoyGateway

@@ -113,6 +113,7 @@ install-ratelimit:
kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system
kubectl rollout status --watch --timeout=5m -n envoy-gateway-system deployment/envoy-gateway
kubectl wait --timeout=5m -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available
tools/hack/deployment-exists.sh "app.kubernetes.io/name=envoy-ratelimit" "envoy-gateway-system"
Contributor

isn't this doing the same thing as L117?

Contributor Author

The kubectl wait command isn't effective for deployment resources that haven't been created yet.
Therefore, the envoy-ratelimit deployment needs to be in place before we can use kubectl wait to ensure it's in the Available state.

Since Envoy Gateway now requires additional time to start up as leader and manage the rate limit deployment, this addition serves to mitigate that delay.

@@ -43,7 +44,7 @@ func (r *Runner) Start(ctx context.Context) (err error) {
if err != nil {
return fmt.Errorf("failed to get kubeconfig: %w", err)
}
p, err := kubernetes.New(cfg, &r.Config.Server, r.ProviderResources)
p, err := kubernetes.New(cfg, &r.Config.Server, r.ProviderResources, r.Elected)
Contributor

this won't be needed, and should be available in r.Config.Server

Contributor Author

fixed

}

r.Logger.Info("started")
// Wait for leader if leader election is enabled
Contributor

maybe adding more explicit comments that highlight what and why this is happening - "block, and only allow the leader to create infrastructure resources"

Contributor Author

added

@arkodg
Contributor

arkodg commented Mar 29, 2024

finally got to this, thanks for building this out ! adding some comments

@alexwo
Contributor Author

alexwo commented Mar 29, 2024

finally got to this, thanks for building this out ! adding some comments

Thanks!

@alexwo
Contributor Author

alexwo commented Mar 29, 2024

/retest

}

type LeaderElection struct {
// LeaseDuration defines the time non-leader contenders will wait before attempting to claim leadership. It's based on the timestamp of the last acknowledged signal. The default setting is 15 seconds.
Contributor

can you follow the kubebuilder way of defining defaults, and also add optional tags?
this may not be a CRD, but it helps with doc generation

@@ -42,6 +42,14 @@ spec:
- server
- --config-path=/config/envoy-gateway.yaml
env:
- name: ENVOY_GATEWAY_LEADER_ELECTION_ENABLED
Contributor

needs to be removed

Contributor Author

@alexwo alexwo Apr 1, 2024

@arkodg does this allow disabling leader election? Or do we actually not want to allow configuring it via Helm at all, only via the Envoy Gateway spec?

Contributor

added a related note below

internal/cmd/server.go (outdated, resolved)
@@ -901,6 +901,13 @@ func (r *gatewayAPIReconciler) addFinalizer(ctx context.Context, gc *gwapiv1.Gat

// watchResources watches gateway api resources.
func (r *gatewayAPIReconciler) watchResources(ctx context.Context, mgr manager.Manager, c controller.Controller) error {
// trigger a reconcile after getting elected
Contributor

curious why this is needed

Contributor Author

@alexwo alexwo Apr 1, 2024

Upon its initial startup, Envoy Gateway will not assume the leader role but will begin reconciling resources, as the main loop of the controller does not require leadership to start processing.

Once the controller is elected as leader, it will not revisit resources it has previously reconciled. This approach is necessary to update the status of resources and manage the infra resources, such as creating EG proxies, after a leader is elected.

Contributor

but it doesn't need to revisit or update status, unless some resource has changed, and in that case the Reconcile would be triggered

Contributor Author

@alexwo alexwo Apr 1, 2024

This will ensure that we reconcile status/infra for resources that changed before there was an active elected leader.

Use cases:

  • Initial startup, when different EG resources can already exist in the cluster.
  • The leader goes down during an outage; it takes some time for the new one to become leader, and resource changes can happen during this time.
  • EG upgrades / re-deployments.
  • etc.

It may take more than a few seconds for a leader to assume leadership, and during this time resource status/infra changes won't be handled; this protects against that.

Contributor

  • init - still trying to understand this use case, since the logic in New that waits for a manager to be elected for the first time should handle this case
  • leader down - k8s should handle this case and re-elect a new leader; if there's a delay, it's worth debugging this in k8s

Contributor Author

@alexwo alexwo Apr 1, 2024

@arkodg
k8s won't trigger a reconcile after a leader gets elected. In this implementation we trigger it on our end, as we constantly reconcile resources for xDS, even when we are not leaders.

We could decide not to do this (reconcile resources only when we are leaders), but then we won't be able to serve xDS from the non-leader EG instances, as they won't be in sync.

  • The main reconcile loop does not wait for leader election; it starts immediately on all pods.

    func (r *gatewayAPIReconciler) Reconcile(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {

    As there is no way to know which of the pods will get elected, the reconcile loop starts in all instances simultaneously. Only the status / infra components start on leader election, so we have to replay for them the events they have missed.

  • Isn't this expected? A pod has to wait a full lease duration before it attempts to become leader, and k8s won't trigger a second full reconcile after a leader gets elected, as in our case the main reconcile loop is already processing items in all pods, including the one that will become the new leader.
    https://github.com/kubernetes/client-go/blob/v0.29.3/tools/leaderelection/leaderelection.go#L133

// LeaseDuration is the duration that non-leader candidates will
// wait to force acquire leadership. This is measured against time of
// last observed ack.
//
// A client needs to wait a full LeaseDuration without observing a change to
// the record before it can attempt to take over. 

Contributor

@arkodg arkodg Apr 1, 2024

cool, thanks for explaining. It sounds like we kick the method again to replay the writes, specifically for resources that were updated/created during the leaderless interval? Is this a common pattern in most CNCF projects?

Contributor Author

@alexwo alexwo Apr 1, 2024

Yes, that's right; our goal is to close the leaderless interval.

The typical approach in Kubernetes and CNCF ecosystems is probably a dedicated controller per flow, such as one for xDS sync that focuses solely on xDS operations without altering status or infrastructure, plus a separate controller that performs status updates and infrastructure modifications.

Based on this, the xDS controller does not need to assume a leadership role, whereas the controller handling status and infrastructure updates does; Kubernetes would then initiate the reconciliation process for it. So in general there would be two very similar reconcile loops -> xDS for non-leaders & leaders, and an extra one for leaders only.

While it's feasible to develop a distinct controller for managing xDS reconciliations and another solely for infrastructure and status modifications, such an overhaul might be extensive and potentially less appealing within the existing EG architecture.

Contributor

sounds good, let's stick with this. Can you also add a comment explaining the why, e.g. why we kick the method again: "Retrigger a reconcile after getting elected to replay the translation (which will trigger status writes and/or infra resource creation), specifically for resources that were updated/created during the leaderless interval"

alexwo added 2 commits April 1, 2024 13:22
Signed-off-by: Alexander Volchok <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
// Default is 15 seconds.
RetryPeriod *time.Duration `json:"retryPeriod,omitempty"`
// Disabled allows to disable leader election (enabled by default)
Disabled *bool `json:"disabled,omitempty"`
Contributor

we are using Disable in other APIs, let's use Disable here
e.g.

Disable bool `json:"disable,omitempty"`

Contributor Author

indeed, I will change it

r.Kubernetes.LeaderElection = DefaultLeaderElection()
}

r.Kubernetes.LeaderElection = DefaultLeaderElection()
Contributor

why is this needed ?

Contributor Author

not needed; cleaning it up.

Signed-off-by: Alexander Volchok <[email protected]>
@@ -195,6 +212,9 @@ func (r *EnvoyGatewayProvider) GetEnvoyGatewayKubeProvider() *EnvoyGatewayKubern

if r.Kubernetes == nil {
r.Kubernetes = DefaultEnvoyGatewayKubeProvider()
if r.Kubernetes.LeaderElection == nil {
Contributor

what about the case where r.Kubernetes is not nil, but r.Kubernetes.LeaderElection is nil

Contributor Author

@alexwo alexwo Apr 3, 2024

added

alexwo added 4 commits April 3, 2024 18:17
Signed-off-by: Alexander Volchok <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
@alexwo
Contributor Author

alexwo commented Apr 3, 2024

/retest

alexwo added 6 commits April 4, 2024 07:20
Signed-off-by: Alexander Volchok <[email protected]>
Signed-off-by: Alex Volchok <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
@alexwo
Contributor Author

alexwo commented Apr 4, 2024

/retest

Contributor

@arkodg arkodg left a comment

LGTM, great to see this feature making it in
thanks for addressing all the feedback !

@arkodg arkodg requested review from a team and zhaohuabing April 4, 2024 07:07
Member

@zhaohuabing zhaohuabing left a comment

LGTM. Thank you for all the effort and patience!

@arkodg arkodg merged commit b9053b8 into envoyproxy:main Apr 4, 2024
22 checks passed
@alexwo alexwo deleted the leader-election branch April 4, 2024 07:38