
Scale: Enabling leader election or multiple EG instances writing status back for the same resource #1953

Closed
Xunzhuo opened this issue Oct 11, 2023 · 12 comments
Labels: area/infra-mgr, kind/enhancement, provider/kubernetes, stale

Comments

@Xunzhuo
Member

Xunzhuo commented Oct 11, 2023

Description:

Envoy Gateway disables leader election by default and doesn't expose it through the Envoy Gateway config. When scaling control plane replicas, multiple EG instances will be writing status back for the same resource.

We need to find out which is more expensive: enabling leader election, with EG sending heartbeats to the API server, or multiple EG instances writing status back for the same resource.
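
For context on the "heartbeats" side of that trade-off: with client-go's leader election, the heartbeat is a periodic update of a coordination.k8s.io Lease object. A minimal sketch of the mechanism (the lease name, namespace, and POD_NAME identity below are illustrative assumptions, not EG's actual values):

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The Lease object is the heartbeat: the leader rewrites it every
	// RetryPeriod to keep its claim fresh. Names here are illustrative.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "envoy-gateway-leader", Namespace: "envoy-gateway-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // Kubernetes defaults
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second, // ~one small API write every 2s, from the leader only
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* safe to write status here */ },
			OnStoppedLeading: func() { os.Exit(1) }, // step down hard to avoid two writers
		},
	})
}
```

So the comparison is roughly one Lease write every RetryPeriod from a single leader versus N replicas racing to update status on every reconcile, where the latter will also generate optimistic-concurrency conflict retries.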

cc @envoyproxy/gateway-maintainers

@Xunzhuo Xunzhuo added kind/enhancement New feature or request kind/decision A record of a decision made by the community. labels Oct 11, 2023
@zirain
Member

zirain commented Oct 11, 2023

Envoy Gateway disables leader election by default and doesn't expose it through the Envoy Gateway config.

This surprises me. IMO, EG should enable leader election.

github-actions bot commented Nov 10, 2023

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Nov 10, 2023
@arkodg arkodg removed the stale label Nov 10, 2023
@arkodg
Contributor

arkodg commented Nov 10, 2023

IMO we should enable leader election (by default) to make sure only one EG controller can:

  • write status back
  • create/write infra (the Envoy proxy fleet & rate limit service)

but we also need to make sure all EG replicas can read/watch resources and generate resources, so any Envoy proxy can connect to any replica to get xDS; this is what supports data plane scale-out. A sketch of this split follows below.
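
This split maps naturally onto controller-runtime's manager (which EG is built on): runnables that implement NeedLeaderElection() gate themselves behind the election, while the rest run on every replica. A minimal sketch, with statusWriter and xdsServer as hypothetical stand-ins for EG's real components:

```go
package main

import (
	"context"
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical stand-ins for EG's real components.
type statusWriter struct{} // writes status back, provisions infra
type xdsServer struct{}    // watches resources and serves xDS

func (s *statusWriter) Start(ctx context.Context) error { <-ctx.Done(); return nil }
func (s *statusWriter) NeedLeaderElection() bool        { return true } // leader only

func (x *xdsServer) Start(ctx context.Context) error { <-ctx.Done(); return nil }
func (x *xdsServer) NeedLeaderElection() bool        { return false } // every replica

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "envoy-gateway",        // illustrative lock name
		LeaderElectionNamespace: "envoy-gateway-system", // illustrative
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = mgr.Add(&statusWriter{}) // started only once this replica wins the election
	_ = mgr.Add(&xdsServer{})    // started on every replica immediately
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```

With this shape, losing the leader only pauses status/infra writes until another replica wins the lease; xDS serving is never interrupted.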

cc @Xunzhuo can you help with this since you raised #2123 :)

@arkodg arkodg added help wanted Extra attention is needed and removed kind/decision A record of a decision made by the community. labels Nov 10, 2023
@Xunzhuo Xunzhuo added this to the v1.0.0-rc1 milestone Dec 7, 2023
@arkodg arkodg removed the road-to-ga label Feb 14, 2024
@arkodg arkodg modified the milestones: v1.0.0-rc1, Backlog Feb 14, 2024
@arkodg arkodg added the provider/kubernetes Issues related to the Kubernetes provider label Feb 14, 2024
@alexwo
Contributor

alexwo commented Feb 19, 2024

Hi, I'm interested in contributing.
Could you please assign me to this issue?

@arkodg arkodg removed the help wanted Extra attention is needed label Feb 19, 2024
@alexwo
Contributor

alexwo commented Feb 22, 2024

Greetings,

I would like to share a draft proposal that I've been considering for this enhancement. Below is an overview of a phased approach we could take to integrate this feature smoothly while minimizing potential disruptions.

Phase 1: Foundation for Leader Election

  • Leader Election Mechanism: We'll integrate a leader election mechanism aligned with Kubernetes standards. This will identify and designate a leading EG pod to manage the xDS service and status updates, ensuring a single source of truth for configuration changes.

  • Flexible Configuration: To cater to diverse operational needs, we can introduce configurable parameters for managing the leader election process. These can include options for adjusting lease durations, renewal intervals, and retry intervals, with Kubernetes' default settings as the fallback (a sketch of these knobs follows this list).

  • Readiness Probes Adjustment: To reflect their status in the leader election hierarchy, we'll modify the readiness probes of our EG pods. Non-leader pods will be marked as "not ready" to prevent them from serving xDS requests, thereby maintaining configuration consistency and preventing potential misconfiguration of connected Envoy proxies.
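
For the Flexible Configuration item, a minimal sketch of how those knobs could map onto controller-runtime's manager options, falling back to the Kubernetes client-go defaults (the function and its wiring into EG's config surface are hypothetical):

```go
package provider

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// electionOptions builds manager options from user-facing settings,
// defaulting each unset value to the client-go default.
func electionOptions(lease, renew, retry time.Duration) ctrl.Options {
	if lease == 0 {
		lease = 15 * time.Second // default LeaseDuration
	}
	if renew == 0 {
		renew = 10 * time.Second // default RenewDeadline
	}
	if retry == 0 {
		retry = 2 * time.Second // default RetryPeriod
	}
	return ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "envoy-gateway", // illustrative lock name
		LeaseDuration:    &lease,          // how long a lease is valid before takeover
		RenewDeadline:    &renew,          // the leader must renew within this window
		RetryPeriod:      &retry,          // wait between acquire/renew attempts
	}
}
```

The usual ordering constraint applies: RetryPeriod < RenewDeadline < LeaseDuration, otherwise the leader cannot reliably keep its lease.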

Phase 2: Expanded xDS Service Capability

Once we've established a stable leader election process, our next step is to enable all replicas to serve the xDS service over gRPC. This phase focuses on:

  • Ensuring Envoy Proxy Routing Consistency: It's crucial that each Envoy proxy receives updates from the same xDS instance to avoid configuration discrepancies. By maintaining a consistent connection to a designated Envoy xDS service endpoint, we can achieve this uniformity.

  • Multi-Replica xDS Service: We can introduce a new configuration option to allow xDS service distribution across multiple replicas. In this mode, all xDS replicas are synchronized and updated on each resource change, with only the leader replica handling the Kubernetes resource status updates (see the sketch below).
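
For the Multi-Replica xDS Service mode, a sketch of what each replica would run, using go-control-plane (which EG's xDS layer builds on); the port and wiring here are illustrative. Every replica keeps its own snapshot cache updated from the shared resource watch, so kube-proxy can route any Envoy proxy to any replica:

```go
package main

import (
	"context"
	"log"
	"net"

	discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()

	// Each replica owns a snapshot cache; the watch/translate pipeline calls
	// SetSnapshot on every resource change, so all replicas converge on the
	// same configuration regardless of who the leader is.
	snapshots := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, nil)

	srv := serverv3.NewServer(ctx, snapshots, serverv3.CallbackFuncs{})
	grpcServer := grpc.NewServer()
	discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)

	lis, err := net.Listen("tcp", ":18000") // illustrative xDS port
	if err != nil {
		log.Fatal(err)
	}
	// A plain ClusterIP Service in front of the Deployment lets kube-proxy
	// spread proxy connections; any replica can answer because the caches
	// are kept in sync.
	log.Fatal(grpcServer.Serve(lis))
}
```

Status updates stay leader-only; the snapshot cache write path is the piece every replica shares.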

Feedback:

I am keen to hear your thoughts, insights, and any concerns you might have regarding this proposal.

@arkodg
Contributor

arkodg commented Feb 22, 2024

Thanks for picking this one up and detailing your plans!

The approach LGTM. My suggestion would be to not spend extra effort on the Readiness Probes Adjustment (disabling the other replicas), and instead invest it directly in the Multi-Replica xDS Service (which is how it's designed today: kube-proxy is already helping us spread the load across different xDS servers).
Only after 1 & 2 can we confidently recommend that users scale EG replicas.

@arkodg arkodg added the area/infra-mgr Issues related to the provisioner used for provisioning the managed Envoy Proxy fleet. label Feb 22, 2024
@alexwo
Contributor

alexwo commented Feb 22, 2024

@arkodg That sounds like a plan. We can move forward with implementing a comprehensive solution for xDS right from the start.

Would it still be beneficial to have an option for a deployment where only a single xDS instance is active, backed by a standby replica?

This could offer better consistency and reliability, especially in scenarios where synchronization issues might lead to a replica becoming outdated.

@arkodg
Contributor

arkodg commented Feb 22, 2024

@arkodg That sounds like a plan. We can move forward with implementing a comprehensive solution for xDS right from the start.

Would it still be beneficial to have an option for a deployment where only a single xDS instance is active, backed by a standby replica? This could offer better consistency and reliability, especially in scenarios where synchronization issues might lead to a replica becoming outdated.

We could consider an active/passive control plane replica setup as Phase 3 :) as an opt-in; it would be good to create a sub-issue and get community feedback.

@alexwo
Contributor

alexwo commented Feb 22, 2024

@arkodg Yes, sounds like phase 3 :) 👍

@alexwo
Contributor

alexwo commented Mar 5, 2024

I have prepared a PR; it's ready for review.

These use cases were manually validated:

  1. Blocking the gRPC port of the leader (with an iptables rule).
    Result: the Envoy proxy is able to discover the xDS configuration from at least one of the remaining EG replicas.

  2. Blocking the gRPC port of the leader plus all secondary pods except one.
    Result: despite the widespread blockage, the Envoy proxy still obtains xDS configuration from the remaining unblocked EG replica, maintaining operational resilience.

  3. Applying the k8s quickstart (to confirm only the leader updates status/infra).
    Result: status updates and infrastructure are exclusively managed by the leader controller.

  4. Making the k8s control plane API unavailable to the leader instance.
    Result: the leader restarts; the other instances continue to serve xDS and take over leadership when possible.

Best Regards,
Alex


github-actions bot commented Apr 4, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Apr 4, 2024
@arkodg
Contributor

arkodg commented Apr 4, 2024

completed with #2694
