
Scale: Enabling leader election or multiple EG instances writing status back for the same resource #1953

Closed
Xunzhuo opened this issue Oct 11, 2023 · 12 comments
Labels: area/infra-mgr, kind/enhancement, provider/kubernetes, stale

Comments

@Xunzhuo
Member

Xunzhuo commented Oct 11, 2023

Description:

Envoy Gateway disables leader election by default and doesn't expose it through the Envoy Gateway config. When scaling control plane replicas, multiple EG instances will be writing status back for the same resource.

We need to find out which is more expensive: enabling leader election, with EG sending heartbeats to the API server, or multiple EG instances writing status back for the same resource.
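
For context on the "heartbeats" side of that trade-off: with client-go's leader election, the heartbeat is a periodic update of a coordination.k8s.io Lease object. A minimal sketch of the mechanism (the lease name, namespace, and POD_NAME identity below are illustrative assumptions, not EG's actual values):

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The Lease object is the heartbeat: the leader rewrites it every
	// RetryPeriod to keep its claim fresh. Names here are illustrative.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "envoy-gateway-leader", Namespace: "envoy-gateway-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // Kubernetes defaults
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second, // ~one small API write every 2s, from the leader only
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* safe to write status here */ },
			OnStoppedLeading: func() { os.Exit(1) }, // step down hard to avoid two writers
		},
	})
}
```

So the comparison is roughly one Lease write every RetryPeriod from a single leader versus N replicas racing to update status on every reconcile, where the latter will also generate optimistic-concurrency conflict retries.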

cc @envoyproxy/gateway-maintainers

@Xunzhuo Xunzhuo added kind/enhancement New feature or request kind/decision A record of a decision made by the community. labels Oct 11, 2023
@zirain
Member

zirain commented Oct 11, 2023

Envoy Gateway disables leader election by default and doesn't expose it through the Envoy Gateway config.

This surprises me. IMO, EG should enable leader election.

github-actions bot commented Nov 10, 2023

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Nov 10, 2023
@arkodg arkodg removed the stale label Nov 10, 2023
@arkodg
Contributor

arkodg commented Nov 10, 2023

IMO we should enable leader election (by default) to make sure only one EG controller can:

  • write status back
  • create/write infra (the Envoy proxy fleet & rate limit service)

but we also need to make sure all EG replicas can read/watch resources and generate resources, so any Envoy proxy can connect to any replica to get xDS; this is what supports data plane scale-out. A sketch of this split follows below.
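
This split maps naturally onto controller-runtime's manager (which EG is built on): runnables that implement NeedLeaderElection() gate themselves behind the election, while the rest run on every replica. A minimal sketch, with statusWriter and xdsServer as hypothetical stand-ins for EG's real components:

```go
package main

import (
	"context"
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical stand-ins for EG's real components.
type statusWriter struct{} // writes status back, provisions infra
type xdsServer struct{}    // watches resources and serves xDS

func (s *statusWriter) Start(ctx context.Context) error { <-ctx.Done(); return nil }
func (s *statusWriter) NeedLeaderElection() bool        { return true } // leader only

func (x *xdsServer) Start(ctx context.Context) error { <-ctx.Done(); return nil }
func (x *xdsServer) NeedLeaderElection() bool        { return false } // every replica

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "envoy-gateway",        // illustrative lock name
		LeaderElectionNamespace: "envoy-gateway-system", // illustrative
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = mgr.Add(&statusWriter{}) // started only once this replica wins the election
	_ = mgr.Add(&xdsServer{})    // started on every replica immediately
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```

With this shape, losing the leader only pauses status/infra writes until another replica wins the lease; xDS serving is never interrupted.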

cc @Xunzhuo can you help with this since you raised #2123 :)

@arkodg arkodg added help wanted Extra attention is needed and removed kind/decision A record of a decision made by the community. labels Nov 10, 2023
@Xunzhuo Xunzhuo added this to the v1.0.0-rc1 milestone Dec 7, 2023
@arkodg arkodg removed the road-to-ga label Feb 14, 2024
@arkodg arkodg modified the milestones: v1.0.0-rc1, Backlog Feb 14, 2024
@arkodg arkodg added the provider/kubernetes Issues related to the Kubernetes provider label Feb 14, 2024
@alexwo
Contributor

alexwo commented Feb 19, 2024

Hi, I'm interested in contributing.
Could you please assign me to this issue?

@arkodg arkodg removed the help wanted Extra attention is needed label Feb 19, 2024
@alexwo
Contributor

alexwo commented Feb 22, 2024

Greetings,

I would like to share a draft proposal that I've been considering for this enhancement. Below is an overview of a phased approach we could take to integrate this feature smoothly while minimizing potential disruptions.

Phase 1: Foundation for Leader Election

  • Leader Election Mechanism: We'll integrate a leader election mechanism aligned with Kubernetes standards. This will identify and designate a leading EG pod to manage the xDS service and status updates, ensuring a single source of truth for configuration changes.

  • Flexible Configuration: To cater to diverse operational needs, we can introduce configurable parameters for managing the leader election process. These can include options for adjusting lease durations, renewal intervals, and retry intervals, with Kubernetes' default settings as the fallback (a sketch of these knobs follows this list).

  • Readiness Probes Adjustment: To reflect their status in the leader election hierarchy, we'll modify the readiness probes of our EG pods. Non-leader pods will be marked as "not ready" to prevent them from serving xDS requests, thereby maintaining configuration consistency and preventing potential misconfiguration of connected Envoy proxies.
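
For the Flexible Configuration item, a minimal sketch of how those knobs could map onto controller-runtime's manager options, falling back to the Kubernetes client-go defaults (the function and its wiring into EG's config surface are hypothetical):

```go
package provider

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// electionOptions builds manager options from user-facing settings,
// defaulting each unset value to the client-go default.
func electionOptions(lease, renew, retry time.Duration) ctrl.Options {
	if lease == 0 {
		lease = 15 * time.Second // default LeaseDuration
	}
	if renew == 0 {
		renew = 10 * time.Second // default RenewDeadline
	}
	if retry == 0 {
		retry = 2 * time.Second // default RetryPeriod
	}
	return ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "envoy-gateway", // illustrative lock name
		LeaseDuration:    &lease,          // how long a lease is valid before takeover
		RenewDeadline:    &renew,          // the leader must renew within this window
		RetryPeriod:      &retry,          // wait between acquire/renew attempts
	}
}
```

The usual ordering constraint applies: RetryPeriod < RenewDeadline < LeaseDuration, otherwise the leader cannot reliably keep its lease.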

Phase 2: Expanded xDS Service Capability

Once we've established a stable leader election process, our next step is to enable all replicas to serve the xDS service over gRPC. This phase focuses on:

  • Ensuring Envoy Proxy Routing Consistency: It's crucial that each Envoy proxy receives updates from the same xDS instance to avoid configuration discrepancies. By maintaining a consistent connection to a designated Envoy xDS service endpoint, we can achieve this uniformity.

  • Multi-Replica xDS Service: We can introduce a new configuration option to allow xDS service distribution across multiple replicas. In this mode, all xDS replicas are synchronized and updated on each resource change, with only the leader replica handling the Kubernetes resource status updates (see the sketch below).
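
For the Multi-Replica xDS Service mode, a sketch of what each replica would run, using go-control-plane (which EG's xDS layer builds on); the port and wiring here are illustrative. Every replica keeps its own snapshot cache updated from the shared resource watch, so kube-proxy can route any Envoy proxy to any replica:

```go
package main

import (
	"context"
	"log"
	"net"

	discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()

	// Each replica owns a snapshot cache; the watch/translate pipeline calls
	// SetSnapshot on every resource change, so all replicas converge on the
	// same configuration regardless of who the leader is.
	snapshots := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, nil)

	srv := serverv3.NewServer(ctx, snapshots, serverv3.CallbackFuncs{})
	grpcServer := grpc.NewServer()
	discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)

	lis, err := net.Listen("tcp", ":18000") // illustrative xDS port
	if err != nil {
		log.Fatal(err)
	}
	// A plain ClusterIP Service in front of the Deployment lets kube-proxy
	// spread proxy connections; any replica can answer because the caches
	// are kept in sync.
	log.Fatal(grpcServer.Serve(lis))
}
```

Status updates stay leader-only; the snapshot cache write path is the piece every replica shares.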

Feedback:

I am keen to hear your thoughts, insights, and any concerns you might have regarding this proposal.

@arkodg
Contributor

arkodg commented Feb 22, 2024

Thanks for picking this one up and detailing your plans!

The approach LGTM. My suggestion would be to not spend extra effort on the Readiness Probes Adjustment (disabling the other replicas), and instead invest it directly in the Multi-Replica xDS Service (which is how it's designed today: kube-proxy is already helping us spread the load across different xDS servers).
Only after 1 & 2 can we confidently recommend that users scale EG replicas.

@arkodg arkodg added the area/infra-mgr Issues related to the provisioner used for provisioning the managed Envoy Proxy fleet. label Feb 22, 2024
@alexwo
Contributor

alexwo commented Feb 22, 2024

@arkodg That sounds like a plan. We can move forward with implementing a comprehensive solution for xDS right from the start.

Would it still be beneficial to have an option for a deployment where only a single xDS instance is active, backed by a standby replica?

This could offer better consistency and reliability, especially in scenarios where synchronization issues might lead to a replica becoming outdated.

@arkodg
Contributor

arkodg commented Feb 22, 2024

@arkodg That sounds like a plan. We can move forward with implementing a comprehensive solution for xDS right from the start.

Would it still be beneficial to have an option for a deployment where only a single xDS instance is active, backed by a standby replica? This could offer better consistency and reliability, especially in scenarios where synchronization issues might lead to a replica becoming outdated.

We could consider an active/passive control plane replica setup as Phase 3 :) as an opt-in; it would be good to create a sub-issue and get community feedback.

@alexwo
Contributor

alexwo commented Feb 22, 2024

@arkodg Yes, sounds like phase 3 :) 👍

@alexwo
Contributor

alexwo commented Mar 5, 2024

I have prepared a PR; it's ready for review.

These use cases were manually validated:

  1. Blocking the gRPC port of the leader (with an iptables rule).
    Result: the Envoy proxy is able to discover the xDS configuration from at least one of the remaining EG replicas.

  2. Blocking the gRPC port of the leader plus all secondary pods except one.
    Result: despite the widespread blockage, the Envoy proxy still obtains xDS configuration from the remaining unblocked EG replica, maintaining operational resilience.

  3. Applying the k8s quickstart (to confirm only the leader updates status/infra).
    Result: status updates and infrastructure are exclusively managed by the leader controller.

  4. Making the k8s control plane API unavailable to the leader instance.
    Result: the leader restarts; the other instances continue to serve xDS and take over leadership when possible.

Best Regards,
Alex


github-actions bot commented Apr 4, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Apr 4, 2024
@arkodg
Contributor

arkodg commented Apr 4, 2024

completed with #2694
