Allow management of the Envoy configuration parameter: close_connections_on_host_set_change. #9505

Closed
evsasha opened this issue May 21, 2024 · 1 comment · Fixed by #10226
Assignees
Labels
Good First Issue Good issue for newbies Prioritized Indicating issue prioritized to be worked on in RFE stream release/1.18 Type: Enhancement New feature or request

Comments


evsasha commented May 21, 2024

Gloo Edge Product

Open Source

Gloo Edge Version

v1.16.13

Is your feature request related to a problem? Please describe.

Feature Request: Solving the WebSocket Split Brain Problem.

In my scenario, multiple instances of Envoy Proxy serve several instances of the backend.
The backend operates using the WebSocket protocol.
Users connect to the backend and, via the Maglev consistent-hashing load-balancing algorithm, consistently reach the same pod, regardless of which Envoy Proxy instance they connect through.

During a network failure, different Envoy instances start seeing a different number of backend instances.
WebSocket sessions get rebalanced to different pods, resulting in disrupted communication between users.

I expect that once the network issue is resolved and the pod is back in the load balancer, the sessions will automatically rebalance and once again route to a single pod.
However, with Envoy's default behavior this does not happen: existing connections stay pinned to whichever hosts they landed on during the failure.

This logic in Envoy is controlled by the configuration parameter close_connections_on_host_set_change, which is currently unavailable when using Gloo.

https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/cluster.proto#config-cluster-v3-cluster-commonlbconfig
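For reference, in a raw Envoy cluster definition this flag lives on the cluster's `CommonLbConfig`. A minimal sketch (the cluster name, address, and port are illustrative, not from this deployment):

```yaml
clusters:
- name: websocket-backend          # illustrative name
  type: STRICT_DNS
  lb_policy: MAGLEV                # consistent hashing, as in the scenario above
  common_lb_config:
    # When the set of hosts in the cluster changes, drain connection pools so
    # clients re-hash onto a consistent pod instead of staying split-brained.
    close_connections_on_host_set_change: true
  load_assignment:
    cluster_name: websocket-backend
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: backend.default.svc, port_value: 8080 }
```

The ask in this issue is to surface the `close_connections_on_host_set_change` field through Gloo's Upstream API rather than requiring a raw Envoy escape hatch.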

Describe the solution you'd like

Add a configuration block to manage the Envoy parameter close_connections_on_host_set_change.

Describe alternatives you've considered

  1. Monitoring and Terminating Sessions on the Client Side: This requires implementation on the backend side and is potentially slower than an Envoy-side implementation.

  2. Using a Single Instance of Envoy Proxy: This would result in a loss of fault tolerance.

  3. Increasing Timeouts and the Number of Checks Before Removing a Pod from the Load Balancer: This reduces the number of incidents but also increases the time it takes to detect and react to a pod failure.

Additional Context

It might be worth considering the possibility of adding an entire configuration block for config.cluster.v3.Cluster.CommonLbConfig.
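A sketch of what exposing the wider `config.cluster.v3.Cluster.CommonLbConfig` block could cover, using field names from the Envoy proto linked above (the exact Gloo-side API shape would be up to the maintainers; the values here are illustrative):

```yaml
common_lb_config:
  # Percent of hosts that must be healthy before panic-mode routing kicks in.
  healthy_panic_threshold: { value: 50.0 }
  # Batch host-set updates that arrive within this window into one change.
  update_merge_window: 1s
  # Do not route to newly added hosts until they pass a first health check.
  ignore_new_hosts_until_first_hc: true
  # The field this issue is primarily about.
  close_connections_on_host_set_change: true
```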


@evsasha evsasha added the Type: Enhancement New feature or request label May 21, 2024
@nfuden nfuden added the Good First Issue Good issue for newbies label May 21, 2024
@sam-heilbron sam-heilbron added Prioritized Indicating issue prioritized to be worked on in RFE stream release/1.18 labels Oct 21, 2024
@nfuden nfuden linked a pull request Oct 25, 2024 that will close this issue
@ryanrolds
Contributor

@evsasha, the change should land in 1.18. While working on this it became clear that connections are not closed server side, only the connection pool is drained. Long-lived connections will remain open until they close themselves. Envoy will likely need to be enhanced to support server-side initiated closing (forceful or graceful depending on the protocol) of connections.

There are some existing issues about this option and long-lived connections:
