Allow management of the Envoy configuration parameter: close_connections_on_host_set_change
#9505
Labels: Good First Issue (good issue for newbies), Prioritized (prioritized to be worked on in the RFE stream), release/1.18, Type: Enhancement (new feature or request)
Gloo Edge Product: Open Source
Gloo Edge Version: v1.16.13
Is your feature request related to a problem? Please describe.
Feature Request: Solving the WebSocket Split-Brain Problem.
In my scenario, multiple instances of Envoy Proxy serve several instances of the backend.
The backend operates using the WebSocket protocol.
Users connect to the backend and, via the Maglev consistent-hashing load-balancing algorithm, consistently reach the same pod, regardless of which Envoy Proxy instance they connect through.
During a network failure, different Envoy instances start seeing different sets of healthy backend instances.
WebSocket sessions get rebalanced to different pods, disrupting communication between users.
I expect that once the network issue is resolved and the pod is back in the load balancer, the sessions will automatically rebalance and once again route to a single pod.
However, this does not happen due to the default behavior.
This logic in Envoy is controlled by the configuration parameter close_connections_on_host_set_change, which is currently not exposed through Gloo: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/cluster.proto#config-cluster-v3-cluster-commonlbconfig
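For reference, in raw Envoy this is a boolean on the cluster's common_lb_config; a minimal sketch of the desired behavior (cluster name, service address, and port are placeholders):

```yaml
static_resources:
  clusters:
  - name: backend
    type: STRICT_DNS
    lb_policy: MAGLEV
    common_lb_config:
      # Drain all connections (including long-lived WebSockets) whenever
      # the set of healthy hosts changes, so that Maglev re-hashes every
      # session onto the same host set across all Envoy instances.
      close_connections_on_host_set_change: true
    load_assignment:
      cluster_name: backend
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: backend.default.svc.cluster.local
                port_value: 8080
```

With this flag set, clients reconnect after a host-set change and are re-routed consistently, which is exactly the recovery behavior described above.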
Describe the solution you'd like
Add a configuration block to manage the Envoy parameter close_connections_on_host_set_change.
Describe alternatives you've considered
Monitoring and Terminating Sessions on the Client Side: This requires implementation on the backend side and is potentially slower than an Envoy-side implementation.
Using a Single Instance of Envoy Proxy: This would result in a loss of fault tolerance.
Increasing Timeouts and the Number of Checks Before Removing a Pod from the Load Balancer: This reduces the number of incidents but also decreases the response time to a pod failure.
Additional Context
It might be worth considering the possibility of adding an entire configuration block for config.cluster.v3.Cluster.CommonLbConfig.
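One possible shape for such a block, assuming it hangs off the existing loadBalancerConfig field on the Gloo Upstream CRD; the closeConnectionsOnHostSetChange field below is hypothetical and does not exist in the current Gloo API:

```yaml
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: backend-upstream
  namespace: gloo-system
spec:
  kube:
    serviceName: backend        # placeholder service
    serviceNamespace: default
    servicePort: 8080
  loadBalancerConfig:
    maglev: {}
    # Hypothetical field mirroring Envoy's
    # Cluster.CommonLbConfig.close_connections_on_host_set_change
    closeConnectionsOnHostSetChange: true
```

Exposing the whole CommonLbConfig message instead of a single boolean would also cover related knobs (e.g. healthy-panic threshold) in one pass.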