high-latency connections experience report #9112
Comments
Support for load balancing through an L4 LB is generally not great with Envoy, for many of the reasons you mentioned. Some thoughts:

- I think a relatively easy feature to add to the outlier detection feature set is having outlier detection not mark an upstream as unhealthy, but just close the connection. This would trigger a new connection to be created without marking the host as down.
- We do support concurrency limits on the H/2 stream, see
- An approach I've seen when doing client-side load balancing through an L4 LB is to periodically re-establish connections (say every 5 min) as a way to even out the connections. Events such as scaling up the number of upstream hosts tend to leave connections imbalanced until they are recycled in some way, and periodically recycling the connections achieves that even for connections with very low concurrency.
- Re active health checking: depending on the L4 LB in use, a plain TCP connect health check might work better. If the LB fails the TCP handshake when all backend hosts are down, that would be a reasonable way to detect that the LB itself is unable to service the check, rather than reflecting the health of any particular backing host.
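The last two suggestions map onto existing cluster settings. A minimal sketch, assuming a hypothetical cluster name and illustrative values (max_requests_per_connection recycles connections after a request count rather than on a timer, which approximates periodic re-establishment; an empty tcp_health_check passes on a successful TCP handshake alone):

```yaml
clusters:
- name: upstream_via_l4_lb          # hypothetical cluster name
  connect_timeout: 1s
  type: STRICT_DNS
  max_requests_per_connection: 10000  # recycle connections over time to rebalance
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    tcp_health_check: {}            # connect-only check: passes iff the handshake completes
```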
Thanks for the awesome write-up @bobby-stripe!
+1. This might be a little tricky as outlier detection is disjoint from the connection pool, but I think we could figure out a way to do it without major surgery.
thanks @mattklein123 and @snowp!
I see under Another thought is that we could use HTTP/2
Summary: Implementing HTTP/2 connection pooling and connection-level outlier detection could improve Envoy’s resiliency over high-latency links and when used with transparent L4 load balancers.
We’ve been running Envoy to connect parts of our infrastructure together, as Dylan talked about in his EnvoyCon talk. I wanted to share a bit of an experience report to highlight a few issues we’ve hit along the way. I realize this isn't a traditional issue report - let me know if there is a better venue or format that would help.
Load Balancers
We use L4 load balancers as the gateways for new connections between isolated network segments, as in the diagram below:
In this example, the downstream envoy in VPC 1 can reach several upstream hosts in VPC 2 only through an L4 load balancer. This load balancer’s IP address is the single upstream in host 1 envoy’s cluster.
The major consequence of this is that Envoy active health checks are of limited utility: they can detect a cluster-wide issue, but otherwise reflect the health of only a single host in that cluster (unless reuse_connection is explicitly set to false, in which case each health check is serviced by a random host).
Few Upstreams
Related to the use of load balancers, we have cases where we have a limited number of upstreams in a cluster, often 1 or 2:
In a steady state we may split traffic between two upstream load balancers, but during a deploy (utilizing blue/green clusters) send all traffic through a single load balancer.
This has given us FUD around enabling outlier detection - if we’re routing all traffic to VPC 2 during a deploy to VPC 3 and a single host in VPC 2 (out of 12) is having an issue, we wouldn’t want the downstream host 1’s envoy to enter panic routing and split traffic between VPC 2 and VPC 3. Some of this may be inexperience with outlier detection (I would love to hear if there is a straightforward misunderstanding or way to work around this), but as outlier detection applies the results of its analysis of connections to upstreams, it feels like there is a fundamental mismatch here for us.
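One way to reduce the risk described above with existing knobs is to cap how much outlier detection is allowed to do. The sketch below is illustrative, not a recommendation: the cluster name is hypothetical, the values are examples, and the exact semantics of a 0% ejection cap are worth verifying against the Envoy version in use:

```yaml
clusters:
- name: vpc2_lbs                 # hypothetical cluster name
  common_lb_config:
    healthy_panic_threshold:
      value: 0                   # disable panic routing entirely
  outlier_detection:
    consecutive_5xx: 5
    max_ejection_percent: 0      # intended to prevent ejecting the only LB upstream(s)
```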
Another consequence that we have worked around here is that we end up with uneven request distribution across upstream hosts. We believe this is an interaction between having a small number of upstreams and the fact that, by default, new incoming connections to the downstream Envoy are likely to end up on the same thread/event loop (as that thread is cache/scheduler hot and likely to win the accept(2) race). We are excited to test the connection balancer in Envoy 1.12. Our current workaround is listing upstreams multiple times in a cluster, so that each event loop opens multiple connections through the load balancer.
One thing to note is that we use mTLS with client and server certificates for all connections, which means that when connecting through an L4 LB, an established connection can be tied to a specific upstream identity. It would be possible for a downstream Envoy to learn the members of an upstream cluster behind a load balancer over time. (This feels like either a great idea or a terrible idea.)
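For reference, the Envoy 1.12 connection balancer mentioned above is a listener-level setting; exact_balance spreads accepted connections evenly across worker threads instead of letting one hot thread win the accept race. A minimal sketch (listener name and address are placeholders):

```yaml
listeners:
- name: ingress                  # placeholder listener name
  address:
    socket_address: { address: 0.0.0.0, port_value: 10000 }
  connection_balance_config:
    exact_balance: {}            # balance accepted connections across workers
```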
Connection-level problems
When envoys on either end of long-lived HTTP/2 connections are in different data centers with tens or hundreds of milliseconds of latency between them, we regularly observe connection-level problems:
We most often see network issues where a small percentage of TCP connections between hosts in different regions are affected. I believe most commonly a single direction of the TCP stream is affected - either packets containing requests have a hard time getting from host 1 to host 2, or packets containing responses have a hard time getting from host 2 to host 1.
While a network issue is happening, active healthchecks typically keep passing. We run with reuse_connection set to false for healthchecks, and new connections rarely experience a problem (potentially new connections are routed around whatever the underlying physical problem is).
Relatively quickly, one side or the other of a connection will fill the kernel’s TCP write buffer, and Envoy flow control will trip. I think Envoy will continue queueing new requests for upstreams that are paused.
Next Steps
A few ideas that have already been discussed in the community seem like they would significantly improve the situation, so I want to highlight them here.
Connection-level outlier detection
Outlier detection should be able to mark a connection as unhealthy (in addition to an upstream host). I think the tricky part is figuring out how this interacts with existing upstream-level outlier detection. Maybe if the percentage of connections to an upstream triggering outlier detection passes a threshold, the upstream is marked as unhealthy. If that threshold is 0, you get the current behavior, but users could configure it higher for situations like ours.
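As config, the threshold idea might look like the sketch below. To be clear, the connection-level field is hypothetical; it does not exist in Envoy and is only here to make the proposal concrete:

```yaml
outlier_detection:
  consecutive_5xx: 5
  # HYPOTHETICAL field sketching the proposal above: mark the host unhealthy
  # only when more than this percentage of its connections have been ejected.
  # A value of 0 would reproduce today's behavior, where any ejection-worthy
  # signal marks the whole host as unhealthy.
  connection_ejection_threshold:
    value: 50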
HTTP/2 Connection pooling and per-connection concurrency limits
Ideally we would want to limit the blast radius of a bad HTTP/2 connection - one way to do this would be to have a concurrency limit on HTTP/2 connections. If we hit the limit (as an example: 500 outstanding requests/streams), we open a new connection. This is tracked in #7403.
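The existing per-connection stream limit looks like the sketch below (cluster name hypothetical, value from the example above). Note the limit is ultimately negotiated via HTTP/2 SETTINGS between the peers, and whether hitting it opens a new connection, as proposed here, versus queueing requests depends on the connection pool behavior tracked in #7403:

```yaml
clusters:
- name: upstream                 # hypothetical cluster name
  http2_protocol_options:
    max_concurrent_streams: 500  # per-connection concurrent stream limit
```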