Investigate worker connection accept balance #4602
cc @tonya11en
This can be broken out into a bunch of different tasks, so we can probably have multiple assignees here. I'll volunteer myself to be the first. An important first step will be to get per-worker stats, so that's where I'll begin working. From there we'll have a basis to begin a more thorough investigation into the performance impact of the items above.
@tonya11en I'm currently working on per-worker stats for event loop duration. Have you started on any of the others? I can take some of these on as well.
@htuch I haven't started work on any of the others yet. Happy to help review.
FYI for those watching: I'm chasing an internal performance problem that may be related to this, so I will be working on it.
For anyone watching that cares about this issue, I have hacked up code that adds per-worker connection and watchdog miss stats and also rebalances active TCP connections across listeners. I plan on cleaning this up and upstreaming it, but I'm not sure exactly when I will get to it. If anyone urgently wants to work on this, let me know. My WIP branch is here: https://github.com/envoyproxy/envoy/tree/cx_rebalance
@mattklein123 would you be able to share the symptoms of the internal performance problem that you think are related to this, if it is something that can be shared? The reason I am asking is that we are doing some perf tests of a low-latency gRPC service and we see that p99 latencies increase as req/min increases, and at the same time we also see the p99 connect_time_ms to the upstream cluster increase. I was wondering whether it could be due to a high number of worker threads (we left the concurrency arg at the default) or could be related to this specific issue? We haven't enabled the dispatcher stats yet. Any thoughts?
@ramaraochavali we are chasing a very similar issue. We have a gRPC service which has many downstream callers (so a large number of incoming mesh connections) and a single upstream callee. At P99, the latency to the upstream service is almost double what the upstream service thinks its downstream latency is. I've been looking at this for days and I can't find any issue here in terms of blocking or anything other than straight-up load, but the delta is pretty large, 30-40ms in some cases. The only theory I really have right now is that this is a straight-up load issue where Envoy and the service are context switching back and forth, Envoy has to deal with a large number of individual connections, and at P99 there are interleavings that just go bad. I'm still looking though.
I forgot to add that balancing the connections and increasing the workers does not seem to improve upstream P99 latency, which is why I think there may be scheduling contention with the service.
@mattklein123 Thanks a lot for the details. This is very useful information. I will share here if we find anything different from what you have found based on our tests/tweaks to the config.
@oschaaf see the above thread if you have time. Do we have any NH simulations in which we try to model this type of service mesh use case? The characteristics would be:
My suspicion here is that this is a pathological case in which we get a lot of context switching, have to switch/handle a lot of OS connections, etc. I think it's conceivable that there are event loop efficiency issues also but I'm not sure (e.g., the way that we defer writes by activating the write ready event vs. doing them inline). cc @jmarantz our other perf guru
I have some naive questions :) I'm inferring from the discussion that the Envoy is sharing a host with its single upstream service process. Is that right? How compute-intensive vs. I/O-bound is the service? How multi-threaded is the service? I know that Envoy by default will create ~1 worker per HW thread; is that how you have it set? If the service had (say) 5 threads, could you play games like subtracting 5, or have an Envoy option to use Max(ConfiguredMinimumThreadCount, NumHardwareThreads - ConfiguredCarveOut) or something like that? What happens to perf if you give the upstream a dedicated machine on the same rack as the Envoy?
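For concreteness, here is a minimal sketch of the carve-out formula suggested above; the knob names are hypothetical and are not existing Envoy options:

```cpp
// Hypothetical sketch of the thread-count carve-out suggested above; the
// knob names are made up for illustration and are not Envoy options.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
  const unsigned configured_minimum_thread_count = 2;  // hypothetical floor
  const unsigned configured_carve_out = 5;  // threads reserved for the co-located service
  const unsigned hw_threads = std::max(1u, std::thread::hardware_concurrency());
  // Max(ConfiguredMinimumThreadCount, NumHardwareThreads - ConfiguredCarveOut),
  // guarding against unsigned underflow when the carve-out exceeds the HW count.
  const unsigned workers =
      hw_threads > configured_carve_out
          ? std::max(configured_minimum_thread_count, hw_threads - configured_carve_out)
          : configured_minimum_thread_count;
  std::printf("hardware threads: %u -> Envoy workers: %u\n", hw_threads, workers);
  return 0;
}
```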
Envoy is a sidecar to the service (service mesh ingress/egress)
Running Envoy with more workers does not seem to appreciably improve the situation, and neither does listener connection balancing. I think there are lots of things to try, for example pinning the service to 2 cores and pinning Envoy to 2 cores, etc., but I haven't gone there yet. I've mainly been trying to figure out if anything in Envoy is blocking or taking a really long time, and I can't find anything.
@mattklein123 another approach to tackling this kind of hot spotting is to just add a per-TCP connection timeout and do a graceful GOAWAY after this, e.g. every 10 minutes. This will cause the gRPC client to treat this TCP connection as drained and try another one. The new TCP connection will provide better stochastic balancing behavior.
I agree this is a good approach in some cases, but not in every case, particularly the sidecar case in which the local service may only make a very small number of HTTP/2 connections to the sidecar. This is why I hacked up "perfect" CX balancing, though it didn't make much of a difference to our workload AFAICT. The only thing that I still want to spend some time looking at (probably not until next week) is whether how we handle write events is somehow contributing to the problem we are seeing. The current code never writes inline. It always activates an event which then handles the writes. I wonder if there are some pathological cases in which a bunch of write events get delayed to a single event loop iteration and then take a large amount of time. I don't see anything obvious here but it's worth a look. I will also work on upstreaming the new stats I added as well as the CX balancing code, which I do think is useful for certain cases.
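To make the inline-vs-deferred distinction concrete, here is a minimal libevent-style sketch (illustrative only, not Envoy's actual connection code): the deferred path activates a write event so the loop performs the write on a later iteration, while the inline path issues the syscall from the caller's context.

```cpp
// Minimal libevent sketch contrasting an inline write with a deferred write
// that is serviced on a later event-loop iteration via event activation.
#include <event2/event.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

struct Conn {
  int fd;
  struct event* write_ev = nullptr;
  const char* pending = nullptr;
  size_t pending_len = 0;
};

// Runs when the loop services the activated (or ready) write event.
static void on_writable(evutil_socket_t fd, short, void* arg) {
  auto* c = static_cast<Conn*>(arg);
  if (c->pending_len > 0) {
    const ssize_t n = ::write(fd, c->pending, c->pending_len);
    if (n > 0) {
      c->pending += n;
      c->pending_len -= static_cast<size_t>(n);
    }
  }
  if (c->pending_len > 0) event_add(c->write_ev, nullptr);  // retry later
}

// Inline path: write immediately, falling back to the event only on a partial write.
static void write_inline(Conn& c, const char* data, size_t len) {
  ssize_t n = ::write(c.fd, data, len);
  if (n < 0) n = 0;
  c.pending = data + n;
  c.pending_len = len - static_cast<size_t>(n);
  if (c.pending_len > 0) event_add(c.write_ev, nullptr);
}

// Deferred path: queue the data and activate the write event; the syscall
// happens inside the event loop, not in the caller's context.
static void write_deferred(Conn& c, const char* data, size_t len) {
  c.pending = data;
  c.pending_len = len;
  event_active(c.write_ev, EV_WRITE, 0);
}

int main() {
  struct event_base* base = event_base_new();
  Conn c{STDOUT_FILENO};
  c.write_ev = event_new(base, c.fd, EV_WRITE, on_writable, &c);
  write_deferred(c, "deferred hello\n", std::strlen("deferred hello\n"));
  event_base_loop(base, EVLOOP_ONCE);  // one loop pass services the activated event
  write_inline(c, "inline hello\n", std::strlen("inline hello\n"));
  event_free(c.write_ev);
  event_base_free(base);
  return 0;
}
```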
In our case, I doubt we'd need any rebalancing of active connections if the mapping of brand new connections to Envoy threads was done round-robin or least-conns. In our cases, the many incoming connections have even request load across them, yet most of our Envoy threads appear to have virtually zero connections assigned to them. Rebalancing active connections would reactively detect and fix this form of imbalance, but there'd still be the initial period of large imbalance we'd have to work around (e.g. by over-scaling or eating bad tail latencies during deployments or scale ups). In a sense, auto re-balancing further hides this initial connection imbalance. Thoughts on starting with just better predictability/control over the mapping of new connections to worker threads?
Sorry, this is what I actually implemented, not rebalancing once accepted. Envoy will likely never support the latter given the architecture.
@mattklein123 Is there visibility into event loop queue depth and/or time spent in queue? Could the hidden latency be time spent sitting in the event loop queue, e.g. if the processing on that thread was backed up? Not sure I'm describing this in a way that even makes sense w/ my limited understanding of Envoy :)
Not currently, though this would be interesting to track potentially. cc @mergeconflict @jmarantz @oschaaf
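One cheap way to get a signal like this (an illustrative sketch, not an existing Envoy stat) is to post a timestamped, zero-delay timer and measure how long the loop takes to actually run it; if the loop is backed up, the measured delay grows.

```cpp
// Illustrative sketch: measure event-loop "queue" delay by scheduling an
// immediate timer and timing how long it takes to fire.
#include <event2/event.h>
#include <sys/time.h>
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct Probe {
  Clock::time_point posted;
};

static void on_probe(evutil_socket_t, short, void* arg) {
  const auto* p = static_cast<const Probe*>(arg);
  const auto delay_us = std::chrono::duration_cast<std::chrono::microseconds>(
                            Clock::now() - p->posted)
                            .count();
  std::printf("event loop delay: %lld us\n", static_cast<long long>(delay_us));
}

int main() {
  struct event_base* base = event_base_new();
  Probe probe{Clock::now()};
  struct event* ev = evtimer_new(base, on_probe, &probe);
  const struct timeval zero = {0, 0};
  evtimer_add(ev, &zero);     // fires on the next loop iteration
  event_base_dispatch(base);  // runs until no pending events remain
  event_free(ev);
  event_base_free(base);
  return 0;
}
```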
Thanks @mattklein123 for the details here. A few points to clarify: Do you expect the bottleneck originates in the egress path? In our similar example we needed to leverage a pool of connections which is regularly refreshed every few minutes (as opposed to a single persistent HTTP/2 connection). With your connection (re-)balancing change in place we should no longer have to perform regular connection recycling, though we still need to maintain a pool of connections from the service -> Envoy to leverage multiple worker threads. In line with the point raised by @htuch, I think it would be valuable for certain scenarios to expose Envoy cluster options for connection age/time-to-live. In addition to the one mentioned above, we have also considered using it to rebalance connected Envoy persistent bidi streams to the XDS control plane in the mesh, especially on XDS server scale-outs.
This PR does a few things: 1) Adds per-worker listener stats, useful for viewing worker connection imbalance. 2) Adds per-worker watchdog miss stats, useful for viewing per worker event loop latency. 3) Misc connection handling cleanups. Part of #4602 Signed-off-by: Matt Klein <[email protected]>
This commit introduces optional connection rebalancing for TCP listeners, targeted at cases where there are a small number of long-lived connections such as service mesh HTTP2/gRPC egress. Part of this change involved tracking connection counts at the per-listener level, which made it clear that we have quite a bit of tech debt in some of our interfaces in this area. I did various cleanups in service of this change which leave the connection handler / accept path in a cleaner state. Fixes #4602 Signed-off-by: Matt Klein <[email protected]>
For those watching this issue, I have a PR up (#8422) which adds configurable connection balancing for TCP listeners. Please try it out.
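For anyone curious, the core idea behind exact balancing can be sketched as a least-connections pick at accept time; this is a simplified illustration, not the actual implementation in #8422:

```cpp
// Simplified sketch of "exact" connection balancing: at accept time, pick
// the worker that currently has the fewest active connections and hand the
// new socket to it.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <vector>

struct Worker {
  std::atomic<int> active_connections{0};

  // In a real proxy this would post the fd to the worker's event loop.
  void post_connection(int fd) {
    const int now = active_connections.fetch_add(1, std::memory_order_relaxed) + 1;
    std::printf("fd %d -> worker now holding %d connection(s)\n", fd, now);
  }
};

// Pick the worker with the fewest active connections (ties go to the first).
Worker& pick_least_loaded(std::vector<Worker>& workers) {
  return *std::min_element(workers.begin(), workers.end(),
                           [](const Worker& a, const Worker& b) {
                             return a.active_connections.load() < b.active_connections.load();
                           });
}

int main() {
  std::vector<Worker> workers(4);
  // Simulate ten accepted sockets; each goes to the least-loaded worker,
  // so a small number of long-lived connections ends up spread evenly.
  for (int fd = 10; fd < 20; ++fd) {
    pick_least_loaded(workers).post_connection(fd);
  }
  return 0;
}
```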
FWIW, we started running #8263 in prod and things look pretty imbalanced:
Are others seeing similar natural imbalances?
This doesn't surprise me at all. The kernel will generally keep things on the same thread if it thinks that it can avoid context switches, etc. Over the years I have come to realize that in almost all cases the kernel knows what it is doing, but for those small cases where that does not hold true, please try #8422. :)
Have we considered use of SO_REUSEPORT on listeners in order to distribute incoming connections to accept queues via hashing? Granted, this socket option behavior may be specific to Linux.
Some people already configure SO_REUSEPORT, at least across processes, and there has been discussion about allowing this within the process also so each worker has its own FD. cc @euroelessar. With that said, from previous experience, this still won't fix the case that I just fixed where you have a tiny number of connections and want to make sure they get spread evenly.
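For reference, a minimal sketch of the per-worker SO_REUSEPORT pattern being discussed (Linux-specific, illustrative only; the port and worker count are arbitrary and error handling is omitted):

```cpp
// Sketch of "one listening socket per worker" via SO_REUSEPORT: every worker
// binds its own socket to the same port, and the kernel hashes incoming
// connections across those sockets.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

static int make_reuseport_listener(uint16_t port) {
  const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  const int one = 1;
  // Allow multiple sockets in this process (or others) to bind the same
  // address/port; the kernel distributes new connections across them.
  ::setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);
  ::bind(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr));
  ::listen(fd, 128);
  return fd;
}

int main() {
  const int num_workers = 4;
  std::vector<std::thread> workers;
  for (int i = 0; i < num_workers; ++i) {
    workers.emplace_back([i] {
      const int fd = make_reuseport_listener(15000);  // arbitrary port
      std::printf("worker %d has its own listening fd %d\n", i, fd);
      // A real worker would run its accept loop / event loop here.
      ::close(fd);
    });
  }
  for (auto& t : workers) t.join();
  return 0;
}
```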
See relevant envoy-dev thread: https://groups.google.com/forum/#!topic/envoy-dev/33QvlXyBinw
Some potential work here:
Note that I've tried variants of ^ over the years and it's extremely difficult to beat the kernel at its own game while generically improving all workloads. With that said, this is a very interesting area to investigate albeit incredibly time consuming.