Investigate worker connection accept balance #4602
cc @tonya11en
This can be broken out into a bunch of different tasks, so we can probably have multiple assignees here. I'll volunteer myself to be the first. An important first step will be to get per-worker stats, so that's where I'll begin working. From there we'll have a basis to begin a more thorough investigation into the performance impact of the items above.
@tonya11en I'm currently working on per-worker stats for event loop duration. Have you started on any of the others? I can take some of these on as well.
@htuch I haven't started work on any of the others yet. Happy to help review.
FYI for those watching: I'm chasing an internal performance problem that may be related to this, so I will be working on it.
For anyone watching that cares about this issue, I have hacked up code that adds per-worker connection and watchdog miss stats and also rebalances active TCP connections across listeners. I plan on cleaning this up and upstreaming it, but I'm not sure exactly when I will get to it. If anyone urgently wants to work on this, let me know. My WIP branch is here: https://github.com/envoyproxy/envoy/tree/cx_rebalance
@mattklein123 would you be able to share the symptoms of the internal performance problem that you think are related to this, if it is something that can be shared? The reason I am asking is that we are doing some perf tests of a low-latency gRPC service and we see that p99 latencies increase as req/min increases, and at the same time we also see the p99 connect_time_ms to the upstream cluster increase. I was wondering whether it could be due to a high number of worker threads (we left the concurrency arg at the default) or could be related to this specific issue? We haven't enabled the dispatcher stats yet. Any thoughts?
@ramaraochavali we are chasing a very similar issue. We have a gRPC service which has many downstream callers (so a large number of incoming mesh connections) and a single upstream callee. At P99, the latency to the upstream service is almost double what the upstream service thinks its downstream latency is. I've been looking at this for days and I can't find any issue here in terms of blocking or anything other than straight-up load, but the delta is pretty large, 30-40ms in some cases. The only theory I really have right now is that this is a straight-up load issue where Envoy and the service are context switching back and forth, Envoy has to deal with a large number of individual connections, and at P99 there are interleavings that just go bad. I'm still looking though.
I forgot to add that balancing the connections and increasing the workers does not seem to improve upstream P99 latency, which is why I think there may be scheduling contention with the service.
@mattklein123 Thanks a lot for the details. This is very useful information. I will share here if we find anything different from what you have found based on our tests/tweaks to the config.
@oschaaf see the above thread if you have time. Do we have any NH simulations in which we try to model this type of service mesh use case? The characteristics would be:
My suspicion here is that this is a pathological case in which we get a lot of context switching, have to switch/handle a lot of OS connections, etc. I think it's conceivable that there are event loop efficiency issues also but I'm not sure (e.g., the way that we defer writes by activating the write ready event vs. doing them inline). cc @jmarantz our other perf guru
I have some naive questions :) I'm inferring from the discussion that the Envoy is sharing a host with its single upstream service process. Is that right? How compute-intensive vs. I/O-bound is the service? How multi-threaded is the service? I know that Envoy by default will create ~1 worker per HW thread; is that how you have it set? If the service had (say) 5 threads, could you play games like subtracting 5, or have an Envoy option to use Max(ConfiguredMinimumThreadCount, NumHardwareThreads - ConfiguredCarveOut) or something like that? What happens to perf if you give the upstream a dedicated machine on the same rack as the Envoy?
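For concreteness, here is a minimal sketch of the carve-out formula suggested above; the knob names are hypothetical and are not existing Envoy options:

```cpp
// Hypothetical sketch of the thread-count carve-out suggested above; the
// knob names are made up for illustration and are not Envoy options.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
  const unsigned configured_minimum_thread_count = 2;  // hypothetical floor
  const unsigned configured_carve_out = 5;  // threads reserved for the co-located service
  const unsigned hw_threads = std::max(1u, std::thread::hardware_concurrency());
  // Max(ConfiguredMinimumThreadCount, NumHardwareThreads - ConfiguredCarveOut),
  // guarding against unsigned underflow when the carve-out exceeds the HW count.
  const unsigned workers =
      hw_threads > configured_carve_out
          ? std::max(configured_minimum_thread_count, hw_threads - configured_carve_out)
          : configured_minimum_thread_count;
  std::printf("hardware threads: %u -> Envoy workers: %u\n", hw_threads, workers);
  return 0;
}
```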
Envoy is a sidecar to the service (service mesh ingress/egress)
Running Envoy with more workers does not seem to appreciably improve the situation, and neither does listener connection balancing. I think there are lots of things to try, for example pinning the service to 2 cores and pinning Envoy to 2 cores, etc., but I haven't gone there yet. I've mainly been trying to figure out if anything in Envoy is blocking or taking a really long time, and I can't find anything.
@mattklein123 another approach to tackling this kind of hot spotting is to just add a per-TCP connection timeout and do a graceful GOAWAY after this, e.g. every 10 minutes. This will cause the gRPC client to treat this TCP connection as drained and try another one. The new TCP connection will provide better stochastic balancing behavior.
I agree this is a good approach in some cases, but not in every case, particularly the sidecar case in which the local service may only make a very small number of HTTP/2 connections to the sidecar. This is why I hacked up "perfect" CX balancing, though it didn't make much of a difference to our workload AFAICT. The only thing that I still want to spend some time looking at (probably not until next week) is whether how we handle write events is somehow contributing to the problem we are seeing. The current code never writes inline. It always activates an event which then handles the writes. I wonder if there are some pathological cases in which a bunch of write events get delayed to a single event loop iteration and then take a large amount of time. I don't see anything obvious here but it's worth a look. I will also work on upstreaming the new stats I added as well as the CX balancing code, which I do think is useful for certain cases.
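To make the inline-vs-deferred distinction concrete, here is a minimal libevent-style sketch (illustrative only, not Envoy's actual connection code): the deferred path activates a write event so the loop performs the write on a later iteration, while the inline path issues the syscall from the caller's context.

```cpp
// Minimal libevent sketch contrasting an inline write with a deferred write
// that is serviced on a later event-loop iteration via event activation.
#include <event2/event.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

struct Conn {
  int fd;
  struct event* write_ev = nullptr;
  const char* pending = nullptr;
  size_t pending_len = 0;
};

// Runs when the loop services the activated (or ready) write event.
static void on_writable(evutil_socket_t fd, short, void* arg) {
  auto* c = static_cast<Conn*>(arg);
  if (c->pending_len > 0) {
    const ssize_t n = ::write(fd, c->pending, c->pending_len);
    if (n > 0) {
      c->pending += n;
      c->pending_len -= static_cast<size_t>(n);
    }
  }
  if (c->pending_len > 0) event_add(c->write_ev, nullptr);  // retry later
}

// Inline path: write immediately, falling back to the event only on a partial write.
static void write_inline(Conn& c, const char* data, size_t len) {
  ssize_t n = ::write(c.fd, data, len);
  if (n < 0) n = 0;
  c.pending = data + n;
  c.pending_len = len - static_cast<size_t>(n);
  if (c.pending_len > 0) event_add(c.write_ev, nullptr);
}

// Deferred path: queue the data and activate the write event; the syscall
// happens inside the event loop, not in the caller's context.
static void write_deferred(Conn& c, const char* data, size_t len) {
  c.pending = data;
  c.pending_len = len;
  event_active(c.write_ev, EV_WRITE, 0);
}

int main() {
  struct event_base* base = event_base_new();
  Conn c{STDOUT_FILENO};
  c.write_ev = event_new(base, c.fd, EV_WRITE, on_writable, &c);
  write_deferred(c, "deferred hello\n", std::strlen("deferred hello\n"));
  event_base_loop(base, EVLOOP_ONCE);  // one loop pass services the activated event
  write_inline(c, "inline hello\n", std::strlen("inline hello\n"));
  event_free(c.write_ev);
  event_base_free(base);
  return 0;
}
```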
In our case, I doubt we'd need any rebalancing of active connections if the mapping of brand new connections to Envoy threads was done round-robin or least-conns. In our cases, the many incoming connections have even request load across them, yet most of our Envoy threads appear to have virtually zero connections assigned to them. Rebalancing active connections would reactively detect and fix this form of imbalance, but there'd still be the initial period of large imbalance we'd have to work around (e.g. by over-scaling or eating bad tail latencies during deployments or scale ups). In a sense, auto re-balancing further hides this initial connection imbalance. Thoughts on starting with just better predictability/control over the mapping of new connections to worker threads?
Sorry, this is what I actually implemented, not rebalancing once accepted. Envoy will likely never support the latter given the architecture.
@mattklein123 Is there visibility into event loop queue depth and/or time spent in queue? Could the hidden latency be time spent sitting in the event loop queue, e.g. if the processing on that thread was backed up? Not sure I'm describing this in a way that even makes sense w/ my limited understanding of Envoy :)
Not currently, though this would be interesting to track potentially. cc @mergeconflict @jmarantz @oschaaf
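One cheap way to get a signal like this (an illustrative sketch, not an existing Envoy stat) is to post a timestamped, zero-delay timer and measure how long the loop takes to actually run it; if the loop is backed up, the measured delay grows.

```cpp
// Illustrative sketch: measure event-loop "queue" delay by scheduling an
// immediate timer and timing how long it takes to fire.
#include <event2/event.h>
#include <sys/time.h>
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct Probe {
  Clock::time_point posted;
};

static void on_probe(evutil_socket_t, short, void* arg) {
  const auto* p = static_cast<const Probe*>(arg);
  const auto delay_us = std::chrono::duration_cast<std::chrono::microseconds>(
                            Clock::now() - p->posted)
                            .count();
  std::printf("event loop delay: %lld us\n", static_cast<long long>(delay_us));
}

int main() {
  struct event_base* base = event_base_new();
  Probe probe{Clock::now()};
  struct event* ev = evtimer_new(base, on_probe, &probe);
  const struct timeval zero = {0, 0};
  evtimer_add(ev, &zero);     // fires on the next loop iteration
  event_base_dispatch(base);  // runs until no pending events remain
  event_free(ev);
  event_base_free(base);
  return 0;
}
```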
Thanks @mattklein123 for the details here. A few points to clarify: Do you expect the bottleneck originates in the egress path? In our similar example we needed to leverage a pool of connections which is regularly refreshed every few minutes (as opposed to a single persistent HTTP/2 connection). With your connection (re-)balancing change in place we should no longer have to perform regular connection recycling, though we still need to maintain a pool of connections from the service -> Envoy to leverage multiple worker threads. In line with the point raised by @htuch, I think it would be valuable for certain scenarios to expose Envoy cluster options for connection age/time-to-live. In addition to the one mentioned above, we have also considered using it to rebalance connected Envoy persistent bidi streams to the XDS control plane in the mesh, especially on XDS server scale-outs.
This PR does a few things: 1) Adds per-worker listener stats, useful for viewing worker connection imbalance. 2) Adds per-worker watchdog miss stats, useful for viewing per worker event loop latency. 3) Misc connection handling cleanups. Part of #4602 Signed-off-by: Matt Klein <[email protected]>
This commit introduces optional connection rebalancing for TCP listeners, targeted at cases where there are a small number of long-lived connections such as service mesh HTTP2/gRPC egress. Part of this change involved tracking connection counts at the per-listener level, which made it clear that we have quite a bit of tech debt in some of our interfaces in this area. I did various cleanups in service of this change which leave the connection handler / accept path in a cleaner state. Fixes #4602 Signed-off-by: Matt Klein <[email protected]>
For those watching this issue, I have a PR up (#8422) which adds configurable connection balancing for TCP listeners. Please try it out.
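For anyone curious, the core idea behind exact balancing can be sketched as a least-connections pick at accept time; this is a simplified illustration, not the actual implementation in #8422:

```cpp
// Simplified sketch of "exact" connection balancing: at accept time, pick
// the worker that currently has the fewest active connections and hand the
// new socket to it.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <vector>

struct Worker {
  std::atomic<int> active_connections{0};

  // In a real proxy this would post the fd to the worker's event loop.
  void post_connection(int fd) {
    const int now = active_connections.fetch_add(1, std::memory_order_relaxed) + 1;
    std::printf("fd %d -> worker now holding %d connection(s)\n", fd, now);
  }
};

// Pick the worker with the fewest active connections (ties go to the first).
Worker& pick_least_loaded(std::vector<Worker>& workers) {
  return *std::min_element(workers.begin(), workers.end(),
                           [](const Worker& a, const Worker& b) {
                             return a.active_connections.load() < b.active_connections.load();
                           });
}

int main() {
  std::vector<Worker> workers(4);
  // Simulate ten accepted sockets; each goes to the least-loaded worker,
  // so a small number of long-lived connections ends up spread evenly.
  for (int fd = 10; fd < 20; ++fd) {
    pick_least_loaded(workers).post_connection(fd);
  }
  return 0;
}
```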
FWIW, we started running #8263 in prod and things look pretty imbalanced:
Are others seeing similar natural imbalances?
This doesn't surprise me at all. The kernel will generally keep things on the same thread if it thinks that it can avoid context switches, etc. Over the years I have come to realize that in almost all cases the kernel knows what it is doing, but for those small cases where that does not hold true, please try #8422. :)
Have we considered use of SO_REUSEPORT on listeners in order to distribute incoming connections to accept queues via hashing? Granted, this socket option behavior may be specific to Linux.
Some people already configure SO_REUSEPORT, at least across processes, and there has been discussion about allowing this within the process also so each worker has its own FD. cc @euroelessar. With that said, from previous experience, this still won't fix the case that I just fixed where you have a tiny number of connections and want to make sure they get spread evenly.
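For reference, a minimal sketch of the per-worker SO_REUSEPORT pattern being discussed (Linux-specific, illustrative only; the port and worker count are arbitrary and error handling is omitted):

```cpp
// Sketch of "one listening socket per worker" via SO_REUSEPORT: every worker
// binds its own socket to the same port, and the kernel hashes incoming
// connections across those sockets.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

static int make_reuseport_listener(uint16_t port) {
  const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  const int one = 1;
  // Allow multiple sockets in this process (or others) to bind the same
  // address/port; the kernel distributes new connections across them.
  ::setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);
  ::bind(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr));
  ::listen(fd, 128);
  return fd;
}

int main() {
  const int num_workers = 4;
  std::vector<std::thread> workers;
  for (int i = 0; i < num_workers; ++i) {
    workers.emplace_back([i] {
      const int fd = make_reuseport_listener(15000);  // arbitrary port
      std::printf("worker %d has its own listening fd %d\n", i, fd);
      // A real worker would run its accept loop / event loop here.
      ::close(fd);
    });
  }
  for (auto& t : workers) t.join();
  return 0;
}
```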
See relevant envoy-dev thread: https://groups.google.com/forum/#!topic/envoy-dev/33QvlXyBinw
Some potential work here:
Note that I've tried variants of ^ over the years and it's extremely difficult to beat the kernel at its own game while generically improving all workloads. With that said, this is a very interesting area to investigate albeit incredibly time consuming.