Improve connection distribution in clusters to reduce latency and reactor utilization #12912

ballard26 · 2023-08-21T16:11:03Z

During testing with OMB on a 3x i3en.6xlarge cluster, it was found that traffic from internal RPC clients can cause reactor utilization on some shards to be 40% higher than on others. This is largely caused by poor distribution of client connections to other brokers amongst shards.

Looking at RPC client in + out bytes in the graph below, we can see that some shards are processing 4x the throughput of others, which can explain a lot of the extra reactor utilization on those shards.

In the chart below, the number of allowed connections was set equal to the number of shards on a given broker. This allows each shard to have its own connections to all of the brokers in the cluster, and we can see in the chart that the standard deviation of reactor utilization is reduced by 50%.
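The effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Redpanda's actual placement logic: the function `shard_loads`, the random choice of owning shards, the round-robin mapping of source shards to connections, and the values 24 shards, 2 peers, and 8 connections are all hypothetical and chosen for illustration. Each source shard contributes one unit of traffic per peer, and that traffic is charged to whichever shard owns the connection it uses.

```python
import random
import statistics


def shard_loads(num_shards, num_peers, conns_per_peer, seed=0):
    """Return the traffic units handled by each shard in a toy model.

    Assumptions (hypothetical, not taken from the Redpanda source):
    - each connection to a peer is owned by one distinct, randomly chosen shard;
    - source shard `src` sends its traffic over connection `src % conns`;
    - every source shard produces one unit of traffic per peer.
    """
    rng = random.Random(seed)
    conns = min(conns_per_peer, num_shards)
    loads = [0] * num_shards
    for _ in range(num_peers):
        # Pick the shards that own the connections to this peer.
        owners = rng.sample(range(num_shards), conns)
        for src in range(num_shards):
            # Traffic from `src` is processed on the connection's owner shard.
            loads[owners[src % conns]] += 1
    return loads


if __name__ == "__main__":
    for conns in (8, 24):
        loads = shard_loads(num_shards=24, num_peers=2, conns_per_peer=conns)
        print(
            f"connections per peer={conns}: "
            f"min={min(loads)} max={max(loads)} "
            f"stddev={statistics.pstdev(loads):.2f}"
        )
```

In this model, when the connection count equals the shard count every shard owns exactly one connection per peer, so the load is uniform. With fewer connections than shards, a subset of shards carries the traffic of all the others, which is the kind of skew reported in this issue.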


mattschumpert · 2023-08-30T18:50:11Z

@StephanDollberg can we update the title of this issue to represent the improvement the PR(s) will make please?

piyushredpanda · 2023-08-30T19:32:40Z

That's on @ballard26 to do. This will help with reducing variance in latencies.

ballard26 · 2023-08-30T19:55:54Z

I've updated the title. Let me know if it's adequate.

mattschumpert · 2023-09-12T18:10:30Z

💥

ballard26 added the kind/bug label Aug 21, 2023

ballard26 mentioned this issue Aug 21, 2023

Make max client connections configurable and refactor rpc::connection_cache #12906

Merged

7 tasks

piyushredpanda assigned ballard26 Aug 21, 2023

mattschumpert added the core label Aug 30, 2023

ballard26 changed the title ~~Poor connection distribution in clusters with more than 8 shards a node~~ Poor connection distribution in cluster Aug 30, 2023

ballard26 changed the title ~~Poor connection distribution in cluster~~ Improve connection distribution in clusters to reduce latency and reactor utilization Aug 30, 2023

ballard26 closed this as completed in #12906 Sep 12, 2023

github-actions bot mentioned this issue Dec 22, 2023

update redpanda appVersion from v23.2.21 to v23.3.1 redpanda-data/helm-charts#950

Merged
