
kv: partition raft transport over high RTT links across multiple TCP connections #111238

Closed
nvanbenschoten opened this issue Sep 25, 2023 · 6 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-multiregion Related to multi-region C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-kv KV Team

Comments

@nvanbenschoten
Member

nvanbenschoten commented Sep 25, 2023

Bandwidth-delay product principles dictate that the maximum throughput a system can push over a single network link is limited by the round-trip time ("delay") and the window size ("amount of unacknowledged bytes allowed"). Even high bandwidth links can have a relatively low effective throughput if they also have a high delay ("long fat networks"). For more details, see https://networklessons.com/cisco/ccnp-route/bandwidth-delay-product.

In multi-region deployments, such long fat networks (LFNs) are common. Raft is currently not configured to handle them as well as it could be, due to its use of a single TCP connection between nodes and the default maximum window sizes that users typically deploy CockroachDB with. As a result, the Raft transport can be saturated by a sustained write load, leading to instability both on the ranges driving the load and on ranges that are not. Recall that the raft transport is shared across all ranges1 on a node that talk to a common destination node.

Let's talk about window sizes first.

Connection Window Sizes

Over a high-latency link, a single connection's throughput is limited by its window size (in addition to other factors, like packet loss). Modern Linux distributions typically default the maximum TCP window size to 4MB, as we can see in roachprod:

ubuntu@nathan-high-rtt-0001:~$ uname -a
Linux nathan-high-rtt-0001 5.4.0-1044-gcp #47-Ubuntu SMP Tue May 11 15:51:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@nathan-high-rtt-0001:~$ cat /proc/sys/net/ipv4/tcp_window_scaling
1
ubuntu@nathan-high-rtt-0001:~$ cat /proc/sys/net/ipv4/tcp_wmem
4096	16384	4194304
ubuntu@nathan-high-rtt-0001:~$ cat /proc/sys/net/ipv4/tcp_rmem
4096	131072	6291456

CockroachDB goes a step further and limits the per gRPC connection window size to 2MB:

s.values.initialWindowSize = getWindowSize(ctx,
	"COCKROACH_RPC_INITIAL_WINDOW_SIZE", DefaultClass, defaultWindowSize*32)

#35161 tried to replace this constant window size with gRPC's dynamic window resizing, which would let a gRPC connection scale up to the OS's limit. That PR was closed at the time.

Whether 2MB, 4MB, or some other value in the low MBs, these window sizes are all limiting over high-latency links. For example, a 2MB window over a 100ms link caps throughput at 1s/100ms * 2MB = 20MB/s.
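The arithmetic above can be sketched directly (a minimal illustration; the 2MB window and 100ms RTT are the example numbers from this issue, not measured values):

```go
package main

import "fmt"

func main() {
	// A single connection's max throughput is bounded by windowSize / RTT
	// (the bandwidth-delay product relationship).
	windowBytes := 2 << 20 // 2 MiB window, the CRDB per-stream default cited above
	rttSeconds := 0.100    // 100ms round-trip time

	throughputBps := float64(windowBytes) / rttSeconds
	fmt.Printf("max throughput: %.0f MB/s\n", throughputBps/float64(1<<20))
	// → max throughput: 20 MB/s
}
```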

We could adjust these on a case-by-case basis (and should where appropriate), but this still involves CRDB and OS-level manual tuning.

Multiple Connections

Assuming we can't always increase the maximum TCP window size above the OS-level default, we should explore alternatives to increase the raft transport throughput between nodes.

One approach would be to partition the raft transport across multiple gRPC/TCP connections. The simplest scheme would partition connections by range ID, since there is no need to maintain ordering of raft messages across ranges. Even within a range, ordering is not assumed for correctness, though excessive reordering can hurt performance.
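A partitioning scheme like the one described might look like the following sketch (hypothetical names; `pickConn` and `numConns` are illustrative, not CockroachDB APIs). Keying on range ID keeps each range's messages on one connection, preserving per-range ordering while letting different ranges use different connections:

```go
package main

import "fmt"

// pickConn maps a range ID onto one of numConns per-destination-node
// connections. All messages for a given range land on the same
// connection, so intra-range ordering is mostly preserved, while the
// aggregate transport can use numConns TCP windows' worth of in-flight
// data.
func pickConn(rangeID uint64, numConns int) int {
	return int(rangeID % uint64(numConns))
}

func main() {
	for _, r := range []uint64{1, 2, 3, 4, 5} {
		fmt.Printf("range %d -> conn %d\n", r, pickConn(r, 4))
	}
}
```

A modulo over range ID is the simplest choice; a real implementation might hash instead to avoid correlated assignment if range IDs cluster.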

Jira issue: CRDB-31830

Epic CRDB-32846

Footnotes

  1. This is ignoring the system range connection class separation, which protects system ranges from instability when the default connection class's raft transport is saturated.

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs A-kv-replication Relating to Raft, consensus, and coordination. A-multiregion Related to multi-region T-kv-replication labels Sep 25, 2023
@blathers-crl

blathers-crl bot commented Sep 25, 2023

cc @cockroachdb/replication

@bdarnell
Contributor

Note that there are two levels of window size in play: there's the TCP window size, and then HTTP/2 (which underlies GRPC) has its own similar window size mechanism. COCKROACH_RPC_INITIAL_WINDOW_SIZE (and subsequent comments about dynamic sizing) refers to the HTTP/2 window size, not the TCP one. (there's also a third one, HTTP/2's connection-level window, but IIRC we set that one to effectively infinite)

Modern Linux distributions typically default the maximum TCP window size to 4MB, like we can see in roachprod:

The first google result for tcp_wmem recommends setting the maximum to 16MB or higher "especially for 10 Gigabit adapters". https://www.ibm.com/docs/en/linux-on-systems?topic=tuning-tcpip-ipv4-settings. This default is either outdated or overly conservative and we should probably recommend increasing it for multi-region CRDB deployments.
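As a sketch only (the 16MB values follow the IBM tuning guide linked above; exact numbers should be validated per deployment, and this is not an official CockroachDB recommendation), raising the OS-level limits might look like:

```shell
# Raise the maximum TCP send/receive buffer sizes to 16MB,
# leaving min/default at common values. To persist across reboots,
# place these in a file under /etc/sysctl.d/ instead.
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"
sudo sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_max=16777216
```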

My sense is that we should be able to tune the kernel parameters and GRPC for higher throughput in this scenario. I'd consider spreading the traffic across multiple TCP connections to be a last resort if the only consideration is increasing throughput. (There may be other benefits of exploring this path, such as reducing the impact of dropped packets).

Longer-term, QUIC is nice for getting the kernel TCP implementation out of the loop completely. And we might consider sending raft log messages individually instead of as a single HTTP/2 or QUIC stream (we can reconstruct dropped messages from our local copy of the log, so we don't need to have that data duplicated in the kernel's tcp buffer and be limited by tcp_wmem)

@erikgrinaker
Contributor

erikgrinaker commented Sep 26, 2023

I'm inclined to agree with Ben here: we should set this at the OS level first, and we should recommend 16 MB by default (I sort of thought everyone already did this).

I do think there might be some value in splitting out rangefeed and Raft traffic to separate connections though, separating them from other RPC traffic, since these can be fairly high-volume and RPC head-of-line blocking will directly impact foreground tail latencies. We could even consider splitting out a separate connection for Raft snapshots. But the last time I ran some basic benchmarks, this didn't appear to have any significant impact, at least not for the cases I was looking at.

There's likely also other factors at play here, e.g. RaftMaxInflightMsgs and RaftMaxInflightBytes, which affect the per-replica bandwidth-delay product.
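The per-replica angle can be made concrete with the same bandwidth-delay arithmetic: to sustain a target replication rate over a given RTT, the in-flight byte budget (what `RaftMaxInflightBytes` bounds) must be at least rate × RTT. The numbers below are illustrative, not tuning advice:

```go
package main

import "fmt"

func main() {
	// Bandwidth-delay product: bytes that must be in flight
	// (unacknowledged) to keep a link of the given RTT saturated
	// at the target rate.
	targetMBps := 100.0 // desired per-replica throughput, MB/s (illustrative)
	rttSec := 0.100     // 100ms round-trip time

	bdpMB := targetMBps * rttSec
	fmt.Printf("in-flight budget needed: %.0f MB\n", bdpMB)
	// → in-flight budget needed: 10 MB
}
```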

Let's explore this for 24.1 once the 23.2 branch is cut, it should be straightforward (apart from coming up with representative benchmarks).

@erikgrinaker
Contributor

I opened #111262 to consider a separate Raft connection class, at least, and we have #108992 to enable the rangefeed connection class by default.

@nvanbenschoten
Member Author

I also opened cockroachdb/docs#17924 to track the docs improvement for making this recommendation.

craig bot pushed a commit that referenced this issue Sep 26, 2023
111136: roachprod: fix symlink detection r=srosenberg a=renatolabs

Previously, roachprod `put` would complain about symlinks if `EvalSymlinks(src) != src`. However, that is not always a reliable way to determine if a file is a symbolic link. The documentation for `EvalSymlinks` states:

> If path is relative the result will be relative to the current
> directory.

Therefore, the condition mentioned above is not accurate. In particular, if one ran the command `roachprod put ./cockroach`, the symlink warning would always be displayed: `EvalSymlinks` would return `cockroach` (`./` prefix removed) and the paths would differ.

This commit updates that logic to check for absolute paths instead. If the absolute paths differ, we get the symlink warning, and the warning message now also includes the path to the file that the symlink resolves to.

Epic: none

Release note: None

111169: keys: remove "/Intent" from lock-table key pretty printing r=nvanbenschoten a=nvanbenschoten

Informs #109651.

This commit removes the "/Intent" portion in the pretty-printing output of lock-table keys. This was confusing, as pretty-printing does not even currently have access to the lock strength because it doesn't have access to the key version. We'll need to fix that to address #109651. Until then, don't lie about the strength.

Release note: None

111255: rpc: permit gRPC stream window sizes up to 64 MB r=knz,erikgrinaker a=nvanbenschoten

Informs #111238.
Informs #111241.

CockroachDB accepts a COCKROACH_RPC_INITIAL_WINDOW_SIZE environment variable that can be used to set the initial window size for gRPC streams. However, the setting could previously only be used to reduce the initial window size, not to increase it.

This commit fixes that, allowing the initial window size to be increased up to 64 MB. This is useful for multi-region clusters that want to push high volumes of writes over high-latency WAN links.

Release note (general change): Increased the maximum permitted value of the COCKROACH_RPC_INITIAL_WINDOW_SIZE environment variable to 64MB. When used in conjunction with a tuned OS-level maximum TCP window size, this can increase the throughput that Raft replication can sustain over high-latency network links.

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jan 2, 2024
@nvanbenschoten nvanbenschoten added the P-3 Issues/test failures with no fix SLA label Jan 11, 2024
@pav-kv
Collaborator

pav-kv commented Mar 4, 2024

Based on the discussion above:

@pav-kv pav-kv closed this as completed Mar 4, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024