
kv: partition raft transport over high RTT links across multiple TCP connections #111238

Closed
nvanbenschoten opened this issue Sep 25, 2023 · 6 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-multiregion Related to multi-region C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-kv KV Team

Comments

@nvanbenschoten
Member

nvanbenschoten commented Sep 25, 2023

Bandwidth-delay product principles dictate that the maximum throughput a system can push over a single network link is limited by the round-trip time ("delay") and the window size ("amount of unacknowledged bytes allowed"). Even high bandwidth links can have a relatively low effective throughput if they also have a high delay ("long fat networks"). For more details, see https://networklessons.com/cisco/ccnp-route/bandwidth-delay-product.

In multi-region deployments, such long fat networks (LFNs) are common. Raft is currently not configured to handle them as well as it could be, due to its use of a single TCP connection between nodes and the default maximum window sizes that users typically deploy CockroachDB with. As a result, the Raft transport can be saturated by a sustained write load, leading to instability both on the ranges driving the load and on ranges that are not. Recall that the raft transport is shared across all ranges1 on a node that talk to a common destination node.

Let's talk about window sizes first.

Connection Window Sizes

Over a high-latency link, a single connection's throughput is limited by its window size (in addition to other factors, like packet loss). Modern Linux distributions typically default the maximum TCP window size to 4MB, as we can see in roachprod:

ubuntu@nathan-high-rtt-0001:~$ uname -a
Linux nathan-high-rtt-0001 5.4.0-1044-gcp #47-Ubuntu SMP Tue May 11 15:51:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@nathan-high-rtt-0001:~$ cat /proc/sys/net/ipv4/tcp_window_scaling
1
ubuntu@nathan-high-rtt-0001:~$ cat /proc/sys/net/ipv4/tcp_wmem
4096	16384	4194304
ubuntu@nathan-high-rtt-0001:~$ cat /proc/sys/net/ipv4/tcp_rmem
4096	131072	6291456

CockroachDB goes a step further and limits the per gRPC connection window size to 2MB:

s.values.initialWindowSize = getWindowSize(ctx,
	"COCKROACH_RPC_INITIAL_WINDOW_SIZE", DefaultClass, defaultWindowSize*32)

#35161 tried to replace this constant window size with gRPC's dynamic window resizing, which would let a gRPC connection scale up to the OS's limit. That PR was closed at the time.

Whether 2MB, 4MB, or some other value in the low MBs, these window sizes are all limiting over high-latency links. For example, a 2MB window over a 100ms link caps throughput at 1s/100ms * 2MB = 20MB/s.
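The arithmetic above can be sketched directly (a minimal illustration; the 2MB window and 100ms RTT are the example numbers from this issue, not measured values):

```go
package main

import "fmt"

func main() {
	// A single connection's max throughput is bounded by windowSize / RTT
	// (the bandwidth-delay product relationship).
	windowBytes := 2 << 20 // 2 MiB window, the CRDB per-stream default cited above
	rttSeconds := 0.100    // 100ms round-trip time

	throughputBps := float64(windowBytes) / rttSeconds
	fmt.Printf("max throughput: %.0f MB/s\n", throughputBps/float64(1<<20))
	// → max throughput: 20 MB/s
}
```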

We could adjust these on a case-by-case basis (and should where appropriate), but this still involves CRDB and OS-level manual tuning.

Multiple Connections

Assuming we can't always increase the maximum TCP window size above the OS-level default, we should explore alternatives to increase the raft transport throughput between nodes.

One approach would be to partition the raft transport across multiple gRPC/TCP connections. The simplest scheme would partition connections by range ID, since there is no need to maintain ordering of raft messages across ranges. Even within a range, ordering is not assumed for correctness, though excessive reordering can hurt performance.
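A partitioning scheme like the one described might look like the following sketch (hypothetical names; `pickConn` and `numConns` are illustrative, not CockroachDB APIs). Keying on range ID keeps each range's messages on one connection, preserving per-range ordering while letting different ranges use different connections:

```go
package main

import "fmt"

// pickConn maps a range ID onto one of numConns per-destination-node
// connections. All messages for a given range land on the same
// connection, so intra-range ordering is mostly preserved, while the
// aggregate transport can use numConns TCP windows' worth of in-flight
// data.
func pickConn(rangeID uint64, numConns int) int {
	return int(rangeID % uint64(numConns))
}

func main() {
	for _, r := range []uint64{1, 2, 3, 4, 5} {
		fmt.Printf("range %d -> conn %d\n", r, pickConn(r, 4))
	}
}
```

A modulo over range ID is the simplest choice; a real implementation might hash instead to avoid correlated assignment if range IDs cluster.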

Jira issue: CRDB-31830

Epic CRDB-32846

Footnotes

  1. This is ignoring the system range connection class separation, which protects system ranges from instability when the default connection class's raft transport is saturated.

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs A-kv-replication Relating to Raft, consensus, and coordination. A-multiregion Related to multi-region T-kv-replication labels Sep 25, 2023
@blathers-crl

blathers-crl bot commented Sep 25, 2023

cc @cockroachdb/replication

@bdarnell
Contributor

Note that there are two levels of window size in play: there's the TCP window size, and then HTTP/2 (which underlies GRPC) has its own similar window size mechanism. COCKROACH_RPC_INITIAL_WINDOW_SIZE (and subsequent comments about dynamic sizing) refers to the HTTP/2 window size, not the TCP one. (there's also a third one, HTTP/2's connection-level window, but IIRC we set that one to effectively infinite)

Modern Linux distributions typically default the maximum TCP window size to 4MB, like we can see in roachprod:

The first google result for tcp_wmem recommends setting the maximum to 16MB or higher "especially for 10 Gigabit adapters". https://www.ibm.com/docs/en/linux-on-systems?topic=tuning-tcpip-ipv4-settings. This default is either outdated or overly conservative and we should probably recommend increasing it for multi-region CRDB deployments.
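As a sketch only (the 16MB values follow the IBM tuning guide linked above; exact numbers should be validated per deployment, and this is not an official CockroachDB recommendation), raising the OS-level limits might look like:

```shell
# Raise the maximum TCP send/receive buffer sizes to 16MB,
# leaving min/default at common values. To persist across reboots,
# place these in a file under /etc/sysctl.d/ instead.
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"
sudo sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_max=16777216
```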

My sense is that we should be able to tune the kernel parameters and GRPC for higher throughput in this scenario. I'd consider spreading the traffic across multiple TCP connections to be a last resort if the only consideration is increasing throughput. (There may be other benefits of exploring this path, such as reducing the impact of dropped packets).

Longer-term, QUIC is nice for getting the kernel TCP implementation out of the loop completely. And we might consider sending raft log messages individually instead of as a single HTTP/2 or QUIC stream (we can reconstruct dropped messages from our local copy of the log, so we don't need to have that data duplicated in the kernel's tcp buffer and be limited by tcp_wmem)

@erikgrinaker
Contributor

erikgrinaker commented Sep 26, 2023

I'm inclined to agree with Ben here: we should set this at the OS level first, and we should recommend 16 MB by default (I sort of thought everyone already did this).

I do think there might be some value in splitting out rangefeed and Raft traffic to separate connections though, separating them from other RPC traffic, since these can be fairly high-volume and RPC head-of-line blocking will directly impact foreground tail latencies. We could even consider splitting out a separate connection for Raft snapshots. But the last time I ran some basic benchmarks, this didn't appear to have any significant impact, at least not for the cases I was looking at.

There's likely also other factors at play here, e.g. RaftMaxInflightMsgs and RaftMaxInflightBytes, which affect the per-replica bandwidth-delay product.
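The per-replica angle can be made concrete with the same bandwidth-delay arithmetic: to sustain a target replication rate over a given RTT, the in-flight byte budget (what `RaftMaxInflightBytes` bounds) must be at least rate × RTT. The numbers below are illustrative, not tuning advice:

```go
package main

import "fmt"

func main() {
	// Bandwidth-delay product: bytes that must be in flight
	// (unacknowledged) to keep a link of the given RTT saturated
	// at the target rate.
	targetMBps := 100.0 // desired per-replica throughput, MB/s (illustrative)
	rttSec := 0.100     // 100ms round-trip time

	bdpMB := targetMBps * rttSec
	fmt.Printf("in-flight budget needed: %.0f MB\n", bdpMB)
	// → in-flight budget needed: 10 MB
}
```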

Let's explore this for 24.1 once the 23.2 branch is cut, it should be straightforward (apart from coming up with representative benchmarks).

@erikgrinaker
Contributor

I opened #111262 to consider a separate Raft connection class, at least, and we have #108992 to enable the rangefeed connection class by default.

@nvanbenschoten
Member Author

I also opened cockroachdb/docs#17924 to track the docs improvement for making this recommendation.

craig bot pushed a commit that referenced this issue Sep 26, 2023
111136: roachprod: fix symlink detection r=srosenberg a=renatolabs

Previously, roachprod `put` would complain about symlinks if `EvalSymlinks(src) != src`. However, that is not always a reliable way to determine if a file is a symbolic link. The documentation for `EvalSymlinks` states:

> If path is relative the result will be relative to the current
> directory.

Therefore, the condition mentioned above is not accurate. In particular, if one ran the command `roachprod put ./cockroach`, the symlink warning would always be displayed: `EvalSymlinks` would return `cockroach` (`./` prefix removed) and the paths would differ.

This commit updates that logic to check for absolute paths instead. If the absolute paths differ, we get the symlink warning, and the warning message now also includes the path to the file that the symlink resolves to.

Epic: none

Release note: None

111169: keys: remove "/Intent" from lock-table key pretty printing r=nvanbenschoten a=nvanbenschoten

Informs #109651.

This commit removes the "/Intent" portion in the pretty-printing output of lock-table keys. This was confusing, as pretty-printing does not even currently have access to the lock strength because it doesn't have access to the key version. We'll need to fix that to address #109651. Until then, don't lie about the strength.

Release note: None

111255: rpc: permit gRPC stream window sizes up to 64 MB r=knz,erikgrinaker a=nvanbenschoten

Informs #111238.
Informs #111241.

CockroachDB accepts a COCKROACH_RPC_INITIAL_WINDOW_SIZE environment variable that can be used to set the initial window size for gRPC streams. However, the setting could previously only be used to reduce the initial window size, not to increase it.

This commit fixes that, allowing the initial window size to be increased up to 64 MB. This is useful for multi-region clusters that want to push high volumes of writes over high-latency WAN links.

Release note (general change): Increased the maximum permitted value of the COCKROACH_RPC_INITIAL_WINDOW_SIZE environment variable to 64MB. When used in conjunction with a tuned OS-level maximum TCP window size, this can increase the throughput that Raft replication can sustain over high-latency network links.

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jan 2, 2024
@nvanbenschoten nvanbenschoten added the P-3 Issues/test failures with no fix SLA label Jan 11, 2024
@pav-kv
Collaborator

pav-kv commented Mar 4, 2024

Based on the discussion above:

@pav-kv pav-kv closed this as completed Mar 4, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024