kv: partition raft transport over high RTT links across multiple TCP connections #111238
Comments
cc @cockroachdb/replication
Note that there are two levels of window size in play: there's the TCP window size, and then HTTP/2 (which underlies gRPC) has its own similar window-size mechanism. COCKROACH_RPC_INITIAL_WINDOW_SIZE (and the subsequent comments about dynamic sizing) refers to the HTTP/2 window size, not the TCP one. (There's also a third one, HTTP/2's connection-level window, but IIRC we set that one to effectively infinite.)
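To make the layering concrete: the effective throughput ceiling of a stream is set by the smallest window on the path, divided by the RTT. A minimal Go sketch (the window values here are illustrative assumptions, not necessarily any deployment's actual configuration):

```go
package main

import "fmt"

// maxThroughput returns the throughput ceiling, in bytes per second,
// imposed by a window of windowBytes over a link with the given RTT.
func maxThroughput(windowBytes int, rttSeconds float64) float64 {
	return float64(windowBytes) / rttSeconds
}

func main() {
	const mb = 1 << 20
	// Illustrative values: a 4 MB max TCP window and a 2 MB HTTP/2 stream window.
	tcpWindow, http2StreamWindow := 4*mb, 2*mb
	// The binding constraint is the smallest window on the path.
	window := tcpWindow
	if http2StreamWindow < window {
		window = http2StreamWindow
	}
	fmt.Printf("ceiling over 100ms RTT: %.0f MB/s\n", maxThroughput(window, 0.100)/mb)
}
```

With these numbers the 2 MB HTTP/2 stream window, not the 4 MB TCP window, is the binding constraint, which is why tuning only the OS-level limit is not sufficient.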
My sense is that we should be able to tune the kernel parameters and gRPC for higher throughput in this scenario. I'd consider spreading the traffic across multiple TCP connections a last resort if the only consideration is increasing throughput. (There may be other benefits to exploring this path, such as reducing the impact of dropped packets.) Longer-term, QUIC is nice for getting the kernel TCP implementation out of the loop completely. And we might consider sending raft log messages individually instead of as a single HTTP/2 or QUIC stream (we can reconstruct dropped messages from our local copy of the log, so we don't need that data duplicated in the kernel's TCP buffer and limited by …)
I'm inclined to agree with Ben here: we should set this at the OS level first, and we should recommend 16 MB by default (I sort of thought everyone already did this). I do think there might be some value in splitting rangefeed and Raft traffic out to separate connections though, separating them from other RPC traffic, since these can be fairly high-volume and RPC head-of-line blocking will directly impact foreground tail latencies. We could even consider splitting out a separate connection for Raft snapshots. But the last time I ran some basic benchmarks, this didn't appear to have any significant impact, at least not for the cases I was looking at. There are likely also other factors at play here, e.g. … Let's explore this for 24.1 once the 23.2 branch is cut; it should be straightforward (apart from coming up with representative benchmarks).
I also opened cockroachdb/docs#17924 to track the docs improvement for making this recommendation.
111136: roachprod: fix symlink detection r=srosenberg a=renatolabs

Previously, roachprod `put` would complain about symlinks if `EvalSymlinks(src) != src`. However, that is not always a reliable way to determine if a file is a symbolic link. The documentation for `EvalSymlinks` states:

> If path is relative the result will be relative to the current directory.

Therefore, the condition mentioned above is not accurate. In particular, if one ran the command `roachprod put ./cockroach`, the symlink warning would always be displayed: `EvalSymlinks` would return `cockroach` (`./` prefix removed) and the paths would differ.

This commit updates that logic to check for absolute paths instead. If the absolute paths differ, we get the symlink warning, and the warning message now also includes the path to the file that the symlink resolves to.

Epic: none
Release note: None

111169: keys: remove "/Intent" from lock-table key pretty printing r=nvanbenschoten a=nvanbenschoten

Informs #109651.

This commit removes the "/Intent" portion in the pretty-printing output of lock-table keys. This was confusing, as pretty-printing does not even currently have access to the lock strength, because it doesn't have access to the key version. We'll need to fix that to address #109651. Until then, don't lie about the strength.

Release note: None

111255: rpc: permit gRPC stream window sizes up to 64 MB r=knz,erikgrinaker a=nvanbenschoten

Informs #111238. Informs #111241.

CockroachDB accepts a COCKROACH_RPC_INITIAL_WINDOW_SIZE environment variable that can be used to set the initial window size for gRPC streams. However, the setting could previously only be used to reduce the initial window size, not to increase it. This commit fixes that, allowing the initial window size to be increased up to 64 MB. This is useful for multi-region clusters that want to push high volumes of writes over high-latency WAN links.

Release note (general change): Increased the maximum permitted value of the COCKROACH_RPC_INITIAL_WINDOW_SIZE environment variable to 64MB. When used in conjunction with a tuned OS-level maximum TCP window size, this can increase the throughput that Raft replication can sustain over high-latency network links.

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Based on the discussion above:
Bandwidth-delay product principles dictate that the maximum throughput a system can push over a single network link is limited by the round-trip time ("delay") and the window size ("amount of unacknowledged bytes allowed"). Even high bandwidth links can have a relatively low effective throughput if they also have a high delay ("long fat networks"). For more details, see https://networklessons.com/cisco/ccnp-route/bandwidth-delay-product.
In multi-region deployments, such long fat networks (LFNs) are common. Raft is currently not configured to handle them as well as it could be, due to its use of a single TCP connection between nodes and the default maximum window sizes users typically deploy CockroachDB with. As a result, the Raft transport can be saturated by a sustained write load, leading to instability both on the ranges driving the load and on other ranges that are not. Recall that the raft transport is shared across all ranges¹ on a node that talk to a common destination node.
Let's talk about window sizes first.
Connection Window Sizes
Over a high-latency link, a single connection's throughput is limited by its window size (in addition to other factors, like packet loss). Modern Linux distributions typically default the maximum TCP window size to 4MB, as we can see in roachprod:
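On Linux, the kernel's maximum auto-tuned receive window is the third field of the `net.ipv4.tcp_rmem` sysctl. A small Go sketch that extracts it (the sample string below is an illustrative default, not output captured from roachprod; on a real host you would read `/proc/sys/net/ipv4/tcp_rmem`):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxTCPWindow extracts the maximum auto-tuned receive buffer (in bytes)
// from the three-field contents of /proc/sys/net/ipv4/tcp_rmem,
// which holds "min default max".
func maxTCPWindow(tcpRmem string) (int, error) {
	fields := strings.Fields(tcpRmem)
	if len(fields) != 3 {
		return 0, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	return strconv.Atoi(fields[2])
}

func main() {
	// Illustrative value; read /proc/sys/net/ipv4/tcp_rmem on a real host.
	sample := "4096 131072 4194304"
	maxWindow, err := maxTCPWindow(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("max TCP window: %d MB\n", maxWindow>>20)
}
```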
CockroachDB goes a step further and limits the per-gRPC-connection window size to 2MB:
cockroach/pkg/rpc/settings.go
Lines 90 to 91 in 967fe5e
#35161 tried to replace this constant window size with gRPC's dynamic window resizing, which would let a gRPC connection scale up to the OS's limit. That PR was closed at the time.
Whether 2MB, 4MB, or some other value in the low MBs, these window sizes are all limiting over high-latency links. For example, a 2MB window size over a 100ms link limits throughput to 1s/100ms * 2MB = 20MB/s. We could adjust these on a case-by-case basis (and should where appropriate), but this still involves manual CRDB- and OS-level tuning.
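Conversely, the window needed to sustain a target throughput is the bandwidth-delay product itself. A minimal Go sketch of that inverse calculation (the target figure is an illustrative assumption):

```go
package main

import "fmt"

// requiredWindow returns the window size, in bytes, needed to sustain
// targetBytesPerSec over a link with the given RTT in seconds. This is
// the bandwidth-delay product.
func requiredWindow(targetBytesPerSec, rttSeconds float64) float64 {
	return targetBytesPerSec * rttSeconds
}

func main() {
	const mb = 1 << 20
	// Illustrative target: 200 MB/s of sustained raft traffic over a 100ms WAN link.
	w := requiredWindow(200*mb, 0.100)
	fmt.Printf("required window: %.0f MB\n", w/mb)
}
```

With these numbers the required window is 20 MB, an order of magnitude above the 2-4 MB defaults, which motivates either raising the window limits (as #111255 permits) or spreading the load across connections.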
Multiple Connections
Assuming we can't always increase the maximum TCP window size above the OS-level default, we should explore alternatives to increase the raft transport throughput between nodes.
One approach would be to partition the raft transport across multiple gRPC/TCP connections. The simplest scheme would be to partition the transport's traffic based on range ID, as there is no need to maintain the ordering of raft messages across ranges. Even within a range, ordering is not assumed for correctness, but too much reordering can hurt performance.
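A minimal sketch of the range-ID-based partitioning (the type names and the modulo scheme are illustrative assumptions, not the actual transport code):

```go
package main

import "fmt"

// connPool models a fixed set of gRPC/TCP connections to one destination node.
type connPool struct {
	conns []string // stand-ins for real connections
}

// connFor deterministically maps a range to one connection, so all raft
// messages for a given range stay on the same connection (preserving
// per-range ordering), while different ranges spread across the pool.
func (p *connPool) connFor(rangeID uint64) string {
	return p.conns[rangeID%uint64(len(p.conns))]
}

func main() {
	pool := &connPool{conns: []string{"conn-0", "conn-1", "conn-2", "conn-3"}}
	for _, r := range []uint64{7, 8, 9, 7} {
		fmt.Printf("range %d -> %s\n", r, pool.connFor(r))
	}
}
```

The deterministic mapping is the key design choice: it preserves per-range message ordering (which matters for performance, per the text above) while letting the aggregate window capacity scale with the number of connections.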
Jira issue: CRDB-31830
Epic CRDB-32846
Footnotes
This is ignoring the system-range connection class separation, which protects system ranges from instability when the default connection class's raft transport is saturated. ↩