rpc: reuse gRPC streams across unary BatchRequest RPCs #136648

Merged

Conversation


@nvanbenschoten nvanbenschoten commented Dec 4, 2024

Closes #136572.

This commit introduces pooling of gRPC streams that are used to send requests and receive corresponding responses in a manner that mimics unary RPC invocation. Pooling these streams allows for reuse of gRPC resources across calls, as opposed to native unary RPCs, which create a new stream and throw it away for each request (see grpc.invoke).

The new pooling mechanism is used for the Internal/Batch RPC method, which is the dominant RPC method used to communicate between the KV client and KV server. A new Internal/BatchStream RPC method is introduced to allow a client to send and receive BatchRequest/BatchResponse pairs over a long-lived, pooled stream. A pool of these streams is then maintained alongside each gRPC connection. The pool grows and shrinks dynamically based on demand.
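
To make the mechanism more concrete, below is a minimal sketch of the idea in Go. The names are hypothetical and do not match the actual pkg/rpc implementation; the sketch only assumes a long-lived stream over which a client can send BatchRequests and receive BatchResponses, in the spirit of the Internal/BatchStream method described above.

```go
// Minimal sketch of unary-over-stream pooling. Hypothetical names; not the
// actual CockroachDB pkg/rpc code.
package rpcpool

import (
	"context"
	"sync"
)

// BatchRequest and BatchResponse stand in for the real protobuf types.
type BatchRequest struct{}
type BatchResponse struct{}

// batchStream abstracts the client side of a BatchStream-style RPC:
// send one request, receive one response, keep the stream open for reuse.
type batchStream interface {
	Send(*BatchRequest) error
	Recv() (*BatchResponse, error)
}

// streamPool keeps idle streams alongside a connection so that unary calls
// can reuse them instead of setting up a fresh gRPC stream per request.
type streamPool struct {
	newStream func(context.Context) (batchStream, error) // dials a new stream on demand

	mu   sync.Mutex
	idle []batchStream
}

// Send borrows a stream (growing the pool if none are idle), performs one
// request/response exchange, and returns the stream to the pool on success.
// On error the stream is presumed broken and is dropped, not re-pooled.
func (p *streamPool) Send(ctx context.Context, ba *BatchRequest) (*BatchResponse, error) {
	s, err := p.get(ctx)
	if err != nil {
		return nil, err
	}
	if err := s.Send(ba); err != nil {
		return nil, err
	}
	br, err := s.Recv()
	if err != nil {
		return nil, err
	}
	p.put(s)
	return br, nil
}

func (p *streamPool) get(ctx context.Context) (batchStream, error) {
	p.mu.Lock()
	if n := len(p.idle); n > 0 {
		s := p.idle[n-1]
		p.idle = p.idle[:n-1]
		p.mu.Unlock()
		return s, nil
	}
	p.mu.Unlock()
	return p.newStream(ctx) // pool is empty: grow on demand
}

func (p *streamPool) put(s batchStream) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, s)
}
```

As in the sketch, a stream that hits an error is dropped rather than returned to the pool; the dynamic shrinking mentioned above is omitted for brevity.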

The change demonstrates a large performance improvement in both microbenchmarks and full system benchmarks, which reveals just how expensive the gRPC stream setup on each unary RPC is.

Microbenchmarks:

name                                            old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_point_select-10     45.9µs ± 1%    28.8µs ± 2%  -37.31%  (p=0.000 n=9+8)
Sysbench/KV/1node_remote/oltp_read_only-10         958µs ± 6%     709µs ± 1%  -26.00%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       3.65ms ± 6%    2.81ms ± 8%  -23.06%  (p=0.000 n=8+9)
Sysbench/KV/1node_remote/oltp_read_write-10       1.77ms ± 5%    1.38ms ± 1%  -22.09%  (p=0.000 n=10+8)
Sysbench/KV/1node_remote/oltp_write_only-10        688µs ± 4%     557µs ± 1%  -19.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_point_select-10     181µs ± 8%     159µs ± 2%  -12.10%  (p=0.000 n=8+9)
Sysbench/SQL/1node_remote/oltp_write_only-10      2.16ms ± 4%    1.92ms ± 3%  -11.08%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_write-10      5.89ms ± 2%    5.36ms ± 1%   -8.89%  (p=0.000 n=9+9)

name                                            old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10     16.3kB ± 0%     6.4kB ± 0%  -60.70%  (p=0.000 n=8+10)
Sysbench/KV/1node_remote/oltp_write_only-10        359kB ± 1%     256kB ± 1%  -28.92%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_write_only-10       748kB ± 0%     548kB ± 1%  -26.78%  (p=0.000 n=8+10)
Sysbench/SQL/1node_remote/oltp_point_select-10    40.9kB ± 0%    30.8kB ± 0%  -24.74%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_write-10       1.11MB ± 1%    0.88MB ± 1%  -21.17%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_write-10      2.00MB ± 0%    1.65MB ± 0%  -17.60%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_only-10         790kB ± 0%     655kB ± 0%  -17.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       1.33MB ± 0%    1.19MB ± 0%  -10.97%  (p=0.000 n=10+9)

name                                            old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10        210 ± 0%        61 ± 0%  -70.95%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10         3.98k ± 0%     1.88k ± 0%  -52.68%  (p=0.019 n=6+8)
Sysbench/KV/1node_remote/oltp_read_write-10        7.10k ± 0%     3.47k ± 0%  -51.07%  (p=0.000 n=10+9)
Sysbench/KV/1node_remote/oltp_write_only-10        3.10k ± 0%     1.58k ± 0%  -48.89%  (p=0.000 n=10+9)
Sysbench/SQL/1node_remote/oltp_write_only-10       6.73k ± 0%     3.82k ± 0%  -43.30%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_read_write-10       14.4k ± 0%      9.2k ± 0%  -36.29%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_point_select-10       429 ± 0%       277 ± 0%  -35.46%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_only-10        7.52k ± 0%     5.37k ± 0%  -28.60%  (p=0.000 n=10+10)

Roachtests:

name                                            old queries/s  new queries/s  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64     17.6k ± 7%     19.2k ± 2%  +9.22%  (p=0.008 n=5+5)

name                                            old avg_ms/op  new avg_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64      72.9 ± 7%      66.6 ± 2%  -8.57%  (p=0.008 n=5+5)

name                                            old p95_ms/op  new p95_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64       116 ± 8%       106 ± 3%  -9.02%  (p=0.016 n=5+5)

Manual tests:

Running in a configuration similar to sysbench/oltp_read_write/nodes=3/cpu=8/conc=64, but with benchmarking-related cluster settings applied (before and after) to reduce variance.

-- Before
Mean: 19771.03
Median: 19714.22
Standard Deviation: 282.96
Coefficient of variation: .0143

-- After
Mean: 21908.23
Median: 21923.03
Standard Deviation: 200.88
Coefficient of variation: .0091

Release note (performance improvement): gRPC streams are now pooled across unary intra-cluster RPCs, allowing for reuse of gRPC resources to reduce the cost of remote key-value layer access. This pooling can be disabled using the rpc.batch_stream_pool.enabled cluster setting.
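
For example, an operator could disable the feature with `SET CLUSTER SETTING rpc.batch_stream_pool.enabled = false;`, falling back to plain unary Batch RPCs.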

@nvanbenschoten nvanbenschoten requested a review from tbg December 4, 2024 00:48
@nvanbenschoten nvanbenschoten requested review from a team as code owners December 4, 2024 00:48
@nvanbenschoten nvanbenschoten requested review from kyle-a-wong, arjunmahishi and aa-joshi and removed request for a team December 4, 2024 00:48
@cockroach-teamcity

This change is Reviewable

@tbg tbg requested a review from cthumuluru-crdb December 4, 2024 10:09

tbg commented Dec 4, 2024

This looks great, thank you for putting that together so quickly.

tbg added a commit to tbg/cockroach that referenced this pull request Dec 6, 2024
I cherry-picked the stream reuse PR cockroachdb#136648 so that both are
more likely to evolve in unison, and because I anticipated
"piggybacking" on top of it. This did not materialize, but
the important integration point is now side by side and should
lend itself well to a future change that adds stream reuse for
drpc as well.

This commit roughly encompasses the following:

1.  we change `*rpc.Connection` to also maintain a `drpcpool.Conn`.

It took me a while to grok how the `drpcpool.Pool` is architected.
Essentially, its `Get()` method gives you a handle to a `drpcpool.Conn`
that reflects a specific use case of the pool. When a client is created
on it, it gets assigned an actual `drpc.Conn` from the pool (dialing if
necessary), and when the client is closed, this conn is returned to the
pool.

So in a sense, `drpcpool.Conn` is an actual pool for some keyed use
case; `drpcpool.Pool` is simply the bucket in which the actual
connections get pooled, but they're never pulled from or returned to it
directly. We don't currently use the key since we create a pool per
remote node, and if we're not sharing TCP conns they all look the same
to us anyway (i.e. there's no point in DefaultClass vs SystemClass).

If we squint, a `drpcpool.Conn` parallels `*grpc.ClientConn` in the sense
that you can "just" make multiple clients on top of it.  Of course
internally a `*grpc.ClientConn` multiplexes multiple clients over one
HTTP2 connection, whereas a `drpcpool.Conn` represents multiple
independent TCP connections.

The lifecycle of these pools is currently completely broken, and I was
honestly surprised TestTestServerDRPC passes! Future commits will need
to clean this up to the point where we can at least run a representative
benchmark.
@nvanbenschoten nvanbenschoten added the o-perf-efficiency Related to performance efficiency label Dec 6, 2024
@nvanbenschoten nvanbenschoten changed the title [prototype] rpc: reuse gRPC streams across unary BatchRequest RPCs rpc: reuse gRPC streams across unary BatchRequest RPCs Dec 7, 2024
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 99a78b7 to 5b2e383 Compare December 7, 2024 05:59
@nvanbenschoten nvanbenschoten requested review from a team as code owners December 7, 2024 05:59
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 5b2e383 to f334557 Compare December 7, 2024 06:46
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Dec 7, 2024
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from f334557 to 69262d5 Compare December 7, 2024 07:10
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Dec 7, 2024
@nvanbenschoten

@tbg this should be good for a review, at your convenience. You should also be able to take this and extend it to work with dRPC, though it will need some changes if dRPC does not support multiple streams sharing a single TCP connection.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 69262d5 to 2c2eb87 Compare December 7, 2024 17:07

blathers-crl bot commented Dec 7, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch 2 times, most recently from 5ca46ff to 4596eb9 Compare December 7, 2024 20:45

@tbg tbg left a comment


Looks great. A few small comments but :lgtm_strong:
Are there any roachtests we think might find new unexpected behavior? For example, anything that has to do with DistSQL circuit breaking or failover (failover/*?). Looking at the code I don't see what that something might be, so I'm not sure it's worth running them manually, especially since KV owns the failover tests anyway.

Reviewed 1 of 1 files at r4, 22 of 22 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @cthumuluru-crdb, @kyle-a-wong, and @nvanbenschoten)


pkg/rpc/nodedialer/nodedialer.go line 342 at r5 (raw file):

		return client
	}
	return &BatchStreamPoolClient{InternalClient: client, pool: pool}

Curious if you think this deserves to be pooled. It's not a large allocation, but it is an allocation; at what point is the hassle of pooling it worth it? I think it should be easy to pool this one.
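
For illustration, one way to recycle such a small wrapper is a sync.Pool. This is only a sketch of the suggestion, not necessarily the approach the PR ended up taking; the names poolClient, clientPool, getPoolClient, and putPoolClient are invented.

```go
// Hypothetical sketch of pooling the per-dial wrapper allocation.
package nodedialer

import "sync"

type InternalClient interface{} // stand-in for the generated gRPC client interface
type streamPool struct{}        // stand-in for the per-connection stream pool

// poolClient plays the role of BatchStreamPoolClient: a tiny wrapper that
// routes Batch requests through the stream pool.
type poolClient struct {
	InternalClient
	pool *streamPool
}

var clientPool = sync.Pool{
	New: func() any { return &poolClient{} },
}

// getPoolClient reuses a previously released wrapper when one is available,
// avoiding a fresh heap allocation on every dial.
func getPoolClient(c InternalClient, p *streamPool) *poolClient {
	pc := clientPool.Get().(*poolClient)
	pc.InternalClient = c
	pc.pool = p
	return pc
}

// putPoolClient resets the wrapper and makes it available for reuse.
func putPoolClient(pc *poolClient) {
	*pc = poolClient{}
	clientPool.Put(pc)
}
```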


pkg/rpc/stream_pool.go line 220 at r5 (raw file):

// Bind sets the gRPC connection and context for the streamPool. This must be
// called before streamPool.Send.

// called once before streamPool.Send.

Code quote:

// called before streamPool.Send.

pkg/rpc/stream_pool_test.go line 290 at r5 (raw file):

	require.Nil(t, res.resp)
	require.Error(t, res.err)
	require.ErrorIs(t, res.err, context.Canceled)

And the stream should not be back in the pool, right?


@tbg tbg left a comment


One more thing: In the rel note, do you want to mention the cluster setting to turn off this feature?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @cthumuluru-crdb, @kyle-a-wong, and @nvanbenschoten)

This commit rearranges the InternalClient implementations returned by
Dialer.DialInternalClient to avoid multiple heap allocations.

With gRPC stream pooling:
```
name                                           old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_point_select-10    30.7µs ± 6%    30.3µs ± 4%    ~     (p=0.436 n=9+9)
Sysbench/KV/1node_remote/oltp_read_only-10        743µs ± 3%     736µs ± 1%    ~     (p=0.721 n=8+8)
Sysbench/KV/1node_remote/oltp_read_write-10      1.50ms ± 3%    1.51ms ± 6%    ~     (p=0.730 n=9+9)

name                                           old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10    6.61kB ± 0%    6.58kB ± 0%  -0.49%  (p=0.000 n=9+8)
Sysbench/KV/1node_remote/oltp_read_only-10        656kB ± 0%     656kB ± 0%    ~     (p=0.743 n=8+9)
Sysbench/KV/1node_remote/oltp_read_write-10       883kB ± 0%     882kB ± 1%    ~     (p=0.143 n=10+10)

name                                           old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10      63.0 ± 0%      61.0 ± 0%  -3.17%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10        1.90k ± 0%     1.87k ± 0%  -1.49%  (p=0.000 n=10+8)
Sysbench/KV/1node_remote/oltp_read_write-10       3.51k ± 0%     3.46k ± 0%  -1.35%  (p=0.000 n=9+8)
```

Without gRPC stream pooling:
```
name                                           old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_read_write-10      1.88ms ± 3%    1.83ms ± 2%  -2.16%  (p=0.029 n=10+10)
Sysbench/KV/1node_remote/oltp_point_select-10    46.9µs ± 1%    46.7µs ± 1%  -0.49%  (p=0.022 n=9+10)
Sysbench/KV/1node_remote/oltp_read_only-10        976µs ± 2%     973µs ± 4%    ~     (p=0.497 n=9+10)

name                                           old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10    16.3kB ± 0%    16.2kB ± 0%  -0.06%  (p=0.000 n=8+7)
Sysbench/KV/1node_remote/oltp_read_only-10        788kB ± 0%     788kB ± 0%    ~     (p=0.353 n=10+10)
Sysbench/KV/1node_remote/oltp_read_write-10      1.11MB ± 0%    1.11MB ± 0%    ~     (p=0.436 n=10+10)

name                                           old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10       209 ± 0%       208 ± 0%  -0.48%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_write-10       7.07k ± 0%     7.04k ± 0%  -0.36%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10        3.95k ± 0%     3.94k ± 0%  -0.35%  (p=0.008 n=6+7)
```

Epic: None
Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 4596eb9 to 7f4ffeb Compare December 10, 2024 01:23
This will help prototype drpc stream pooling.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 7f4ffeb to db408d2 Compare December 10, 2024 01:44
@nvanbenschoten nvanbenschoten requested a review from tbg December 10, 2024 02:36

@nvanbenschoten nvanbenschoten left a comment


TFTR!

Are there any roachtests we think might find new unexpected behavior? For example, anything that has to do with DistSQL circuit breaking or failover (failover/*?). Looking at the code I don't see what that something might be, so I'm not sure it's worth running them manually, especially since KV owns the failover tests anyway.

This is a good question. I'm not aware of any new failure modes with this, because we were already just layering unary RPCs over long-lived gRPC connections, which are the actual resource that can fail. I'll keep an eye on the failover tests over the next few weeks (which I've been doing anyway for leader leases).

One more thing: In the rel note, do you want to mention the cluster setting to turn off this feature?

Done.

The "rpc: make batch stream pool general over Conn" commit from here might be something you could cherry-pick into your grpc stream pooling PR, btw. Especially if you're going to make any code changes in that PR I'd appreciate that.

(from Slack)

Done.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @cthumuluru-crdb, @kyle-a-wong, and @tbg)


pkg/rpc/stream_pool.go line 220 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// called once before streamPool.Send.

Done.


pkg/rpc/nodedialer/nodedialer.go line 342 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Curious if you think this deserves to be pooled. It's not a large allocation, but it is an allocation; at what point is the hassle of pooling it worth it? I think it should be easy to pool this one.

Done in a new commit using a similar approach to what @RaduBerinde suggested in #136941.

I'll let you give that commit a review before merging.


pkg/rpc/stream_pool_test.go line 290 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

And the stream should not be back in the pool, right?

Done.


@tbg tbg left a comment


:lgtm: nice!

Reviewed 6 of 6 files at r6, 2 of 2 files at r7, 3 of 3 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @cthumuluru-crdb, @kyle-a-wong, and @nvanbenschoten)


-- commits line 118 at r7:
nice that this shows up!

@nvanbenschoten

TFTR!

bors r=tbg

craig bot pushed a commit that referenced this pull request Dec 10, 2024
136648: rpc: reuse gRPC streams across unary BatchRequest RPCs r=tbg a=nvanbenschoten


137059: catalog/lease: deflake TestDescriptorRefreshOnRetry r=rafiss a=rafiss

The test was flaky since the background thread to refresh leases could run and cause the acquisition counts to be off.

fixes #137033
Release note: None

137067: roachtest: update mt-upgrade test owner to db-server r=rimadeodhar a=rimadeodhar

This PR updates the test ownership for the multitenant-upgrade test to the DB Server team. All future test failures will be routed to `t-db-server` for triage.

Epic: none
Release note: none

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: rimadeodhar <[email protected]>
@craig

craig bot commented Dec 10, 2024

Build failed (retrying...):

@craig craig bot merged commit 95d95a6 into cockroachdb:master Dec 10, 2024
23 checks passed
tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2024
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/gRPCUnaryReuse branch December 13, 2024 00:14
Labels
o-perf-efficiency Related to performance efficiency

Successfully merging this pull request may close these issues.

rpc: reuse gRPC streams across unary BatchRequest invocations (~ 11.2% cpu)