Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: reuse gRPC streams across unary BatchRequest RPCs #136648

Merged

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Dec 4, 2024

Closes #136572.

This commit introduces pooling of gRPC streams that are used to send requests and receive corresponding responses in a manner that mimics unary RPC invocation. Pooling these streams allows for reuse of gRPC resources across calls, as opposed to native unary RPCs, which create a new stream and throw it away for each request (see grpc.invoke).

The new pooling mechanism is used for the Internal/Batch RPC method, which is the dominant RPC method used to communicate between the KV client and KV server. A new Internal/BatchStream RPC method is introduced to allow a client to send and receive BatchRequest/BatchResponse pairs over a long-lived, pooled stream. A pool of these streams is then maintained alongside each gRPC connection. The pool grows and shrinks dynamically based on demand.

The change demonstrates a large performance improvement in both microbenchmarks and full system benchmarks, which reveals just how expensive the gRPC stream setup on each unary RPC is.

Microbenchmarks:

name                                            old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_point_select-10     45.9µs ± 1%    28.8µs ± 2%  -37.31%  (p=0.000 n=9+8)
Sysbench/KV/1node_remote/oltp_read_only-10         958µs ± 6%     709µs ± 1%  -26.00%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       3.65ms ± 6%    2.81ms ± 8%  -23.06%  (p=0.000 n=8+9)
Sysbench/KV/1node_remote/oltp_read_write-10       1.77ms ± 5%    1.38ms ± 1%  -22.09%  (p=0.000 n=10+8)
Sysbench/KV/1node_remote/oltp_write_only-10        688µs ± 4%     557µs ± 1%  -19.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_point_select-10     181µs ± 8%     159µs ± 2%  -12.10%  (p=0.000 n=8+9)
Sysbench/SQL/1node_remote/oltp_write_only-10      2.16ms ± 4%    1.92ms ± 3%  -11.08%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_write-10      5.89ms ± 2%    5.36ms ± 1%   -8.89%  (p=0.000 n=9+9)

name                                            old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10     16.3kB ± 0%     6.4kB ± 0%  -60.70%  (p=0.000 n=8+10)
Sysbench/KV/1node_remote/oltp_write_only-10        359kB ± 1%     256kB ± 1%  -28.92%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_write_only-10       748kB ± 0%     548kB ± 1%  -26.78%  (p=0.000 n=8+10)
Sysbench/SQL/1node_remote/oltp_point_select-10    40.9kB ± 0%    30.8kB ± 0%  -24.74%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_write-10       1.11MB ± 1%    0.88MB ± 1%  -21.17%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_write-10      2.00MB ± 0%    1.65MB ± 0%  -17.60%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_only-10         790kB ± 0%     655kB ± 0%  -17.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       1.33MB ± 0%    1.19MB ± 0%  -10.97%  (p=0.000 n=10+9)

name                                            old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10        210 ± 0%        61 ± 0%  -70.95%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10         3.98k ± 0%     1.88k ± 0%  -52.68%  (p=0.019 n=6+8)
Sysbench/KV/1node_remote/oltp_read_write-10        7.10k ± 0%     3.47k ± 0%  -51.07%  (p=0.000 n=10+9)
Sysbench/KV/1node_remote/oltp_write_only-10        3.10k ± 0%     1.58k ± 0%  -48.89%  (p=0.000 n=10+9)
Sysbench/SQL/1node_remote/oltp_write_only-10       6.73k ± 0%     3.82k ± 0%  -43.30%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_read_write-10       14.4k ± 0%      9.2k ± 0%  -36.29%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_point_select-10       429 ± 0%       277 ± 0%  -35.46%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_only-10        7.52k ± 0%     5.37k ± 0%  -28.60%  (p=0.000 n=10+10)

Roachtests:

name                                            old queries/s  new queries/s  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64     17.6k ± 7%     19.2k ± 2%  +9.22%  (p=0.008 n=5+5)

name                                            old avg_ms/op  new avg_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64      72.9 ± 7%      66.6 ± 2%  -8.57%  (p=0.008 n=5+5)

name                                            old p95_ms/op  new p95_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64       116 ± 8%       106 ± 3%  -9.02%  (p=0.016 n=5+5)

Manual tests:

Running in a similar configuration to sysbench/oltp_read_write/nodes=3/cpu=8/conc=64, but with a benchmarking related cluster settings (before and after) to reduce variance.

-- Before
Mean: 19771.03
Median: 19714.22
Standard Deviation: 282.96
Coefficient of variance: .0143

-- After
Mean: 21908.23
Median: 21923.03
Standard Deviation: 200.88
Coefficient of variance: .0091

Release note (performance improvement): gRPC streams are now pooled across unary intra-cluster RPCs, allowing for reuse of gRPC resources to reduce the cost of remote key-value layer access. This pooling can be disabled using the rpc.batch_stream_pool.enabled cluster setting.

@nvanbenschoten nvanbenschoten requested a review from tbg December 4, 2024 00:48
@nvanbenschoten nvanbenschoten requested review from a team as code owners December 4, 2024 00:48
@nvanbenschoten nvanbenschoten requested review from kyle-a-wong, arjunmahishi and aa-joshi and removed request for a team December 4, 2024 00:48
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg requested a review from cthumuluru-crdb December 4, 2024 10:09
@tbg
Copy link
Member

tbg commented Dec 4, 2024

This looks great, thank you for putting that together so quickly.

tbg added a commit to tbg/cockroach that referenced this pull request Dec 6, 2024
I cherry-picked the stream reuse PR cockroachdb#136648 so that both are
more likely to evolve in unison, and because I anticipated
"piggybacking" on top of it. This did not materialize, but
the important integration point is now side by side and should
lend itself well to a future change that adds stream reuse for
drpc as well.

This commit roughly encompasses the following:

1.  we change `*rpc.Connection` to also maintain a `drpcpool.Conn`.

It took me a while to grok how the `drpcpool.Pool` is architected.
Essentially, its `Get()` method gives you a handle to a `drpcpool.Conn`
that reflects a specific use case of the pool. When a client is created
on it, it gets assigned an actual `drpc.Conn` from the pool (dialing if
necessary), and when the client is closed, this conn is returned to the
pool.

So in a sense, `drpcpool.Conn` is an actual pool for some keyed use
case; `drpcpool.Pool` is simply the bucket in which the actual
connections get pooled, but they're never pulled from or returned to it
directly. We don't currently use the key since we create a pool per
remote node, and if we're not sharing TCP conns they all look the same
to us anyway (i.e. there's no point in DefaultClass vs SystemClass).

If we squint, a `drpcoool.Conn` parallels `*grpc.ClientConn` in the sense
that you can "just" make multiple clients on top of it.  Of course
internally a `*grpc.ClientConn` multiplexes multiple clients over one
HTTP2 connection, whereas a `drpcpool.Conn` represents multiple
independent TCP connections.

The lifecycle of these pools is currently completely broken, and I was
honestly surprised TestTestServerDRPC passes! Future commits will need
to clean this up to the point where we can at least run a representative
benchmark.
@nvanbenschoten nvanbenschoten added the o-perf-efficiency Related to performance efficiency label Dec 6, 2024
@nvanbenschoten nvanbenschoten changed the title [prototype] rpc: reuse gRPC streams across unary BatchRequest RPCs rpc: reuse gRPC streams across unary BatchRequest RPCs Dec 7, 2024
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 99a78b7 to 5b2e383 Compare December 7, 2024 05:59
@nvanbenschoten nvanbenschoten requested review from a team as code owners December 7, 2024 05:59
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 5b2e383 to f334557 Compare December 7, 2024 06:46
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Dec 7, 2024
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from f334557 to 69262d5 Compare December 7, 2024 07:10
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Dec 7, 2024
@nvanbenschoten
Copy link
Member Author

@tbg this should be good for a review, at your convenience. You should also be able to take this and extend it to work with dRPC, though it will need some changes if dRPC does not support multiple streams sharing a single TCP connection.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 69262d5 to 2c2eb87 Compare December 7, 2024 17:07
Copy link

blathers-crl bot commented Dec 7, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch 2 times, most recently from 5ca46ff to 4596eb9 Compare December 7, 2024 20:45
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. A few small comments but :lgtm_strong:
Are there any roachtests we think might find new unexpected behavior? For example, anything that has to do with DistSQL circuit breaking or failover (failover/*?). Looking at the code I don't see what that something might be, so I'm not sure it's worth running them manually, especially since KV owns the failover tests anyway.

Reviewed 1 of 1 files at r4, 22 of 22 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @cthumuluru-crdb, @kyle-a-wong, and @nvanbenschoten)


pkg/rpc/nodedialer/nodedialer.go line 342 at r5 (raw file):

		return client
	}
	return &BatchStreamPoolClient{InternalClient: client, pool: pool}

Curious if you think this deserves to be pooled. It's not a large allocation, but it is an allocation; at what point is the hassle of pooling it worth it? I think it should be easy to pool this one.


pkg/rpc/stream_pool.go line 220 at r5 (raw file):

// Bind sets the gRPC connection and context for the streamPool. This must be
// called before streamPool.Send.

// called once before streamPool.Send.

Code quote:

// called before streamPool.Send.

pkg/rpc/stream_pool_test.go line 290 at r5 (raw file):

	require.Nil(t, res.resp)
	require.Error(t, res.err)
	require.ErrorIs(t, res.err, context.Canceled)

And the stream should not be back in the pool, right?

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing: In the rel note, do you want to mention the cluster setting to turn off this feature?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @cthumuluru-crdb, @kyle-a-wong, and @nvanbenschoten)

tbg added a commit to tbg/cockroach that referenced this pull request Dec 9, 2024
I cherry-picked the stream reuse PR cockroachdb#136648 so that both are
more likely to evolve in unison, and because I anticipated
"piggybacking" on top of it. This did not materialize, but
the important integration point is now side by side and should
lend itself well to a future change that adds stream reuse for
drpc as well.

This commit roughly encompasses the following:

1.  we change `*rpc.Connection` to also maintain a `drpcpool.Conn`.

It took me a while to grok how the `drpcpool.Pool` is architected.
Essentially, its `Get()` method gives you a handle to a `drpcpool.Conn`
that reflects a specific use case of the pool. When a client is created
on it, it gets assigned an actual `drpc.Conn` from the pool (dialing if
necessary), and when the client is closed, this conn is returned to the
pool.

So in a sense, `drpcpool.Conn` is an actual pool for some keyed use
case; `drpcpool.Pool` is simply the bucket in which the actual
connections get pooled, but they're never pulled from or returned to it
directly. We don't currently use the key since we create a pool per
remote node, and if we're not sharing TCP conns they all look the same
to us anyway (i.e. there's no point in DefaultClass vs SystemClass).

If we squint, a `drpcoool.Conn` parallels `*grpc.ClientConn` in the sense
that you can "just" make multiple clients on top of it.  Of course
internally a `*grpc.ClientConn` multiplexes multiple clients over one
HTTP2 connection, whereas a `drpcpool.Conn` represents multiple
independent TCP connections.

The lifecycle of these pools is currently completely broken, and I was
honestly surprised TestTestServerDRPC passes! Future commits will need
to clean this up to the point where we can at least run a representative
benchmark.
tbg added a commit to tbg/cockroach that referenced this pull request Dec 9, 2024
I cherry-picked the stream reuse PR cockroachdb#136648 so that both are
more likely to evolve in unison, and because I anticipated
"piggybacking" on top of it. This did not materialize, but
the important integration point is now side by side and should
lend itself well to a future change that adds stream reuse for
drpc as well.

This commit roughly encompasses the following:

1.  we change `*rpc.Connection` to also maintain a `drpcpool.Conn`.

It took me a while to grok how the `drpcpool.Pool` is architected.
Essentially, its `Get()` method gives you a handle to a `drpcpool.Conn`
that reflects a specific use case of the pool. When a client is created
on it, it gets assigned an actual `drpc.Conn` from the pool (dialing if
necessary), and when the client is closed, this conn is returned to the
pool.

So in a sense, `drpcpool.Conn` is an actual pool for some keyed use
case; `drpcpool.Pool` is simply the bucket in which the actual
connections get pooled, but they're never pulled from or returned to it
directly. We don't currently use the key since we create a pool per
remote node, and if we're not sharing TCP conns they all look the same
to us anyway (i.e. there's no point in DefaultClass vs SystemClass).

If we squint, a `drpcoool.Conn` parallels `*grpc.ClientConn` in the sense
that you can "just" make multiple clients on top of it.  Of course
internally a `*grpc.ClientConn` multiplexes multiple clients over one
HTTP2 connection, whereas a `drpcpool.Conn` represents multiple
independent TCP connections.

The lifecycle of these pools is currently completely broken, and I was
honestly surprised TestTestServerDRPC passes! Future commits will need
to clean this up to the point where we can at least run a representative
benchmark.
Closes cockroachdb#136572.

This commit introduces pooling of gRPC streams that are used to send requests
and receive corresponding responses in a manner that mimics unary RPC
invocation. Pooling these streams allows for reuse of gRPC resources across
calls, as opposed to native unary RPCs, which create a new stream and throw it
away for each request (see grpc.invoke).

The new pooling mechanism is used for the Internal/Batch RPC method, which is
the dominant RPC method used to communicate between the KV client and KV server.
A new Internal/BatchStream RPC method is introduced to allow a client to send
and receive BatchRequest/BatchResponse pairs over a long-lived, pooled stream. A
pool of these streams is then maintained alongside each gRPC connection. The
pool grows and shrinks dynamically based on demand.

The change demonstrates a large performance improvement in both microbenchmarks
and full system benchmarks, which reveals just how expensive the gRPC stream
setup on each unary RPC is.

Microbenchmarks:
```
name                                            old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_point_select-10     45.9µs ± 1%    28.8µs ± 2%  -37.31%  (p=0.000 n=9+8)
Sysbench/KV/1node_remote/oltp_read_only-10         958µs ± 6%     709µs ± 1%  -26.00%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       3.65ms ± 6%    2.81ms ± 8%  -23.06%  (p=0.000 n=8+9)
Sysbench/KV/1node_remote/oltp_read_write-10       1.77ms ± 5%    1.38ms ± 1%  -22.09%  (p=0.000 n=10+8)
Sysbench/KV/1node_remote/oltp_write_only-10        688µs ± 4%     557µs ± 1%  -19.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_point_select-10     181µs ± 8%     159µs ± 2%  -12.10%  (p=0.000 n=8+9)
Sysbench/SQL/1node_remote/oltp_write_only-10      2.16ms ± 4%    1.92ms ± 3%  -11.08%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_write-10      5.89ms ± 2%    5.36ms ± 1%   -8.89%  (p=0.000 n=9+9)

name                                            old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10     16.3kB ± 0%     6.4kB ± 0%  -60.70%  (p=0.000 n=8+10)
Sysbench/KV/1node_remote/oltp_write_only-10        359kB ± 1%     256kB ± 1%  -28.92%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_write_only-10       748kB ± 0%     548kB ± 1%  -26.78%  (p=0.000 n=8+10)
Sysbench/SQL/1node_remote/oltp_point_select-10    40.9kB ± 0%    30.8kB ± 0%  -24.74%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_write-10       1.11MB ± 1%    0.88MB ± 1%  -21.17%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_write-10      2.00MB ± 0%    1.65MB ± 0%  -17.60%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_only-10         790kB ± 0%     655kB ± 0%  -17.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       1.33MB ± 0%    1.19MB ± 0%  -10.97%  (p=0.000 n=10+9)

name                                            old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10        210 ± 0%        61 ± 0%  -70.95%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10         3.98k ± 0%     1.88k ± 0%  -52.68%  (p=0.019 n=6+8)
Sysbench/KV/1node_remote/oltp_read_write-10        7.10k ± 0%     3.47k ± 0%  -51.07%  (p=0.000 n=10+9)
Sysbench/KV/1node_remote/oltp_write_only-10        3.10k ± 0%     1.58k ± 0%  -48.89%  (p=0.000 n=10+9)
Sysbench/SQL/1node_remote/oltp_write_only-10       6.73k ± 0%     3.82k ± 0%  -43.30%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_read_write-10       14.4k ± 0%      9.2k ± 0%  -36.29%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_point_select-10       429 ± 0%       277 ± 0%  -35.46%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_only-10        7.52k ± 0%     5.37k ± 0%  -28.60%  (p=0.000 n=10+10)
```

Roachtests:
```
name                                            old queries/s  new queries/s  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64     17.6k ± 7%     19.2k ± 2%  +9.22%  (p=0.008 n=5+5)

name                                            old avg_ms/op  new avg_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64      72.9 ± 7%      66.6 ± 2%  -8.57%  (p=0.008 n=5+5)

name                                            old p95_ms/op  new p95_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64       116 ± 8%       106 ± 3%  -9.02%  (p=0.016 n=5+5)
```

Manual tests:
```
Running in a similar configuration to sysbench/oltp_read_write/nodes=3/cpu=8/conc=64,
but with a benchmarking related cluster settings (before and after) to reduce variance.

-- Before
Mean: 19771.03
Median: 19714.22
Standard Deviation: 282.96
Coefficient of variance: .0143

-- After
Mean: 21908.23
Median: 21923.03
Standard Deviation: 200.88
Coefficient of variance: .0091
```

Release note (performance improvement): gRPC streams are now pooled
across unary intra-cluster RPCs, allowing for reuse of gRPC resources
to reduce the cost of remote key-value layer access. This pooling can
be disabled using the rpc.batch_stream_pool.enabled cluster setting.
This commit rearranges the InternalClient implementations returned by
Dialer.DialInternalClient to avoid multiple heap allocations.

With gRPC stream pooling:
```
name                                           old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_point_select-10    30.7µs ± 6%    30.3µs ± 4%    ~     (p=0.436 n=9+9)
Sysbench/KV/1node_remote/oltp_read_only-10        743µs ± 3%     736µs ± 1%    ~     (p=0.721 n=8+8)
Sysbench/KV/1node_remote/oltp_read_write-10      1.50ms ± 3%    1.51ms ± 6%    ~     (p=0.730 n=9+9)

name                                           old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10    6.61kB ± 0%    6.58kB ± 0%  -0.49%  (p=0.000 n=9+8)
Sysbench/KV/1node_remote/oltp_read_only-10        656kB ± 0%     656kB ± 0%    ~     (p=0.743 n=8+9)
Sysbench/KV/1node_remote/oltp_read_write-10       883kB ± 0%     882kB ± 1%    ~     (p=0.143 n=10+10)

name                                           old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10      63.0 ± 0%      61.0 ± 0%  -3.17%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10        1.90k ± 0%     1.87k ± 0%  -1.49%  (p=0.000 n=10+8)
Sysbench/KV/1node_remote/oltp_read_write-10       3.51k ± 0%     3.46k ± 0%  -1.35%  (p=0.000 n=9+8)
```

Without gRPC stream pooling:
```
name                                           old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_read_write-10      1.88ms ± 3%    1.83ms ± 2%  -2.16%  (p=0.029 n=10+10)
Sysbench/KV/1node_remote/oltp_point_select-10    46.9µs ± 1%    46.7µs ± 1%  -0.49%  (p=0.022 n=9+10)
Sysbench/KV/1node_remote/oltp_read_only-10        976µs ± 2%     973µs ± 4%    ~     (p=0.497 n=9+10)

name                                           old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10    16.3kB ± 0%    16.2kB ± 0%  -0.06%  (p=0.000 n=8+7)
Sysbench/KV/1node_remote/oltp_read_only-10        788kB ± 0%     788kB ± 0%    ~     (p=0.353 n=10+10)
Sysbench/KV/1node_remote/oltp_read_write-10      1.11MB ± 0%    1.11MB ± 0%    ~     (p=0.436 n=10+10)

name                                           old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10       209 ± 0%       208 ± 0%  -0.48%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_write-10       7.07k ± 0%     7.04k ± 0%  -0.36%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10        3.95k ± 0%     3.94k ± 0%  -0.35%  (p=0.008 n=6+7)
```

Epic: None
Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 4596eb9 to 7f4ffeb Compare December 10, 2024 01:23
This will help prototype drpc stream pooling.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/gRPCUnaryReuse branch from 7f4ffeb to db408d2 Compare December 10, 2024 01:44
@nvanbenschoten nvanbenschoten requested a review from tbg December 10, 2024 02:36
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Are there any roachtests we think might find new unexpected behavior? For example, anything that has to do with DistSQL circuit breaking or failover (failover/*?). Looking at the code I don't see what that something might be, so I'm not sure it's worth running them manually, especially since KV owns the failover tests anyway.

This is a good question. I'm not aware of any new failure modes with this, because we were already just laying unary RPCs over long-lived gRPC connections, which are the actual resource that can fail. I'll keep an eye on the failover tests over the next few weeks (which I've been doing anyway for leader leases).

One more thing: In the rel note, do you want to mention the cluster setting to turn off this feature?

Done.

The "rpc: make batch stream pool general over Conn" commit from here might be something you could cherry-pick into your grpc stream pooling PR, btw. Especially if you're going to make any code changes in that PR I'd appreciate that.

(from Slack)

Done.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @cthumuluru-crdb, @kyle-a-wong, and @tbg)


pkg/rpc/stream_pool.go line 220 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// called once before streamPool.Send.

Done.


pkg/rpc/nodedialer/nodedialer.go line 342 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Curious if you think this deserves to be pooled. It's not a large allocation, but it is an allocation; at what point is the hassle of pooling it worth it? I think it should be easy to pool this one.

Done in a new commit using a similar approach to what @RaduBerinde suggested in #136941.

I'll let you give that commit a review before merging.


pkg/rpc/stream_pool_test.go line 290 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

And the stream should not be back in the pool, right?

Done.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice!

Reviewed 6 of 6 files at r6, 2 of 2 files at r7, 3 of 3 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @cthumuluru-crdb, @kyle-a-wong, and @nvanbenschoten)


-- commits line 118 at r7:
nice that this shows up!

@nvanbenschoten
Copy link
Member Author

TFTR!

bors r=tbg

craig bot pushed a commit that referenced this pull request Dec 10, 2024
136648: rpc: reuse gRPC streams across unary BatchRequest RPCs r=tbg a=nvanbenschoten

Closes #136572.

This commit introduces pooling of gRPC streams that are used to send requests and receive corresponding responses in a manner that mimics unary RPC invocation. Pooling these streams allows for reuse of gRPC resources across calls, as opposed to native unary RPCs, which create a new stream and throw it away for each request (see grpc.invoke).

The new pooling mechanism is used for the Internal/Batch RPC method, which is the dominant RPC method used to communicate between the KV client and KV server. A new Internal/BatchStream RPC method is introduced to allow a client to send and receive BatchRequest/BatchResponse pairs over a long-lived, pooled stream. A pool of these streams is then maintained alongside each gRPC connection. The pool grows and shrinks dynamically based on demand.

The change demonstrates a large performance improvement in both microbenchmarks and full system benchmarks, which reveals just how expensive the gRPC stream setup on each unary RPC is.

### Microbenchmarks:
```
name                                            old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_point_select-10     45.9µs ± 1%    28.8µs ± 2%  -37.31%  (p=0.000 n=9+8)
Sysbench/KV/1node_remote/oltp_read_only-10         958µs ± 6%     709µs ± 1%  -26.00%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       3.65ms ± 6%    2.81ms ± 8%  -23.06%  (p=0.000 n=8+9)
Sysbench/KV/1node_remote/oltp_read_write-10       1.77ms ± 5%    1.38ms ± 1%  -22.09%  (p=0.000 n=10+8)
Sysbench/KV/1node_remote/oltp_write_only-10        688µs ± 4%     557µs ± 1%  -19.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_point_select-10     181µs ± 8%     159µs ± 2%  -12.10%  (p=0.000 n=8+9)
Sysbench/SQL/1node_remote/oltp_write_only-10      2.16ms ± 4%    1.92ms ± 3%  -11.08%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_write-10      5.89ms ± 2%    5.36ms ± 1%   -8.89%  (p=0.000 n=9+9)

name                                            old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10     16.3kB ± 0%     6.4kB ± 0%  -60.70%  (p=0.000 n=8+10)
Sysbench/KV/1node_remote/oltp_write_only-10        359kB ± 1%     256kB ± 1%  -28.92%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_write_only-10       748kB ± 0%     548kB ± 1%  -26.78%  (p=0.000 n=8+10)
Sysbench/SQL/1node_remote/oltp_point_select-10    40.9kB ± 0%    30.8kB ± 0%  -24.74%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_write-10       1.11MB ± 1%    0.88MB ± 1%  -21.17%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_write-10      2.00MB ± 0%    1.65MB ± 0%  -17.60%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_only-10         790kB ± 0%     655kB ± 0%  -17.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       1.33MB ± 0%    1.19MB ± 0%  -10.97%  (p=0.000 n=10+9)

name                                            old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10        210 ± 0%        61 ± 0%  -70.95%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10         3.98k ± 0%     1.88k ± 0%  -52.68%  (p=0.019 n=6+8)
Sysbench/KV/1node_remote/oltp_read_write-10        7.10k ± 0%     3.47k ± 0%  -51.07%  (p=0.000 n=10+9)
Sysbench/KV/1node_remote/oltp_write_only-10        3.10k ± 0%     1.58k ± 0%  -48.89%  (p=0.000 n=10+9)
Sysbench/SQL/1node_remote/oltp_write_only-10       6.73k ± 0%     3.82k ± 0%  -43.30%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_read_write-10       14.4k ± 0%      9.2k ± 0%  -36.29%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_point_select-10       429 ± 0%       277 ± 0%  -35.46%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_only-10        7.52k ± 0%     5.37k ± 0%  -28.60%  (p=0.000 n=10+10)
```

### Roachtests:
```
name                                            old queries/s  new queries/s  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64     17.6k ± 7%     19.2k ± 2%  +9.22%  (p=0.008 n=5+5)

name                                            old avg_ms/op  new avg_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64      72.9 ± 7%      66.6 ± 2%  -8.57%  (p=0.008 n=5+5)

name                                            old p95_ms/op  new p95_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64       116 ± 8%       106 ± 3%  -9.02%  (p=0.016 n=5+5)
```

### Manual tests:
Running in a similar configuration to `sysbench/oltp_read_write/nodes=3/cpu=8/conc=64`, but with a benchmarking related cluster settings (before and after) to reduce variance.
```
-- Before
Mean: 19771.03
Median: 19714.22
Standard Deviation: 282.96
Coefficient of variance: .0143

-- After
Mean: 21908.23
Median: 21923.03
Standard Deviation: 200.88
Coefficient of variance: .0091
```

----

Release note (performance improvement): gRPC streams are now pooled across unary intra-cluster RPCs, allowing for reuse of gRPC resources to reduce the cost of remote key-value layer access. This pooling can be disabled using the `rpc.batch_stream_pool.enabled` cluster setting.

137059: catalog/lease: deflake TestDescriptorRefreshOnRetry r=rafiss a=rafiss

The test was flaky since the background thread to refresh leases could run and cause the acquisition counts to be off.

fixes #137033
Release note: None

137067: roachtest: update mt-upgrade test owner to db-server r=rimadeodhar a=rimadeodhar

This PR updates the test ownership for the multitenant-upgrade test to the DB Server team. All future test failures will be routed to `t-db-server` for triage.

Epic: none
Release note: none

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: rimadeodhar <[email protected]>
@craig
Copy link
Contributor

craig bot commented Dec 10, 2024

Build failed (retrying...):

@craig craig bot merged commit 95d95a6 into cockroachdb:master Dec 10, 2024
23 checks passed
tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2024
I cherry-picked the stream reuse PR cockroachdb#136648 so that both are
more likely to evolve in unison, and because I anticipated
"piggybacking" on top of it. This did not materialize, but
the important integration point is now side by side and should
lend itself well to a future change that adds stream reuse for
drpc as well.

This commit roughly encompasses the following:

1.  we change `*rpc.Connection` to also maintain a `drpcpool.Conn`.

It took me a while to grok how the `drpcpool.Pool` is architected.
Essentially, its `Get()` method gives you a handle to a `drpcpool.Conn`
that reflects a specific use case of the pool. When a client is created
on it, it gets assigned an actual `drpc.Conn` from the pool (dialing if
necessary), and when the client is closed, this conn is returned to the
pool.

So in a sense, `drpcpool.Conn` is an actual pool for some keyed use
case; `drpcpool.Pool` is simply the bucket in which the actual
connections get pooled, but they're never pulled from or returned to it
directly. We don't currently use the key since we create a pool per
remote node, and if we're not sharing TCP conns they all look the same
to us anyway (i.e. there's no point in DefaultClass vs SystemClass).

If we squint, a `drpcoool.Conn` parallels `*grpc.ClientConn` in the sense
that you can "just" make multiple clients on top of it.  Of course
internally a `*grpc.ClientConn` multiplexes multiple clients over one
HTTP2 connection, whereas a `drpcpool.Conn` represents multiple
independent TCP connections.

The lifecycle of these pools is currently completely broken, and I was
honestly surprised TestTestServerDRPC passes! Future commits will need
to clean this up to the point where we can at least run a representative
benchmark.
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/gRPCUnaryReuse branch December 13, 2024 00:14
tbg added a commit to tbg/cockroach that referenced this pull request Dec 16, 2024
I cherry-picked the stream reuse PR cockroachdb#136648 so that both are
more likely to evolve in unison, and because I anticipated
"piggybacking" on top of it. This did not materialize, but
the important integration point is now side by side and should
lend itself well to a future change that adds stream reuse for
drpc as well.

This commit roughly encompasses the following:

1.  we change `*rpc.Connection` to also maintain a `drpcpool.Conn`.

It took me a while to grok how the `drpcpool.Pool` is architected.
Essentially, its `Get()` method gives you a handle to a `drpcpool.Conn`
that reflects a specific use case of the pool. When a client is created
on it, it gets assigned an actual `drpc.Conn` from the pool (dialing if
necessary), and when the client is closed, this conn is returned to the
pool.

So in a sense, `drpcpool.Conn` is an actual pool for some keyed use
case; `drpcpool.Pool` is simply the bucket in which the actual
connections get pooled, but they're never pulled from or returned to it
directly. We don't currently use the key since we create a pool per
remote node, and if we're not sharing TCP conns they all look the same
to us anyway (i.e. there's no point in DefaultClass vs SystemClass).

If we squint, a `drpcoool.Conn` parallels `*grpc.ClientConn` in the sense
that you can "just" make multiple clients on top of it.  Of course
internally a `*grpc.ClientConn` multiplexes multiple clients over one
HTTP2 connection, whereas a `drpcpool.Conn` represents multiple
independent TCP connections.

The lifecycle of these pools is currently completely broken, and I was
honestly surprised TestTestServerDRPC passes! Future commits will need
to clean this up to the point where we can at least run a representative
benchmark.
tbg added a commit to tbg/cockroach that referenced this pull request Jan 6, 2025
I cherry-picked the stream reuse PR cockroachdb#136648 so that both are
more likely to evolve in unison, and because I anticipated
"piggybacking" on top of it. This did not materialize, but
the important integration point is now side by side and should
lend itself well to a future change that adds stream reuse for
drpc as well.

This commit roughly encompasses the following:

1.  we change `*rpc.Connection` to also maintain a `drpcpool.Conn`.

It took me a while to grok how the `drpcpool.Pool` is architected.
Essentially, its `Get()` method gives you a handle to a `drpcpool.Conn`
that reflects a specific use case of the pool. When a client is created
on it, it gets assigned an actual `drpc.Conn` from the pool (dialing if
necessary), and when the client is closed, this conn is returned to the
pool.

So in a sense, `drpcpool.Conn` is an actual pool for some keyed use
case; `drpcpool.Pool` is simply the bucket in which the actual
connections get pooled, but they're never pulled from or returned to it
directly. We don't currently use the key since we create a pool per
remote node, and if we're not sharing TCP conns they all look the same
to us anyway (i.e. there's no point in DefaultClass vs SystemClass).

If we squint, a `drpcoool.Conn` parallels `*grpc.ClientConn` in the sense
that you can "just" make multiple clients on top of it.  Of course
internally a `*grpc.ClientConn` multiplexes multiple clients over one
HTTP2 connection, whereas a `drpcpool.Conn` represents multiple
independent TCP connections.

The lifecycle of these pools is currently completely broken, and I was
honestly surprised TestTestServerDRPC passes! Future commits will need
to clean this up to the point where we can at least run a representative
benchmark.
cthumuluru-crdb added a commit to cthumuluru-crdb/cockroach that referenced this pull request Jan 8, 2025
Reuse drpc streams across unary operations. This is similar to the
functionality implemented for gRPC in github.com/cockroachdb/pull/136648.

Epic: None
Release note: None
craig bot pushed a commit that referenced this pull request Jan 14, 2025
138368: all: host a DRPC server and use it for node-node BatchRequests r=cthumuluru-crdb a=cthumuluru-crdb

drpc (github.com/storj/drpc) is a lightweight, drop-in replacement for gRPC.  Initial benchmark results from the Perf Efficiency team, using a simple ping service (github.com/cockroachlabs/perf-efficiency-team/tree/main/rpctoy),
demonstrate that switching from gRPC to drpc delivers significant performance improvements.

```
$ benchdiff --old lastmerge ./pkg/sql/tests -b -r 'Sysbench/SQL/3node/oltp_read_write' -d 1000x -c 10

name                                   old time/op    new time/op    delta
Sysbench/SQL/3node/oltp_read_write-24    14.0ms ± 2%    14.1ms ± 1%     ~     (p=0.063 n=10+10)

name                                   old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24    2.55MB ± 1%    2.26MB ± 2%  -11.39%  (p=0.000 n=9+9)

name                                   old allocs/op  new allocs/op  delta
Sysbench/SQL/3node/oltp_read_write-24     17.2k ± 1%     11.7k ± 0%  -32.30%  (p=0.000 n=10+8)
```

This pull request introduces experimental support for using drpc to serve BatchRequests. The feature is disabled by default but can be enabled explicitly by setting the `COCKROACH_EXPERIMENTAL_DRPC_ENABLED` environment variable or automatically in CRDB test builds. The prototype lacks key features, such as authorization checks, interceptors, etc, and is not production-ready.

When DRPC is disabled, sever, muxer and listener mock implementations are injected. If it is enabled real implementations are used. DRPC is only used for BatchRequests. This PR also added support for pooled streaming unary operations similar to #136648.

Added unit tests to ensure:
- CRDB nodes can server BatchRequests.
- Simple queries when DRPC is enabled or disabled.

Note: drpc protocol generation is not yet integrated into the build process. Protobuf files were generated manually and committed directly to the repository.

This PR has a few parts:

- it adds a copy of drpc generated code for a service that supports unary BatchRequest.
- in `rpc.NewServerEx`, we now also set up a drpc server and return it up the stack.
- Registers BatchRequest service with both DRPC and gRPC. If DRPC is enabled, it is
used to make BatchRequests, if not gRPC is used.
- in our listener setup, we configure cmux to match on the `DRPC!!!1` header and serve the resulting listener on the drpc server. The drpc example uses `drpcmigrate.ListenMux` instead of cmux; we keep cmux but must make sure the header is read off the connection before delegating (which the drpxmigrate mux would have done for us).
- if using TLS, wrap the drpc listener with TLS config and use it to servr DRPC requests.
- add support to reuse drpc streams across unary BatchRequests. However, the DRPC implementation is not on par with the gRPC counterpart in terms of allocation optimizations.

Closes #136794.

Epic: CRDB-42584
Release note: None

Co-authored-by: Chandra Thumuluru <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
o-perf-efficiency Related to performance efficiency
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rpc: reuse gRPC streams across unary BatchRequest invocations (~ 11.2% cpu)
3 participants