Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/catalog/lease: TestDescriptorRefreshOnRetry failed #137033

Closed
cockroach-teamcity opened this issue Dec 9, 2024 · 2 comments · Fixed by #137059
Closed

sql/catalog/lease: TestDescriptorRefreshOnRetry failed #137033

cockroach-teamcity opened this issue Dec 9, 2024 · 2 comments · Fixed by #137059
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Dec 9, 2024

sql/catalog/lease.TestDescriptorRefreshOnRetry failed with artifacts on release-23.2 @ b14741589afbd8876aa6818bc0bbe07ecc4da017:

        	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:163 +0x22
        golang.org/x/sync/errgroup.(*Group).Go.func1()
        	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78 +0x56
        created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 201018
        	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x96
        
        goroutine 204774 [select]:
        github.com/cockroachdb/cockroach/pkg/sql.(*runnerCoordinator).init.func1({0xc015aed800, 0xc01344dc01})
        	github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:156 +0xee
        github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:479 +0x13a
        created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 204744
        	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x415
        
        goroutine 199747 [select]:
        github.com/cockroachdb/cockroach/pkg/sql.(*runnerCoordinator).init.func1({0xc00a9f2c00, 0x0})
        	github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:156 +0xee
        github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:479 +0x13a
        created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 199718
        	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x415
        
        goroutine 200610 [select]:
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).processEvents(0xc005a3ba20, {0x62966d8, 0xc00f32aff0}, 0xc011dc86c0?, 0xc003f7dec0)
        	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:351 +0xd4
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).run.func2({0x62966d8?, 0xc00f32aff0?})
        	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:308 +0x32
        github.com/cockroachdb/cockroach/pkg/util/ctxgroup.GoAndWait.Group.GoCtx.func1()
        	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:163 +0x22
        golang.org/x/sync/errgroup.(*Group).Go.func1()
        	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78 +0x56
        created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 199943
        	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x96
        
        goroutine 205048 [select]:
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).processEvents(0xc0183cedc0, {0x62966d8, 0xc018af6a00}, 0xc012b33110?, 0xc0183dea20)
        	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:351 +0xd4
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).run.func2({0x62966d8?, 0xc018af6a00?})
        	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:308 +0x32
        github.com/cockroachdb/cockroach/pkg/util/ctxgroup.GoAndWait.Group.GoCtx.func1()
        	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:163 +0x22
        golang.org/x/sync/errgroup.(*Group).Go.func1()
        	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78 +0x56
        created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 204964
        	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x96
        
    lease_test.go:970: invalid descriptor acquisition counts = 2, 0
    panic.go:523: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/ce8d73b4a4081f39d36a7fa2c2508fcc/logTestDescriptorRefreshOnRetry13277273
--- FAIL: TestDescriptorRefreshOnRetry (2.48s)

Parameters:

  • TAGS=bazel,gss
  • stress=true
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-45366

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Dec 9, 2024
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Dec 9, 2024
@rafiss
Copy link
Collaborator

rafiss commented Dec 9, 2024

I think we finally have the info to debug this (previously seen in #134695 and other issues). These are the the lease acquirer stack traces:

I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031  lease acquirer stack trace: ‹goroutine 207075 [running]:›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹runtime/debug.Stack()›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹	GOROOT/src/runtime/debug/stack.go:24 +0x5e›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹github.com/cockroachdb/cockroach/pkg/sql/catalog/lease_test.TestDescriptorRefreshOnRetry.func1({0x7fc3db1799a8?, 0xc00f816f00?}, {0x2?, 0x6296710?})›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹	github.com/cockroachdb/cockroach/pkg/sql/catalog/lease_test/pkg/sql/catalog/lease/lease_test.go:915 +0x65›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire({0xc015af8c40, {0x62748a0, 0xc01003c630}, 0xc013b1c6c0, 0xc015a3a000, {{0xc01436fa40, 0xc01436fa58}, {0xc01436fa40}, 0x0}, 0xc00754a730, ...}, ...)›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹	github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/storage.go:199 +0x5e9›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1({0x62966d8, 0xc0050cc320})›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹	github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:554 +0x1a5›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall.func1({0x62966d8?, 0xc0050cc320?})›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹	github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:384 +0x2a›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc00857a990, {0x62966d8?, 0xc0050cc320}, {0x6240b50?, 0xc00bfd7ea0?}, 0xc00f3cbe68)›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:314 +0xd3›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall(0xc0143a21e0, {0x6296710?, 0xc00c048f60?}, 0xc00818dce0, {0xc0054f436b, 0x3}, {0xc00857a990?, 0x60?}, 0xc00f829940)›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹	github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:383 +0x290›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹created by github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).DoChan in goroutine 206814›
I241209 19:06:47.491768 207075 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1031 +‹	github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:353 +0x52b›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033  lease acquirer stack trace: ‹goroutine 207083 [running]:›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹runtime/debug.Stack()›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹	GOROOT/src/runtime/debug/stack.go:24 +0x5e›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹github.com/cockroachdb/cockroach/pkg/sql/catalog/lease_test.TestDescriptorRefreshOnRetry.func1({0x7fc3db1799a8?, 0xc007b7ea00?}, {0x2?, 0x6296710?})›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹	github.com/cockroachdb/cockroach/pkg/sql/catalog/lease_test/pkg/sql/catalog/lease/lease_test.go:915 +0x65›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire({0xc015af8c40, {0x62748a0, 0xc01003c630}, 0xc013b1c6c0, 0xc015a3a000, {{0xc01436fa40, 0xc01436fa58}, {0xc01436fa40}, 0x0}, 0xc00754a730, ...}, ...)›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹	github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/storage.go:199 +0x5e9›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1({0x62966d8, 0xc0050cdd10})›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹	github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:554 +0x1a5›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall.func1({0x62966d8?, 0xc0050cdd10?})›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹	github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:384 +0x2a›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc00857a990, {0x62966d8?, 0xc0050cdd10}, {0x6240b50?, 0xc00c1b03c0?}, 0xc0049f0e68)›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:314 +0xd3›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall(0xc0143a21e0, {0x6296710?, 0xc00c1b2990?}, 0xc014712f20, {0xc0054f53a8, 0x3}, {0xc00857a990?, 0x60?}, 0xc00fb6ee80)›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹	github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:383 +0x290›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹created by github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).DoChan in goroutine 204867›
I241209 19:06:47.496278 207083 sql/catalog/lease_test/lease_test.go:915 ⋮ [-] 1033 +‹	github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:353 +0x52b›

Goroutine 206814 is the one that ran CREATE TABLE, which is expected.

Goroutine 204867 is the RefreshLeases goroutine. It seems like we may need to adjust this for this test.

        goroutine 204867 [select]:
        github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.(*Manager).RefreshLeases.func1({0x6296710, 0xc015c8be90})
        	github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:1141 +0xf8
        github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:479 +0x13a
        created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 204695
        	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x415

@rafiss rafiss self-assigned this Dec 9, 2024
@exalate-issue-sync exalate-issue-sync bot added P-2 Issues/test failures with a fix SLA of 3 months and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 9, 2024
craig bot pushed a commit that referenced this issue Dec 10, 2024
136648: rpc: reuse gRPC streams across unary BatchRequest RPCs r=tbg a=nvanbenschoten

Closes #136572.

This commit introduces pooling of gRPC streams that are used to send requests and receive corresponding responses in a manner that mimics unary RPC invocation. Pooling these streams allows for reuse of gRPC resources across calls, as opposed to native unary RPCs, which create a new stream and throw it away for each request (see grpc.invoke).

The new pooling mechanism is used for the Internal/Batch RPC method, which is the dominant RPC method used to communicate between the KV client and KV server. A new Internal/BatchStream RPC method is introduced to allow a client to send and receive BatchRequest/BatchResponse pairs over a long-lived, pooled stream. A pool of these streams is then maintained alongside each gRPC connection. The pool grows and shrinks dynamically based on demand.

The change demonstrates a large performance improvement in both microbenchmarks and full system benchmarks, which reveals just how expensive the gRPC stream setup on each unary RPC is.

### Microbenchmarks:
```
name                                            old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_point_select-10     45.9µs ± 1%    28.8µs ± 2%  -37.31%  (p=0.000 n=9+8)
Sysbench/KV/1node_remote/oltp_read_only-10         958µs ± 6%     709µs ± 1%  -26.00%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       3.65ms ± 6%    2.81ms ± 8%  -23.06%  (p=0.000 n=8+9)
Sysbench/KV/1node_remote/oltp_read_write-10       1.77ms ± 5%    1.38ms ± 1%  -22.09%  (p=0.000 n=10+8)
Sysbench/KV/1node_remote/oltp_write_only-10        688µs ± 4%     557µs ± 1%  -19.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_point_select-10     181µs ± 8%     159µs ± 2%  -12.10%  (p=0.000 n=8+9)
Sysbench/SQL/1node_remote/oltp_write_only-10      2.16ms ± 4%    1.92ms ± 3%  -11.08%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_write-10      5.89ms ± 2%    5.36ms ± 1%   -8.89%  (p=0.000 n=9+9)

name                                            old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10     16.3kB ± 0%     6.4kB ± 0%  -60.70%  (p=0.000 n=8+10)
Sysbench/KV/1node_remote/oltp_write_only-10        359kB ± 1%     256kB ± 1%  -28.92%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_write_only-10       748kB ± 0%     548kB ± 1%  -26.78%  (p=0.000 n=8+10)
Sysbench/SQL/1node_remote/oltp_point_select-10    40.9kB ± 0%    30.8kB ± 0%  -24.74%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_write-10       1.11MB ± 1%    0.88MB ± 1%  -21.17%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_write-10      2.00MB ± 0%    1.65MB ± 0%  -17.60%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_only-10         790kB ± 0%     655kB ± 0%  -17.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       1.33MB ± 0%    1.19MB ± 0%  -10.97%  (p=0.000 n=10+9)

name                                            old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10        210 ± 0%        61 ± 0%  -70.95%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10         3.98k ± 0%     1.88k ± 0%  -52.68%  (p=0.019 n=6+8)
Sysbench/KV/1node_remote/oltp_read_write-10        7.10k ± 0%     3.47k ± 0%  -51.07%  (p=0.000 n=10+9)
Sysbench/KV/1node_remote/oltp_write_only-10        3.10k ± 0%     1.58k ± 0%  -48.89%  (p=0.000 n=10+9)
Sysbench/SQL/1node_remote/oltp_write_only-10       6.73k ± 0%     3.82k ± 0%  -43.30%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_read_write-10       14.4k ± 0%      9.2k ± 0%  -36.29%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_point_select-10       429 ± 0%       277 ± 0%  -35.46%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_only-10        7.52k ± 0%     5.37k ± 0%  -28.60%  (p=0.000 n=10+10)
```

### Roachtests:
```
name                                            old queries/s  new queries/s  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64     17.6k ± 7%     19.2k ± 2%  +9.22%  (p=0.008 n=5+5)

name                                            old avg_ms/op  new avg_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64      72.9 ± 7%      66.6 ± 2%  -8.57%  (p=0.008 n=5+5)

name                                            old p95_ms/op  new p95_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64       116 ± 8%       106 ± 3%  -9.02%  (p=0.016 n=5+5)
```

### Manual tests:
Running in a similar configuration to `sysbench/oltp_read_write/nodes=3/cpu=8/conc=64`, but with a benchmarking related cluster settings (before and after) to reduce variance.
```
-- Before
Mean: 19771.03
Median: 19714.22
Standard Deviation: 282.96
Coefficient of variance: .0143

-- After
Mean: 21908.23
Median: 21923.03
Standard Deviation: 200.88
Coefficient of variance: .0091
```

----

Release note (performance improvement): gRPC streams are now pooled across unary intra-cluster RPCs, allowing for reuse of gRPC resources to reduce the cost of remote key-value layer access. This pooling can be disabled using the `rpc.batch_stream_pool.enabled` cluster setting.

137059: catalog/lease: deflake TestDescriptorRefreshOnRetry r=rafiss a=rafiss

The test was flaky since the background thread to refresh leases could run and cause the acquisition counts to be off.

fixes #137033
Release note: None

137067: roachtest: update mt-upgrade test owner to db-server r=rimadeodhar a=rimadeodhar

This PR updates the test ownership for the multitenant-upgrade test to the DB Server team. All future test failures will be routed to `t-db-server` for triage.

Epic: none
Release note: none

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: rimadeodhar <[email protected]>
craig bot pushed a commit that referenced this issue Dec 10, 2024
136258: kvserver: add TestFlowControlSendQueueRangeSplitMerge test  r=sumeerbhola a=kvoli

Add a new rac2 flow control integration test,
`TestFlowControlSendQueueRangeSplitMerge`.

This test takes the following steps:

```sql
-- We will exhaust the tokens across all streams while admission is blocked on
-- n3, using a single 4 MiB (deduction, the write itself is small) write. Then,
-- we will write a 1 MiB put to the range, split it, write a 1 MiB put to the
-- LHS range, merge the ranges, and write a 1 MiB put to the merged range. We
-- expect that at each stage where a send queue develops n1->s3, the send queue
-- will be flushed by the range merge and range split range operations.``sql
```

Note that the RHS is not written to post-split, pre-merge. See the
relevant comments, this will be resolved via #136649, or some variation,
which enforces the timely replication on subsume requests.

Part of: #132614
Release note: None

136648: rpc: reuse gRPC streams across unary BatchRequest RPCs r=tbg a=nvanbenschoten

Closes #136572.

This commit introduces pooling of gRPC streams that are used to send requests and receive corresponding responses in a manner that mimics unary RPC invocation. Pooling these streams allows for reuse of gRPC resources across calls, as opposed to native unary RPCs, which create a new stream and throw it away for each request (see grpc.invoke).

The new pooling mechanism is used for the Internal/Batch RPC method, which is the dominant RPC method used to communicate between the KV client and KV server. A new Internal/BatchStream RPC method is introduced to allow a client to send and receive BatchRequest/BatchResponse pairs over a long-lived, pooled stream. A pool of these streams is then maintained alongside each gRPC connection. The pool grows and shrinks dynamically based on demand.

The change demonstrates a large performance improvement in both microbenchmarks and full system benchmarks, which reveals just how expensive the gRPC stream setup on each unary RPC is.

### Microbenchmarks:
```
name                                            old time/op    new time/op    delta
Sysbench/KV/1node_remote/oltp_point_select-10     45.9µs ± 1%    28.8µs ± 2%  -37.31%  (p=0.000 n=9+8)
Sysbench/KV/1node_remote/oltp_read_only-10         958µs ± 6%     709µs ± 1%  -26.00%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       3.65ms ± 6%    2.81ms ± 8%  -23.06%  (p=0.000 n=8+9)
Sysbench/KV/1node_remote/oltp_read_write-10       1.77ms ± 5%    1.38ms ± 1%  -22.09%  (p=0.000 n=10+8)
Sysbench/KV/1node_remote/oltp_write_only-10        688µs ± 4%     557µs ± 1%  -19.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_point_select-10     181µs ± 8%     159µs ± 2%  -12.10%  (p=0.000 n=8+9)
Sysbench/SQL/1node_remote/oltp_write_only-10      2.16ms ± 4%    1.92ms ± 3%  -11.08%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_write-10      5.89ms ± 2%    5.36ms ± 1%   -8.89%  (p=0.000 n=9+9)

name                                            old alloc/op   new alloc/op   delta
Sysbench/KV/1node_remote/oltp_point_select-10     16.3kB ± 0%     6.4kB ± 0%  -60.70%  (p=0.000 n=8+10)
Sysbench/KV/1node_remote/oltp_write_only-10        359kB ± 1%     256kB ± 1%  -28.92%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_write_only-10       748kB ± 0%     548kB ± 1%  -26.78%  (p=0.000 n=8+10)
Sysbench/SQL/1node_remote/oltp_point_select-10    40.9kB ± 0%    30.8kB ± 0%  -24.74%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_write-10       1.11MB ± 1%    0.88MB ± 1%  -21.17%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_write-10      2.00MB ± 0%    1.65MB ± 0%  -17.60%  (p=0.000 n=9+10)
Sysbench/KV/1node_remote/oltp_read_only-10         790kB ± 0%     655kB ± 0%  -17.11%  (p=0.000 n=9+9)
Sysbench/SQL/1node_remote/oltp_read_only-10       1.33MB ± 0%    1.19MB ± 0%  -10.97%  (p=0.000 n=10+9)

name                                            old allocs/op  new allocs/op  delta
Sysbench/KV/1node_remote/oltp_point_select-10        210 ± 0%        61 ± 0%  -70.95%  (p=0.000 n=10+10)
Sysbench/KV/1node_remote/oltp_read_only-10         3.98k ± 0%     1.88k ± 0%  -52.68%  (p=0.019 n=6+8)
Sysbench/KV/1node_remote/oltp_read_write-10        7.10k ± 0%     3.47k ± 0%  -51.07%  (p=0.000 n=10+9)
Sysbench/KV/1node_remote/oltp_write_only-10        3.10k ± 0%     1.58k ± 0%  -48.89%  (p=0.000 n=10+9)
Sysbench/SQL/1node_remote/oltp_write_only-10       6.73k ± 0%     3.82k ± 0%  -43.30%  (p=0.000 n=10+10)
Sysbench/SQL/1node_remote/oltp_read_write-10       14.4k ± 0%      9.2k ± 0%  -36.29%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_point_select-10       429 ± 0%       277 ± 0%  -35.46%  (p=0.000 n=9+10)
Sysbench/SQL/1node_remote/oltp_read_only-10        7.52k ± 0%     5.37k ± 0%  -28.60%  (p=0.000 n=10+10)
```

### Roachtests:
```
name                                            old queries/s  new queries/s  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64     17.6k ± 7%     19.2k ± 2%  +9.22%  (p=0.008 n=5+5)

name                                            old avg_ms/op  new avg_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64      72.9 ± 7%      66.6 ± 2%  -8.57%  (p=0.008 n=5+5)

name                                            old p95_ms/op  new p95_ms/op  delta
sysbench/oltp_read_write/nodes=3/cpu=8/conc=64       116 ± 8%       106 ± 3%  -9.02%  (p=0.016 n=5+5)
```

### Manual tests:
Running in a similar configuration to `sysbench/oltp_read_write/nodes=3/cpu=8/conc=64`, but with a benchmarking related cluster settings (before and after) to reduce variance.
```
-- Before
Mean: 19771.03
Median: 19714.22
Standard Deviation: 282.96
Coefficient of variance: .0143

-- After
Mean: 21908.23
Median: 21923.03
Standard Deviation: 200.88
Coefficient of variance: .0091
```

----

Release note (performance improvement): gRPC streams are now pooled across unary intra-cluster RPCs, allowing for reuse of gRPC resources to reduce the cost of remote key-value layer access. This pooling can be disabled using the `rpc.batch_stream_pool.enabled` cluster setting.

137019: roachtest: increase the token return time with disk bandwidth limit r=kvoli a=andrewbaptist

Previously the test would wait 10m for tokens to be returned. Without the disk bandwidth limit set, they typically return almost immediately but with a limit they can take ~30m to return in some cases even after the workload is stopped and the system is idle. This change fixes some of the perturbation/metamorphic/* tests that are hitting this slow token return.

Epic: none
Fixes: #136982
Fixes: #136553
Informs: #137017

Release note: None

137044: kvserver: deflake TestConsistencyQueueRecomputeStats r=miraradeva a=miraradeva

The test manually adds voters and expects a leaseholder to be established before forcing a stats re-computation (which runs on the leaseholder). With leader leases, it might take an extra election timeout for the leader lease to be established after adding the new voters, so the test flaked if the re-computation ran (and failed) before the leaseholder was ready.

This commit makes the test retry the re-computation until a leasholder is established.

Fixes: #136596

Release note: None

137059: catalog/lease: deflake TestDescriptorRefreshOnRetry r=rafiss a=rafiss

The test was flaky since the background thread to refresh leases could run and cause the acquisition counts to be off.

fixes #137033
Release note: None

137099: kvcoord: deflake TestDistSenderReplicaStall r=miraradeva a=miraradeva

The test runs with expiration leases but when fortification is enabled the lease doesn't move off of the stalled replica because the deadlocked leader doesn't step down while it's receiving store liveness support.

This commit ensures fortification is off when expiration leases are used for the test.

Fixes: #136564

Release note: None

137118: crosscluster/logical: update udf test to expect at-least-once r=dt a=dt

We don't provide exactly once so we don't want to test for it.

Release note: none.
Epic: none.

Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Mira Radeva <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: David Taylor <[email protected]>
@craig craig bot closed this as completed in #137059 Dec 10, 2024
Copy link

blathers-crl bot commented Dec 10, 2024

Based on the specified backports for linked PR #137059, I applied the following new label(s) to this issue: branch-release-23.1, branch-release-24.1, branch-release-24.2, branch-release-24.3. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 labels Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants