
pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test: TestWatcherReceivesNoopCheckpoints failed #99092

Closed
cockroach-teamcity opened this issue Mar 21, 2023 · 7 comments
@cockroach-teamcity

cockroach-teamcity commented Mar 21, 2023

pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test.TestWatcherReceivesNoopCheckpoints failed with artifacts on master @ b89fa2cc4bc1fb9447ab1009b34b6f354f9618f0:

E230321 06:01:45.483610 3563854 jobs/adopt.go:225  [T1,n1] 1443  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.513442 3569529 jobs/adopt.go:261  [T1,n1] 1444  job 102: resuming execution
E230321 06:01:45.553120 3569529 jobs/adopt.go:225  [T1,n1] 1445  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.593657 3574522 jobs/adopt.go:261  [T1,n1] 1446  job 102: resuming execution
E230321 06:01:45.636754 3574522 jobs/adopt.go:225  [T1,n1] 1447  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.690172 3580391 jobs/adopt.go:261  [T1,n1] 1448  job 102: resuming execution
E230321 06:01:45.718916 3580391 jobs/adopt.go:225  [T1,n1] 1449  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.758229 3584758 jobs/adopt.go:261  [T1,n1] 1450  job 102: resuming execution
E230321 06:01:45.838108 3584758 jobs/adopt.go:225  [T1,n1] 1451  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.894760 3590274 jobs/adopt.go:261  [T1,n1] 1452  job 102: resuming execution
E230321 06:01:45.931624 3590274 jobs/adopt.go:225  [T1,n1] 1453  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.965331 3594968 jobs/adopt.go:261  [T1,n1] 1454  job 102: resuming execution
E230321 06:01:45.990390 3594968 jobs/adopt.go:225  [T1,n1] 1455  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.019786 3598906 jobs/adopt.go:261  [T1,n1] 1456  job 102: resuming execution
E230321 06:01:46.045357 3598906 jobs/adopt.go:225  [T1,n1] 1457  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.078729 3603730 jobs/adopt.go:261  [T1,n1] 1458  job 102: resuming execution
E230321 06:01:46.120591 3603730 jobs/adopt.go:225  [T1,n1] 1459  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.170908 3609978 jobs/adopt.go:261  [T1,n1] 1460  job 102: resuming execution
E230321 06:01:46.215022 3609978 jobs/adopt.go:225  [T1,n1] 1461  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.261110 3615655 jobs/adopt.go:261  [T1,n1] 1462  job 102: resuming execution
I230321 06:01:46.289962 3616574 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r44/1:/Table/4{2-3}] 1463  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
E230321 06:01:46.306735 3615655 jobs/adopt.go:225  [T1,n1] 1464  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.334713 3620936 jobs/adopt.go:261  [T1,n1] 1465  job 102: resuming execution
E230321 06:01:46.362985 3620936 jobs/adopt.go:225  [T1,n1] 1466  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.394120 3625687 jobs/adopt.go:261  [T1,n1] 1467  job 102: resuming execution
E230321 06:01:46.426284 3625687 jobs/adopt.go:225  [T1,n1] 1468  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.461944 3630178 jobs/adopt.go:261  [T1,n1] 1469  job 102: resuming execution
E230321 06:01:46.502867 3630178 jobs/adopt.go:225  [T1,n1] 1470  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.541098 3635408 jobs/adopt.go:261  [T1,n1] 1471  job 102: resuming execution
    sqlwatcher_test.go:608: 
        	Error Trace:	pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test/pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/sqlwatcher_test.go:608
        	            				GOROOT/src/runtime/asm_amd64.s:1594
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestWatcherReceivesNoopCheckpoints
I230321 06:01:46.548934 3061498 testutils/testcluster/testcluster.go:144  [-] 1472  TestCluster quiescing nodes
I230321 06:01:46.550552 3084328 sql/stats/automatic_stats.go:507  [T1,n1] 1473  quiescing auto stats refresher
I230321 06:01:46.551057 3418647 jobs/registry.go:1494  [T1,n1] 1474  KEY VISUALIZER job 100: stepping through state succeeded
I230321 06:01:46.551146 3066200 server/start_listen.go:103  [T1,n1] 1475  server shutting down: instructing cmux to stop accepting
I230321 06:01:46.553276 3177803 rpc/context.go:2257  [T1,n1,rnode=1,raddr=127.0.0.1:36559,class=default,rpc] 1476  connection heartbeat loop ended with err: <nil>
W230321 06:01:46.550775 3080693 sql/sqlliveness/slinstance/slinstance.go:312  [n1] 1477  exiting heartbeat loop
W230321 06:01:46.560621 3080693 sql/sqlliveness/slinstance/slinstance.go:297  [n1] 1478  exiting heartbeat loop
E230321 06:01:46.560873 3080693 server/server_sql.go:493  [n1] 1479  failed to run update of instance with new session ID: node unavailable; try another peer
E230321 06:01:46.561982 3418647 jobs/adopt.go:472  [T1,n1] 1480  job 100: adoption completed with error job 100: could not mark as succeeded: job 100: select-job: failed to send RPC: sending to all replicas failed; last error: ba: Scan [/Table/15/1/100,/Table/15/1/101), [txn: 17df875c], [can-forward-ts] RPC error: node unavailable; try another peer
E230321 06:01:46.562100 3418647 jobs/registry.go:924  [T1,n1] 1481  error getting live session: node unavailable; try another peer
E230321 06:01:46.566385 3084431 jobs/registry.go:968  [T1,n1] 1482  error expiring job sessions: expire-sessions: failed to read query result: query execution canceled
E230321 06:01:46.566670 3084431 jobs/registry.go:976  [T1,n1] 1483  failed to serve pause and cancel requests: txn exec: context canceled
W230321 06:01:46.566796 3084431 jobs/registry.go:1052  [T1,n1] 1484  canceling all adopted jobs due to stopper quiescing
--- FAIL: TestWatcherReceivesNoopCheckpoints (6.57s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-25698

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Mar 21, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 21, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 21, 2023
@cockroach-teamcity
Member Author

pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test.TestWatcherReceivesNoopCheckpoints failed with artifacts on master @ 78bb7696847e1d8abf5d6a199c6dd3414a568691:

I230322 06:17:26.907896 3588863 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r5/1:/{Systemtse-Table/0}] 1162  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230322 06:17:26.984052 3093361 3@pebble/event.go:717  [n1,s1,pebble] 1163  [JOB 7] WAL created 000006
I230322 06:17:26.988110 3085677 1@server/server_sql.go:1668  [T1,n1] 1164  serving sql connections
I230322 06:17:26.989453 3085677 testutils/testcluster/testcluster.go:1348  [-] 1165  WaitForFullReplication
I230322 06:17:26.989565 3085677 testutils/testcluster/testcluster.go:1352  [-] 1166  WaitForFullReplication took: 352ns
I230322 06:17:27.020993 3598358 util/log/event_log.go:32  [T1,n1,client=127.0.0.1:41414,user=root] 1167 ={"Timestamp":1679465847006209505,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.rangefeed.enabled\" = true","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"kv.rangefeed.enabled","Value":"true"}
I230322 06:17:27.043464 3598358 util/log/event_log.go:32  [T1,n1,client=127.0.0.1:41414,user=root] 1168 ={"Timestamp":1679465847025504392,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.closed_timestamp.target_duration\" = '100ms'","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"kv.closed_timestamp.target_duration","Value":"00:00:00.1"}
I230322 06:17:27.072942 3094373 kv/kvserver/replica_rangefeed.go:703  [T1,n1,s1,r10/1:/Table/{6-7}] 1169  RangeFeed closed timestamp 1679465844.037201996,0 is behind by 3.035721539s
I230322 06:17:27.135766 3598358 5@util/log/event_log.go:32  [T1,n1,client=127.0.0.1:41414,user=root] 1170 ={"Timestamp":1679465847049452353,"EventType":"create_table","Statement":"CREATE TABLE defaultdb.public.t ()","Tag":"CREATE TABLE","User":"root","DescriptorID":104,"TableName":"defaultdb.public.t"}
I230322 06:17:27.210178 3616925 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 1171  established range feed over system.descriptors starting at time 1679465847.048748594,0
I230322 06:17:27.210345 3616925 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 1172  established range feed over system.zones starting at time 1679465847.048748594,0
I230322 06:17:27.215104 3616925 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 1173  established range feed over system.protected_ts_records starting at time 1679465847.048748594,0
I230322 06:17:27.229898 3617358 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r34/1:/Table/3{2-3}] 1174  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
    sqlwatcher_test.go:608: 
        	Error Trace:	pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test/pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/sqlwatcher_test.go:608
        	            				GOROOT/src/runtime/asm_amd64.s:1594
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestWatcherReceivesNoopCheckpoints
I230322 06:17:27.799336 3085677 testutils/testcluster/testcluster.go:144  [-] 1175  TestCluster quiescing nodes
W230322 06:17:27.799706 3611728 kv/kvserver/intentresolver/intent_resolver.go:826  [-] 1176  failed to gc transaction record: could not GC completed transaction anchored at /Table/3/1/104/2/1: node unavailable; try another peer
I230322 06:17:27.799979 3453599 jobs/registry.go:1494  [T1,n1] 1178  KEY VISUALIZER job 100: stepping through state succeeded
I230322 06:17:27.800376 3090942 server/start_listen.go:103  [T1,n1] 1179  server shutting down: instructing cmux to stop accepting
I230322 06:17:27.804312 3110969 sql/stats/automatic_stats.go:507  [T1,n1] 1180  quiescing auto stats refresher
I230322 06:17:27.805802 3190739 rpc/context.go:2245  [T1,n1,rnode=1,raddr=127.0.0.1:44013,class=default,rpc] 1181  connection heartbeat loop ended with err: <nil>
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177  job 102: running execution encountered retriable error: non-cancelable: context canceled
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +(1) attached stack trace
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  -- stack trace:
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1572
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:467
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func2
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:385
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | runtime.goexit
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	GOROOT/src/runtime/asm_amd64.s:1594
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +Wraps: (2) non-cancelable
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +Wraps: (3) context canceled
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString
E230322 06:17:27.806583 3562247 jobs/adopt.go:472  [T1,n1] 1182  job 102: adoption completed with error non-cancelable: context canceled
W230322 06:17:27.806855 3106797 sql/sqlliveness/slinstance/slinstance.go:312  [n1] 1183  exiting heartbeat loop
W230322 06:17:27.806919 3106797 sql/sqlliveness/slinstance/slinstance.go:297  [n1] 1184  exiting heartbeat loop
E230322 06:17:27.806966 3106797 server/server_sql.go:503  [n1] 1185  failed to run update of instance with new session ID: node unavailable; try another peer
E230322 06:17:27.809480 3111275 jobs/registry.go:976  [T1,n1] 1186  failed to serve pause and cancel requests: could not query jobs table: cancel/pause-requested: failed to read query result: query execution canceled
W230322 06:17:27.809604 3111275 jobs/registry.go:1052  [T1,n1] 1187  canceling all adopted jobs due to stopper quiescing
E230322 06:17:27.809765 3510324 jobs/registry.go:924  [T1,n1] 1188  error getting live session: node unavailable; try another peer
--- FAIL: TestWatcherReceivesNoopCheckpoints (6.29s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@irfansharif irfansharif added GA-blocker branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Mar 22, 2023
@arulajmani
Collaborator

arulajmani commented Mar 27, 2023

There's not much to go on from the logs. I tried reproducing this on my GCE worker, but no luck:

16720 runs so far, 0 failures, over 45m35s
16751 runs so far, 0 failures, over 45m40s
16782 runs so far, 0 failures, over 45m45s
16816 runs so far, 0 failures, over 45m50s
16850 runs so far, 0 failures, over 45m55s
16880 runs so far, 0 failures, over 46m0s
16912 runs so far, 0 failures, over 46m5s
16945 runs so far, 0 failures, over 46m10s

For now, I'm removing the GA-blocker label. I'll keep the issue open and assigned to myself to see if it fails again.

@arulajmani
Collaborator

Update -- I tried running the test with the build tags from these two failures (TAGS=bazel,gss,deadlock), and even then, no luck:

11758 runs so far, 0 failures, over 1h40m10s
11771 runs so far, 0 failures, over 1h40m15s
11779 runs so far, 0 failures, over 1h40m20s
11790 runs so far, 0 failures, over 1h40m25s
11801 runs so far, 0 failures, over 1h40m30s
11809 runs so far, 0 failures, over 1h40m35s
11821 runs so far, 0 failures, over 1h40m40s

@irfansharif
Contributor

I'm going to re-add the GA-blocker label here given the discussion in #99214. It doesn't seem to be a spanconfig bug, but a more severe one: rangefeeds possibly not emitting updates, which span configs rely on for correctness.
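
As an illustration of that dependency, below is a minimal, hypothetical Go sketch (the `Checkpoint` type, `watcher` struct, and handler names are invented for illustration and are not the actual SQLWatcher code): the watcher simply counts checkpoint notifications delivered by a rangefeed-like source, and if the rangefeed silently stops emitting events, the count stays at zero -- the same `expected: 1, actual: 0` shape as the assertion failure above.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Checkpoint is a stand-in for a rangefeed checkpoint event (hypothetical type).
type Checkpoint struct {
	ResolvedTS time.Time
}

// watcher counts the checkpoint notifications it receives.
type watcher struct {
	checkpoints atomic.Int64
}

// onCheckpoint is the handler a rangefeed would invoke for every checkpoint,
// including no-op checkpoints that only advance the resolved timestamp.
func (w *watcher) onCheckpoint(c Checkpoint) {
	w.checkpoints.Add(1)
}

func main() {
	w := &watcher{}

	// Simulated event source. In the failing test the real rangefeed apparently
	// delivered no checkpoint within the timeout, i.e. this channel would stay empty.
	events := make(chan Checkpoint, 1)
	events <- Checkpoint{ResolvedTS: time.Now()}
	close(events)

	for c := range events {
		w.onCheckpoint(c)
	}

	// The test's assertion is presumably of this shape: at least one checkpoint
	// must have been observed; a silent rangefeed yields 0 and the test fails.
	if got := w.checkpoints.Load(); got < 1 {
		fmt.Printf("FAIL: expected at least 1 checkpoint, got %d\n", got)
	} else {
		fmt.Printf("OK: observed %d checkpoint(s)\n", got)
	}
}
```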

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 30, 2023
Make sure mux rangefeed uses correct context if it needs
to restart range feeds.

Fix catchup reservation metric accounting in mux rangefeed.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 31, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 3, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
craig bot pushed a commit that referenced this issue Apr 4, 2023
100189:  kvcoord: Restart ranges on a dedicated goroutine. r=miretskiy a=miretskiy

Restart ranges on a dedicated goroutine (if needed).
Fix logic bug in stuck range handling.
Increase verbosity of logging to help debug mux rangefeed issues.

Informs #99560
Informs #99640
Informs #99214
Informs #98925
Informs #99092
Informs #99212
Informs #99910
Informs #99560

Release note: None

100525: rpc: Handle closed error r=erikgrinaker a=andrewbaptist

We close the listener before closing the connection. This can result in a spurious failure due to the Listener also closing our connection.

Epic: none
Fixes: #100391
Fixes: #77754
Informs: #80034

Release note: None

100528: sql: fix flaky TestSQLStatsCompactor r=j82w a=j82w

The test failure is showing more total wide scans
than expected. Change the compact stats job to run
once a year to avoid it running at the same time
as the test.

The interceptor is disabled right after the delete,
reducing the possibility of another operation
causing a conflict.

Epic: none
closes: #99653

Release note: none

100589: allocator: deflake full disk test r=andrewbaptist a=kvoli

In #97409 we introduced cluster settings to control the disk fullness threshold for rebalancing towards a store and shedding replicas off of the store. `TestAllocatorFullDisks` assumes the total number of range bytes is equal to or less than the rebalance threshold of the nodes; however, the test was updated to use the shed threshold instead. This caused the test to flake occasionally, as there was more than the expected amount of total range bytes.

This patch changes the ranges per node calculation to use the rebalance threshold again, instead of the shed threshold.

```
dev test pkg/kv/kvserver/allocator/allocatorimpl -f TestAllocatorFullDisks -v --stress
...
15714 runs so far, 0 failures, over 39m45s
```

Fixes: #100033

Release note: None

100610: roachtest: set config.Quiet to true r=herkolategan a=srosenberg

After refactoring in [1], the default of config.Quiet was set to false since the roachprod CLI option is intended to set it to true. This resulted in an unwanted side-effect, namely roachtests running with the new default. Consequently, test_runner's log ended up with a bunch of (terminal) escape codes due to the (status) spinner.

This change ensures roachtest explicitly sets config.Quiet to true.

[1] #99133

Epic: none

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 7, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 9, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 12, 2023
craig bot pushed a commit that referenced this issue Apr 13, 2023
100649: kvcoord: Rework error propagation in mux rangefeed r=miretskiy a=miretskiy

Prior to this change, there were cases where a future used to wait for a single range feed completion could be completed multiple times, or a message about range feed termination could be sent multiple times on a single mux rangefeed stream.

One of those cases was a check for `ensureClosedTimestampStarted`. If this method returned an error, we would immediately send the error on the rpc stream, and then complete the future with a nil error.

Another instance was when the registry would `DisconnectWithErr` -- in that case, we would first complete the future in this method and then complete it again later.

It appears that completing a future multiple times should be okay; however, it is still a bit worrisome. The deadlocks observed were all in the local RPC bypass (`rpc/context.go`), and it's not a stretch to imagine that as soon as the first error (e.g. from ensureClosedTimestampStarted) is returned, the goroutine reading these messages terminates, causing the subsequent attempt to send the error to deadlock.

Another hypothetical issue is how the mux rangefeed sent the error when the future completed. Prior to this change, this happened inline (via the `WhenReady` closure). This is dangerous since the closure may run while important locks (such as the raft mu) are being held. What could happen is that the mux rangefeed encounters a retryable error. The future is prepared with the error value, which causes an error to be sent to the client. This happens with some lock being held. The client notices this error and attempts to restart the rangefeed -- to the same server, and that could block, at least in theory. Regardless, performing IO while locks could potentially be held is not a good idea. This PR fixes this problem by shunting the logical rangefeed completion notification to a dedicated goroutine.

Informs #99560
Informs #99640
Informs #99214
Informs #98925
Informs #99092
Informs #99212
Informs #99910
Informs #99560

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
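
The fix described above boils down to a hand-off pattern: rather than sending the completion/termination message inline from whatever goroutine completes the future (which may be holding locks such as the raft mu), the notification is queued to a dedicated goroutine that owns the potentially blocking send. Below is a generic, hypothetical sketch of that pattern; the `muxStream`, `completion`, and `notifyLoop` names are invented and this is not the actual kvcoord implementation.

```go
package main

import "fmt"

// completion is a stand-in for a rangefeed-termination notification
// (hypothetical type; the real message carries a range ID and an error).
type completion struct {
	rangeID int64
	err     error
}

// muxStream is a toy stand-in for a mux rangefeed server stream.
type muxStream struct {
	completions chan completion
	done        chan struct{}
}

func newMuxStream() *muxStream {
	s := &muxStream{
		completions: make(chan completion, 16),
		done:        make(chan struct{}),
	}
	go s.notifyLoop() // dedicated goroutine owns all potentially blocking sends
	return s
}

// notifyLoop drains completion notifications and performs the (potentially
// blocking) send to the client without holding any of the caller's locks.
func (s *muxStream) notifyLoop() {
	defer close(s.done)
	for c := range s.completions {
		// In the real system this would be a Send on the RPC stream; printing
		// stands in for that I/O here.
		fmt.Printf("range r%d rangefeed completed: %v\n", c.rangeID, c.err)
	}
}

// onRangefeedDone is what a WhenReady-style closure would call. It only
// enqueues the notification for the dedicated goroutine; a real implementation
// would additionally need to guarantee the enqueue itself cannot block.
func (s *muxStream) onRangefeedDone(rangeID int64, err error) {
	s.completions <- completion{rangeID: rangeID, err: err}
}

func main() {
	s := newMuxStream()
	s.onRangefeedDone(42, nil)
	s.onRangefeedDone(7, fmt.Errorf("retryable: closed timestamp not started"))
	close(s.completions)
	<-s.done
}
```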
blathers-crl bot pushed a commit that referenced this issue Apr 13, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 13, 2023
@miretskiy
Contributor

@arulajmani this issue hasn't triggered for 2 weeks on the master. Did you disable metamorphic tests?
If so, I would love to see if this issue still reproduces with the recent muxrf changes.

@arulajmani
Collaborator

this issue hasn't triggered for 2 weeks on the master. Did you disable metamorphic tests?

No, we didn't disable metamorphic tests. 2 weeks is a good sign, but FWIW, this wasn't reproducing very readily.

@github-actions

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.
