
pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test: TestWatcherReceivesNoopCheckpoints failed #99092

Closed
cockroach-teamcity opened this issue Mar 21, 2023 · 7 comments
@cockroach-teamcity

cockroach-teamcity commented Mar 21, 2023

pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test.TestWatcherReceivesNoopCheckpoints failed with artifacts on master @ b89fa2cc4bc1fb9447ab1009b34b6f354f9618f0:

E230321 06:01:45.483610 3563854 jobs/adopt.go:225  [T1,n1] 1443  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.513442 3569529 jobs/adopt.go:261  [T1,n1] 1444  job 102: resuming execution
E230321 06:01:45.553120 3569529 jobs/adopt.go:225  [T1,n1] 1445  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.593657 3574522 jobs/adopt.go:261  [T1,n1] 1446  job 102: resuming execution
E230321 06:01:45.636754 3574522 jobs/adopt.go:225  [T1,n1] 1447  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.690172 3580391 jobs/adopt.go:261  [T1,n1] 1448  job 102: resuming execution
E230321 06:01:45.718916 3580391 jobs/adopt.go:225  [T1,n1] 1449  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.758229 3584758 jobs/adopt.go:261  [T1,n1] 1450  job 102: resuming execution
E230321 06:01:45.838108 3584758 jobs/adopt.go:225  [T1,n1] 1451  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.894760 3590274 jobs/adopt.go:261  [T1,n1] 1452  job 102: resuming execution
E230321 06:01:45.931624 3590274 jobs/adopt.go:225  [T1,n1] 1453  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:45.965331 3594968 jobs/adopt.go:261  [T1,n1] 1454  job 102: resuming execution
E230321 06:01:45.990390 3594968 jobs/adopt.go:225  [T1,n1] 1455  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.019786 3598906 jobs/adopt.go:261  [T1,n1] 1456  job 102: resuming execution
E230321 06:01:46.045357 3598906 jobs/adopt.go:225  [T1,n1] 1457  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.078729 3603730 jobs/adopt.go:261  [T1,n1] 1458  job 102: resuming execution
E230321 06:01:46.120591 3603730 jobs/adopt.go:225  [T1,n1] 1459  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.170908 3609978 jobs/adopt.go:261  [T1,n1] 1460  job 102: resuming execution
E230321 06:01:46.215022 3609978 jobs/adopt.go:225  [T1,n1] 1461  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.261110 3615655 jobs/adopt.go:261  [T1,n1] 1462  job 102: resuming execution
I230321 06:01:46.289962 3616574 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r44/1:/Table/4{2-3}] 1463  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
E230321 06:01:46.306735 3615655 jobs/adopt.go:225  [T1,n1] 1464  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.334713 3620936 jobs/adopt.go:261  [T1,n1] 1465  job 102: resuming execution
E230321 06:01:46.362985 3620936 jobs/adopt.go:225  [T1,n1] 1466  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.394120 3625687 jobs/adopt.go:261  [T1,n1] 1467  job 102: resuming execution
E230321 06:01:46.426284 3625687 jobs/adopt.go:225  [T1,n1] 1468  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.461944 3630178 jobs/adopt.go:261  [T1,n1] 1469  job 102: resuming execution
E230321 06:01:46.502867 3630178 jobs/adopt.go:225  [T1,n1] 1470  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230321 06:01:46.541098 3635408 jobs/adopt.go:261  [T1,n1] 1471  job 102: resuming execution
    sqlwatcher_test.go:608: 
        	Error Trace:	pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test/pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/sqlwatcher_test.go:608
        	            				GOROOT/src/runtime/asm_amd64.s:1594
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestWatcherReceivesNoopCheckpoints
I230321 06:01:46.548934 3061498 testutils/testcluster/testcluster.go:144  [-] 1472  TestCluster quiescing nodes
I230321 06:01:46.550552 3084328 sql/stats/automatic_stats.go:507  [T1,n1] 1473  quiescing auto stats refresher
I230321 06:01:46.551057 3418647 jobs/registry.go:1494  [T1,n1] 1474  KEY VISUALIZER job 100: stepping through state succeeded
I230321 06:01:46.551146 3066200 server/start_listen.go:103  [T1,n1] 1475  server shutting down: instructing cmux to stop accepting
I230321 06:01:46.553276 3177803 rpc/context.go:2257  [T1,n1,rnode=1,raddr=127.0.0.1:36559,class=default,rpc] 1476  connection heartbeat loop ended with err: <nil>
W230321 06:01:46.550775 3080693 sql/sqlliveness/slinstance/slinstance.go:312  [n1] 1477  exiting heartbeat loop
W230321 06:01:46.560621 3080693 sql/sqlliveness/slinstance/slinstance.go:297  [n1] 1478  exiting heartbeat loop
E230321 06:01:46.560873 3080693 server/server_sql.go:493  [n1] 1479  failed to run update of instance with new session ID: node unavailable; try another peer
E230321 06:01:46.561982 3418647 jobs/adopt.go:472  [T1,n1] 1480  job 100: adoption completed with error job 100: could not mark as succeeded: job 100: select-job: failed to send RPC: sending to all replicas failed; last error: ba: Scan [/Table/15/1/100,/Table/15/1/101), [txn: 17df875c], [can-forward-ts] RPC error: node unavailable; try another peer
E230321 06:01:46.562100 3418647 jobs/registry.go:924  [T1,n1] 1481  error getting live session: node unavailable; try another peer
E230321 06:01:46.566385 3084431 jobs/registry.go:968  [T1,n1] 1482  error expiring job sessions: expire-sessions: failed to read query result: query execution canceled
E230321 06:01:46.566670 3084431 jobs/registry.go:976  [T1,n1] 1483  failed to serve pause and cancel requests: txn exec: context canceled
W230321 06:01:46.566796 3084431 jobs/registry.go:1052  [T1,n1] 1484  canceling all adopted jobs due to stopper quiescing
--- FAIL: TestWatcherReceivesNoopCheckpoints (6.57s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-25698

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Mar 21, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 21, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 21, 2023
@cockroach-teamcity
Member Author

pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test.TestWatcherReceivesNoopCheckpoints failed with artifacts on master @ 78bb7696847e1d8abf5d6a199c6dd3414a568691:

I230322 06:17:26.907896 3588863 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r5/1:/{Systemtse-Table/0}] 1162  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230322 06:17:26.984052 3093361 3@pebble/event.go:717  [n1,s1,pebble] 1163  [JOB 7] WAL created 000006
I230322 06:17:26.988110 3085677 1@server/server_sql.go:1668  [T1,n1] 1164  serving sql connections
I230322 06:17:26.989453 3085677 testutils/testcluster/testcluster.go:1348  [-] 1165  WaitForFullReplication
I230322 06:17:26.989565 3085677 testutils/testcluster/testcluster.go:1352  [-] 1166  WaitForFullReplication took: 352ns
I230322 06:17:27.020993 3598358 util/log/event_log.go:32  [T1,n1,client=127.0.0.1:41414,user=root] 1167 ={"Timestamp":1679465847006209505,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.rangefeed.enabled\" = true","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"kv.rangefeed.enabled","Value":"true"}
I230322 06:17:27.043464 3598358 util/log/event_log.go:32  [T1,n1,client=127.0.0.1:41414,user=root] 1168 ={"Timestamp":1679465847025504392,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.closed_timestamp.target_duration\" = '100ms'","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"kv.closed_timestamp.target_duration","Value":"00:00:00.1"}
I230322 06:17:27.072942 3094373 kv/kvserver/replica_rangefeed.go:703  [T1,n1,s1,r10/1:/Table/{6-7}] 1169  RangeFeed closed timestamp 1679465844.037201996,0 is behind by 3.035721539s
I230322 06:17:27.135766 3598358 5@util/log/event_log.go:32  [T1,n1,client=127.0.0.1:41414,user=root] 1170 ={"Timestamp":1679465847049452353,"EventType":"create_table","Statement":"CREATE TABLE defaultdb.public.t ()","Tag":"CREATE TABLE","User":"root","DescriptorID":104,"TableName":"defaultdb.public.t"}
I230322 06:17:27.210178 3616925 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 1171  established range feed over system.descriptors starting at time 1679465847.048748594,0
I230322 06:17:27.210345 3616925 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 1172  established range feed over system.zones starting at time 1679465847.048748594,0
I230322 06:17:27.215104 3616925 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 1173  established range feed over system.protected_ts_records starting at time 1679465847.048748594,0
I230322 06:17:27.229898 3617358 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r34/1:/Table/3{2-3}] 1174  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
    sqlwatcher_test.go:608: 
        	Error Trace:	pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test/pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/sqlwatcher_test.go:608
        	            				GOROOT/src/runtime/asm_amd64.s:1594
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestWatcherReceivesNoopCheckpoints
I230322 06:17:27.799336 3085677 testutils/testcluster/testcluster.go:144  [-] 1175  TestCluster quiescing nodes
W230322 06:17:27.799706 3611728 kv/kvserver/intentresolver/intent_resolver.go:826  [-] 1176  failed to gc transaction record: could not GC completed transaction anchored at /Table/3/1/104/2/1: node unavailable; try another peer
I230322 06:17:27.799979 3453599 jobs/registry.go:1494  [T1,n1] 1178  KEY VISUALIZER job 100: stepping through state succeeded
I230322 06:17:27.800376 3090942 server/start_listen.go:103  [T1,n1] 1179  server shutting down: instructing cmux to stop accepting
I230322 06:17:27.804312 3110969 sql/stats/automatic_stats.go:507  [T1,n1] 1180  quiescing auto stats refresher
I230322 06:17:27.805802 3190739 rpc/context.go:2245  [T1,n1,rnode=1,raddr=127.0.0.1:44013,class=default,rpc] 1181  connection heartbeat loop ended with err: <nil>
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177  job 102: running execution encountered retriable error: non-cancelable: context canceled
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +(1) attached stack trace
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  -- stack trace:
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1572
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:467
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func2
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:385
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | runtime.goexit
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +  | 	GOROOT/src/runtime/asm_amd64.s:1594
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +Wraps: (2) non-cancelable
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +Wraps: (3) context canceled
I230322 06:17:27.799884 3562247 jobs/registry.go:1574  [T1,n1] 1177 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString
E230322 06:17:27.806583 3562247 jobs/adopt.go:472  [T1,n1] 1182  job 102: adoption completed with error non-cancelable: context canceled
W230322 06:17:27.806855 3106797 sql/sqlliveness/slinstance/slinstance.go:312  [n1] 1183  exiting heartbeat loop
W230322 06:17:27.806919 3106797 sql/sqlliveness/slinstance/slinstance.go:297  [n1] 1184  exiting heartbeat loop
E230322 06:17:27.806966 3106797 server/server_sql.go:503  [n1] 1185  failed to run update of instance with new session ID: node unavailable; try another peer
E230322 06:17:27.809480 3111275 jobs/registry.go:976  [T1,n1] 1186  failed to serve pause and cancel requests: could not query jobs table: cancel/pause-requested: failed to read query result: query execution canceled
W230322 06:17:27.809604 3111275 jobs/registry.go:1052  [T1,n1] 1187  canceling all adopted jobs due to stopper quiescing
E230322 06:17:27.809765 3510324 jobs/registry.go:924  [T1,n1] 1188  error getting live session: node unavailable; try another peer
--- FAIL: TestWatcherReceivesNoopCheckpoints (6.29s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@irfansharif irfansharif added GA-blocker branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Mar 22, 2023
@arulajmani
Collaborator

arulajmani commented Mar 27, 2023

There's not much to go on from the logs. I tried reproducing this on my GCE worker, but no luck:

16720 runs so far, 0 failures, over 45m35s
16751 runs so far, 0 failures, over 45m40s
16782 runs so far, 0 failures, over 45m45s
16816 runs so far, 0 failures, over 45m50s
16850 runs so far, 0 failures, over 45m55s
16880 runs so far, 0 failures, over 46m0s
16912 runs so far, 0 failures, over 46m5s
16945 runs so far, 0 failures, over 46m10s

For now, I'm removing the GA-blocker label. I'll keep the issue open and assigned to myself to see if it fails again.

@arulajmani
Collaborator

Update -- I tried running the test with the build tags from these two failures (TAGS=bazel,gss,deadlock), and even then, no luck:

11758 runs so far, 0 failures, over 1h40m10s
11771 runs so far, 0 failures, over 1h40m15s
11779 runs so far, 0 failures, over 1h40m20s
11790 runs so far, 0 failures, over 1h40m25s
11801 runs so far, 0 failures, over 1h40m30s
11809 runs so far, 0 failures, over 1h40m35s
11821 runs so far, 0 failures, over 1h40m40s

@irfansharif
Contributor

I'm going to re-add the GA-blocker label here given the discussion in #99214. It doesn't seem to be a spanconfig bug, but a more severe one: rangefeeds possibly not emitting updates, which span configs rely on for correctness.
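
As an illustration of that dependency, below is a minimal, hypothetical Go sketch (the `Checkpoint` type, `watcher` struct, and handler names are invented for illustration and are not the actual SQLWatcher code): the watcher simply counts checkpoint notifications delivered by a rangefeed-like source, and if the rangefeed silently stops emitting events, the count stays at zero -- the same `expected: 1, actual: 0` shape as the assertion failure above.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Checkpoint is a stand-in for a rangefeed checkpoint event (hypothetical type).
type Checkpoint struct {
	ResolvedTS time.Time
}

// watcher counts the checkpoint notifications it receives.
type watcher struct {
	checkpoints atomic.Int64
}

// onCheckpoint is the handler a rangefeed would invoke for every checkpoint,
// including no-op checkpoints that only advance the resolved timestamp.
func (w *watcher) onCheckpoint(c Checkpoint) {
	w.checkpoints.Add(1)
}

func main() {
	w := &watcher{}

	// Simulated event source. In the failing test the real rangefeed apparently
	// delivered no checkpoint within the timeout, i.e. this channel would stay empty.
	events := make(chan Checkpoint, 1)
	events <- Checkpoint{ResolvedTS: time.Now()}
	close(events)

	for c := range events {
		w.onCheckpoint(c)
	}

	// The test's assertion is presumably of this shape: at least one checkpoint
	// must have been observed; a silent rangefeed yields 0 and the test fails.
	if got := w.checkpoints.Load(); got < 1 {
		fmt.Printf("FAIL: expected at least 1 checkpoint, got %d\n", got)
	} else {
		fmt.Printf("OK: observed %d checkpoint(s)\n", got)
	}
}
```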

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 30, 2023
Make sure mux rangefeed uses correct context if it needs
to restart range feeds.

Fix catchup reservation metric accounting in mux rangefeed.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 31, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 3, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
craig bot pushed a commit that referenced this issue Apr 4, 2023
100189:  kvcoord: Restart ranges on a dedicated goroutine. r=miretskiy a=miretskiy

Restart ranges on a dedicated goroutine (if needed).
Fix logic bug in stuck range handling.
Increase verbosity of logging to help debug mux rangefeed issues.

Informs #99560
Informs #99640
Informs #99214
Informs #98925
Informs #99092
Informs #99212
Informs #99910
Informs #99560

Release note: None

100525: rpc: Handle closed error r=erikgrinaker a=andrewbaptist

We close the listener before closing the connection. This can result in a spurious failure due to the Listener also closing our connection.

Epic: none
Fixes: #100391
Fixes: #77754
Informs: #80034

Release note: None

100528: sql: fix flaky TestSQLStatsCompactor r=j82w a=j82w

The test failure is showing more total wide scans
than expected. Change the compact stats job to run
once a year to avoid it running at the same time
as the test.

The interceptor is disabled right after the delete,
reducing the possibility of another operation
causing a conflict.

Epic: none
closes: #99653

Release note: none

100589: allocator: deflake full disk test r=andrewbaptist a=kvoli

In #97409 we introduced cluster settings to control the disk fullness threshold for rebalancing towards a store and shedding replicas off of the store. `TestAllocatorFullDisks` assumes the total number of range bytes is equal to or less than the rebalance threshold of the nodes; however, the test was updated to use the shed threshold instead. This caused the test to flake occasionally, as there was more than the expected amount of total range bytes.

This patch changes the ranges per node calculation to use the rebalance threshold again, instead of the shed threshold.

```
dev test pkg/kv/kvserver/allocator/allocatorimpl -f TestAllocatorFullDisks -v --stress
...
15714 runs so far, 0 failures, over 39m45s
```

Fixes: #100033

Release note: None

100610: roachtest: set config.Quiet to true r=herkolategan a=srosenberg

After refactoring in [1], the default of config.Quiet was set to false since the roachprod CLI option is intended to set it to true. This resulted in an unwanted side-effect, namely roachtests running with the new default. Consequently, test_runner's log ended up with a bunch of (terminal) escape codes due to the (status) spinner.

This change ensures roachtest explicitly sets config.Quiet to true.

[1] #99133

Epic: none

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 7, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 9, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 12, 2023
craig bot pushed a commit that referenced this issue Apr 13, 2023
100649: kvcoord: Rework error propagation in mux rangefeed r=miretskiy a=miretskiy

Prior to this change, there were cases where a future used to wait for a single range feed completion could be completed multiple times, or a message about range feed termination could be sent multiple times on a single mux rangefeed stream.

One of those cases was a check for `ensureClosedTimestampStarted`. If this method returned an error, we would immediately send the error on the rpc stream, and then complete the future with a nil error.

Another instance was when the registry would `DisconnectWithErr` -- in that case, we would first complete the future in this method and then complete it again later.

It appears that completing a future multiple times should be okay; however, it is still a bit worrisome. The deadlocks observed were all in the local RPC bypass (`rpc/context.go`), and it's not a stretch to imagine that as soon as the first error (e.g. from ensureClosedTimestampStarted) is returned, the goroutine reading these messages terminates, causing the subsequent attempt to send the error to deadlock.

Another hypothetical issue is how the mux rangefeed sent the error when the future completed. Prior to this change, this happened inline (via the `WhenReady` closure). This is dangerous since the closure may run while important locks (such as the raft mu) are being held. What could happen is that the mux rangefeed encounters a retryable error. The future is prepared with the error value, which causes an error to be sent to the client. This happens with some lock being held. The client notices this error and attempts to restart the rangefeed -- to the same server, and that could block, at least in theory. Regardless, performing IO while locks could potentially be held is not a good idea. This PR fixes this problem by shunting the logical rangefeed completion notification to a dedicated goroutine.

Informs #99560
Informs #99640
Informs #99214
Informs #98925
Informs #99092
Informs #99212
Informs #99910
Informs #99560

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
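
The fix described above boils down to a hand-off pattern: rather than sending the completion/termination message inline from whatever goroutine completes the future (which may be holding locks such as the raft mu), the notification is queued to a dedicated goroutine that owns the potentially blocking send. Below is a generic, hypothetical sketch of that pattern; the `muxStream`, `completion`, and `notifyLoop` names are invented and this is not the actual kvcoord implementation.

```go
package main

import "fmt"

// completion is a stand-in for a rangefeed-termination notification
// (hypothetical type; the real message carries a range ID and an error).
type completion struct {
	rangeID int64
	err     error
}

// muxStream is a toy stand-in for a mux rangefeed server stream.
type muxStream struct {
	completions chan completion
	done        chan struct{}
}

func newMuxStream() *muxStream {
	s := &muxStream{
		completions: make(chan completion, 16),
		done:        make(chan struct{}),
	}
	go s.notifyLoop() // dedicated goroutine owns all potentially blocking sends
	return s
}

// notifyLoop drains completion notifications and performs the (potentially
// blocking) send to the client without holding any of the caller's locks.
func (s *muxStream) notifyLoop() {
	defer close(s.done)
	for c := range s.completions {
		// In the real system this would be a Send on the RPC stream; printing
		// stands in for that I/O here.
		fmt.Printf("range r%d rangefeed completed: %v\n", c.rangeID, c.err)
	}
}

// onRangefeedDone is what a WhenReady-style closure would call. It only
// enqueues the notification for the dedicated goroutine; a real implementation
// would additionally need to guarantee the enqueue itself cannot block.
func (s *muxStream) onRangefeedDone(rangeID int64, err error) {
	s.completions <- completion{rangeID: rangeID, err: err}
}

func main() {
	s := newMuxStream()
	s.onRangefeedDone(42, nil)
	s.onRangefeedDone(7, fmt.Errorf("retryable: closed timestamp not started"))
	close(s.completions)
	<-s.done
}
```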
blathers-crl bot pushed a commit that referenced this issue Apr 13, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 13, 2023
@miretskiy
Contributor

@arulajmani this issue hasn't triggered for 2 weeks on the master. Did you disable metamorphic tests?
If so, I would love to see if this issue still reproduces with the recent muxrf changes.

@arulajmani
Collaborator

this issue hasn't triggered for 2 weeks on the master. Did you disable metamorphic tests?

No, we didn't disable metamorphic tests. 2 weeks is a good sign, but FWIW, this wasn't reproducing very readily.

@github-actions

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.
