pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test: TestSQLWatcherMultiple failed #98925

Closed
cockroach-teamcity opened this issue Mar 18, 2023 · 4 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). no-test-failure-activity O-robot Originated from a bot. T-kv KV Team X-stale

cockroach-teamcity commented Mar 18, 2023

pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test.TestSQLWatcherMultiple failed with artifacts on master @ ad4463c6f2759b1a729184097b315a6b46f8d922:

I230318 06:04:34.002940 7180 jobs/adopt.go:261  [T1,n1] 962  job 102: resuming execution
E230318 06:04:34.006757 7180 jobs/adopt.go:225  [T1,n1] 963  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.020786 7185 jobs/adopt.go:261  [T1,n1] 964  job 102: resuming execution
E230318 06:04:34.025350 7185 jobs/adopt.go:225  [T1,n1] 965  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.039710 7036 jobs/adopt.go:261  [T1,n1] 966  job 102: resuming execution
E230318 06:04:34.058460 7036 jobs/adopt.go:225  [T1,n1] 967  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.073589 7199 jobs/adopt.go:261  [T1,n1] 968  job 102: resuming execution
E230318 06:04:34.077416 7199 jobs/adopt.go:225  [T1,n1] 969  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.090596 7236 jobs/adopt.go:261  [T1,n1] 970  job 102: resuming execution
E230318 06:04:34.094634 7236 jobs/adopt.go:225  [T1,n1] 971  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.109017 7242 jobs/adopt.go:261  [T1,n1] 972  job 102: resuming execution
E230318 06:04:34.112793 7242 jobs/adopt.go:225  [T1,n1] 973  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.127350 7248 jobs/adopt.go:261  [T1,n1] 974  job 102: resuming execution
E230318 06:04:34.131071 7248 jobs/adopt.go:225  [T1,n1] 975  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.145815 7229 jobs/adopt.go:261  [T1,n1] 976  job 102: resuming execution
E230318 06:04:34.149665 7229 jobs/adopt.go:225  [T1,n1] 977  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.167616 7285 jobs/adopt.go:261  [T1,n1] 978  job 102: resuming execution
E230318 06:04:34.175317 7285 jobs/adopt.go:225  [T1,n1] 979  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.189554 7216 jobs/adopt.go:261  [T1,n1] 980  job 102: resuming execution
E230318 06:04:34.192795 7216 jobs/adopt.go:225  [T1,n1] 981  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.206846 7304 jobs/adopt.go:261  [T1,n1] 982  job 102: resuming execution
E230318 06:04:34.211430 7304 jobs/adopt.go:225  [T1,n1] 983  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.226543 7310 jobs/adopt.go:261  [T1,n1] 984  job 102: resuming execution
E230318 06:04:34.231578 7310 jobs/adopt.go:225  [T1,n1] 985  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.245408 7292 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 986  established range feed over system.descriptors starting at time 1679119473.687549027,0
I230318 06:04:34.245464 7292 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 987  established range feed over system.zones starting at time 1679119473.687549027,0
I230318 06:04:34.245507 7292 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 988  established range feed over system.protected_ts_records starting at time 1679119473.687549027,0
I230318 06:04:34.245581 7293 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 989  established range feed over system.descriptors starting at time 1679119473.687549027,0
I230318 06:04:34.245618 7293 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 990  established range feed over system.zones starting at time 1679119473.687549027,0
I230318 06:04:34.245655 7293 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 991  established range feed over system.protected_ts_records starting at time 1679119473.687549027,0
I230318 06:04:34.247033 7133 jobs/adopt.go:261  [T1,n1] 992  job 102: resuming execution
    sqlwatcher_test.go:346: 
        	Error Trace:	pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test/pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/sqlwatcher_test.go:346
        	            				GOROOT/src/runtime/asm_amd64.s:1594
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestSQLWatcherMultiple
I230318 06:04:34.248978 5157 testutils/testcluster/testcluster.go:148  [-] 993  TestCluster quiescing nodes
I230318 06:04:34.249234 6415 jobs/registry.go:1494  [T1,n1] 994  KEY VISUALIZER job 100: stepping through state succeeded
W230318 06:04:34.249319 6384 kv/kvserver/intentresolver/intent_resolver.go:826  [-] 995  failed to gc transaction record: could not GC completed transaction anchored at /Table/3/1/102/2/1: node unavailable; try another peer
I230318 06:04:34.249372 4952 server/start_listen.go:103  [T1,n1] 996  server shutting down: instructing cmux to stop accepting
W230318 06:04:34.249542 5087 sql/sqlliveness/slinstance/slinstance.go:312  [n1] 998  exiting heartbeat loop
I230318 06:04:34.249552 6954 rpc/context.go:2249  [T1,n1,rnode=1,raddr=127.0.0.1:42991,class=default,rpc] 999  connection heartbeat loop ended with err: <nil>
W230318 06:04:34.249570 5087 sql/sqlliveness/slinstance/slinstance.go:297  [n1] 1000  exiting heartbeat loop
E230318 06:04:34.249586 5087 server/server_sql.go:490  [n1] 1001  failed to run update of instance with new session ID: node unavailable; try another peer
I230318 06:04:34.249739 5611 sql/stats/automatic_stats.go:507  [T1,n1] 1002  quiescing auto stats refresher
W230318 06:04:34.249527 5617 jobs/registry.go:1052  [T1,n1] 997  canceling all adopted jobs due to stopper quiescing
E230318 06:04:34.250980 6635 jobs/registry.go:924  [T1,n1] 1003  error getting live session: node unavailable; try another peer
--- FAIL: TestSQLWatcherMultiple (1.71s)
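(The "Not equal: expected: 1 / actual: 0" trace above is testify's standard equality-failure output. A minimal, hypothetical Go sketch of the shape of such a check follows; the names are invented for illustration and are not the actual code at sqlwatcher_test.go:346. Reading the trace that way, the test expected a count of 1 by the time the assertion ran but had observed 0.)

```go
package example

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// TestExpectedVsActual mimics the shape of the failing assertion: an expected
// count of 1 is compared against an observed count that is still 0 when the
// check runs, and testify reports "Not equal: expected: 1, actual: 0".
func TestExpectedVsActual(t *testing.T) {
	var observedEvents []string // hypothetical stand-in for whatever the real test counts
	require.Equal(t, 1, len(observedEvents))
}
```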

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-25594

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Mar 18, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 18, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 18, 2023
@cockroach-teamcity (Member Author) commented:

pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test.TestSQLWatcherMultiple failed with artifacts on master @ 53dbb86acb1d48309530181b94838faf937084d3:

I230319 06:21:35.730407 1511580 3@pebble/event.go:734  [n1,s1,pebble] 758  [JOB 7] WAL created 000006
I230319 06:21:35.734662 1504562 1@server/server_sql.go:1644  [T1,n1] 759  serving sql connections
I230319 06:21:35.736259 1504562 testutils/testcluster/testcluster.go:1352  [-] 760  WaitForFullReplication
I230319 06:21:35.736387 1504562 testutils/testcluster/testcluster.go:1356  [-] 761  WaitForFullReplication took: 335ns
I230319 06:21:35.752216 2026225 jobs/adopt.go:261  [T1,n1] 762  job 102: resuming execution
I230319 06:21:35.772128 2026047 util/log/event_log.go:32  [T1,n1,client=127.0.0.1:33014,user=root] 763 ={"Timestamp":1679206895754185989,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.rangefeed.enabled\" = true","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"kv.rangefeed.enabled","Value":"true"}
I230319 06:21:35.791624 2026047 util/log/event_log.go:32  [T1,n1,client=127.0.0.1:33014,user=root] 764 ={"Timestamp":1679206895773744773,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.closed_timestamp.target_duration\" = '100ms'","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"kv.closed_timestamp.target_duration","Value":"00:00:00.1"}
E230319 06:21:35.805783 2026225 jobs/adopt.go:225  [T1,n1] 765  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:35.809015 2033139 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r3/1:/System/{NodeLive…-tsd}] 766  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230319 06:21:35.826393 2026047 5@util/log/event_log.go:32  [T1,n1,client=127.0.0.1:33014,user=root] 767 ={"Timestamp":1679206895792702326,"EventType":"create_table","Statement":"CREATE TABLE defaultdb.public.t ()","Tag":"CREATE TABLE","User":"root","DescriptorID":104,"TableName":"defaultdb.public.t"}
I230319 06:21:35.871352 2039969 jobs/adopt.go:261  [T1,n1] 768  job 102: resuming execution
E230319 06:21:35.899255 2039969 jobs/adopt.go:225  [T1,n1] 769  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:35.900420 2043248 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 770  established range feed over system.descriptors starting at time 1679206895.792448291,0
I230319 06:21:35.905270 2043248 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 771  established range feed over system.zones starting at time 1679206895.792448291,0
I230319 06:21:35.906206 2043248 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 772  established range feed over system.protected_ts_records starting at time 1679206895.792448291,0
I230319 06:21:35.911513 2044100 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r34/1:/Table/3{2-3}] 773  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230319 06:21:35.945009 2046031 jobs/adopt.go:261  [T1,n1] 774  job 102: resuming execution
E230319 06:21:35.992199 2046031 jobs/adopt.go:225  [T1,n1] 775  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:36.029074 2050686 jobs/adopt.go:261  [T1,n1] 776  job 102: resuming execution
E230319 06:21:36.064120 2050686 jobs/adopt.go:225  [T1,n1] 777  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:36.095311 2055455 jobs/adopt.go:261  [T1,n1] 778  job 102: resuming execution
E230319 06:21:36.121540 2055455 jobs/adopt.go:225  [T1,n1] 779  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:36.158237 2059792 jobs/adopt.go:261  [T1,n1] 780  job 102: resuming execution
    sqlwatcher_test.go:346: 
        	Error Trace:	pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test/pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/sqlwatcher_test.go:346
        	            				GOROOT/src/runtime/asm_amd64.s:1594
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestSQLWatcherMultiple
I230319 06:21:36.175683 2061165 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 781  established range feed over system.descriptors starting at time 1679206895.792448291,0
I230319 06:21:36.175819 2061165 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 782  established range feed over system.zones starting at time 1679206895.792448291,0
I230319 06:21:36.175806 2061166 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 783  established range feed over system.descriptors starting at time 1679206895.792448291,0
I230319 06:21:36.176231 2061166 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 784  established range feed over system.zones starting at time 1679206895.792448291,0
I230319 06:21:36.176556 2061165 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 785  established range feed over system.protected_ts_records starting at time 1679206895.792448291,0
I230319 06:21:36.176901 2061166 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 786  established range feed over system.protected_ts_records starting at time 1679206895.792448291,0
I230319 06:21:36.181453 1504562 testutils/testcluster/testcluster.go:148  [-] 787  TestCluster quiescing nodes
I230319 06:21:36.182024 1509233 server/start_listen.go:103  [T1,n1] 788  server shutting down: instructing cmux to stop accepting
W230319 06:21:36.182828 2038349 kv/kvserver/intentresolver/intent_resolver.go:826  [-] 789  failed to gc transaction record: could not GC completed transaction anchored at /Table/3/1/104/2/1: node unavailable; try another peer
I230319 06:21:36.182942 1524017 sql/stats/automatic_stats.go:507  [T1,n1] 790  quiescing auto stats refresher
I230319 06:21:36.184372 1886025 jobs/registry.go:1494  [T1,n1] 791  KEY VISUALIZER job 100: stepping through state succeeded
W230319 06:21:36.183006 1522168 sql/sqlliveness/slinstance/slinstance.go:312  [n1] 792  exiting heartbeat loop
W230319 06:21:36.184752 1522168 sql/sqlliveness/slinstance/slinstance.go:297  [n1] 793  exiting heartbeat loop
E230319 06:21:36.184795 1522168 server/server_sql.go:490  [n1] 794  failed to run update of instance with new session ID: node unavailable; try another peer
I230319 06:21:36.183743 1609055 rpc/context.go:2249  [T1,n1,rnode=1,raddr=127.0.0.1:35305,class=default,rpc] 795  connection heartbeat loop ended with err: <nil>
E230319 06:21:36.186132 1524057 jobs/registry.go:968  [T1,n1] 796  error expiring job sessions: expire-sessions: failed to read query result: query execution canceled
E230319 06:21:36.186791 1524057 jobs/registry.go:976  [T1,n1] 797  failed to serve pause and cancel requests: txn exec: context canceled
W230319 06:21:36.186954 1524057 jobs/registry.go:1052  [T1,n1] 798  canceling all adopted jobs due to stopper quiescing
E230319 06:21:36.187169 1938660 jobs/registry.go:924  [T1,n1] 799  error getting live session: node unavailable; try another peer
--- FAIL: TestSQLWatcherMultiple (6.72s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@irfansharif (Contributor) commented:

12959 runs so far, 0 failures, over 9m20s

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 30, 2023
Make sure mux rangefeed uses correct context if it needs
to restart range feeds.

Fix catchup reservation metric accounting in mux rangefeed.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 31, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 3, 2023
Make sure mux rangefeed uses correct context if it needs
to restart range feeds.

Fix catchup reservation metric accounting in mux rangefeed.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560
Informs cockroachdb#100468

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
Restart ranges on a dedicated goroutine (if needed).
Fix logic bug in stuck range handling.
Increase verbosity of logging to help debug mux rangefeed issues.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560
Informs cockroachdb#100468

Release note: None
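The "restart on a dedicated goroutine" approach mentioned above is a common Go pattern: the code path that detects a failed rangefeed only enqueues a restart request, and a long-lived worker goroutine performs the actual restart. A minimal sketch of that pattern, with invented types and names (this is not the actual kvcoord code):

```go
package main

import (
	"context"
	"errors"
	"log"
)

// restartRequest describes a range whose feed needs to be re-established
// (hypothetical type, for illustration only).
type restartRequest struct {
	rangeID int64
	reason  error
}

// runRestartWorker drains restart requests on a dedicated goroutine so the
// code path that detected the failure never restarts (or blocks) inline.
func runRestartWorker(ctx context.Context, requests <-chan restartRequest,
	restart func(context.Context, restartRequest) error) {
	for {
		select {
		case <-ctx.Done():
			return
		case req := <-requests:
			log.Printf("restarting rangefeed for r%d: %v", req.rangeID, req.reason)
			if err := restart(ctx, req); err != nil {
				log.Printf("restart of r%d failed: %v", req.rangeID, err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	requests := make(chan restartRequest, 16)
	done := make(chan struct{})
	go runRestartWorker(ctx, requests, func(context.Context, restartRequest) error {
		close(done) // a real implementation would re-establish the rangefeed here
		return nil
	})

	// The detection path only enqueues; the worker does the rest.
	requests <- restartRequest{rangeID: 42, reason: errors.New("range feed disconnected")}
	<-done
}
```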
craig bot pushed a commit that referenced this issue Apr 4, 2023
100189:  kvcoord: Restart ranges on a dedicated goroutine. r=miretskiy a=miretskiy

Restart ranges on a dedicated goroutine (if needed).
Fix logic bug in stuck range handling.
Increase verbosity of logging to help debug mux rangefeed issues.

Informs #99560
Informs #99640
Informs #99214
Informs #98925
Informs #99092
Informs #99212
Informs #99910
Informs #99560

Release note: None

100525: rpc: Handle closed error r=erikgrinaker a=andrewbaptist

We close the listener before closing the connection. This can result in a spurious failure due to the Listener also closing our connection.

Epic: none
Fixes: #100391
Fixes: #77754
Informs: #80034

Release note: None
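The shutdown race described in 100525 is the usual "listener closed first" ordering problem: once the listener is closed, errors caused by the close itself should be treated as expected rather than as failures. A minimal Go sketch of the general idea, using the standard library's net.ErrClosed sentinel (an illustration only, not the actual rpc package change):

```go
package main

import (
	"errors"
	"log"
	"net"
)

// acceptLoop accepts connections until the listener is closed. Errors caused
// by the shutdown itself (net.ErrClosed) are treated as a clean exit rather
// than a failure, which is the kind of spurious error the fix tolerates.
func acceptLoop(ln net.Listener, handle func(net.Conn)) error {
	for {
		conn, err := ln.Accept()
		if err != nil {
			if errors.Is(err, net.ErrClosed) {
				return nil // listener closed during shutdown: not an error
			}
			return err
		}
		go handle(conn)
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	done := make(chan error, 1)
	go func() { done <- acceptLoop(ln, func(c net.Conn) { c.Close() }) }()

	// Shutdown order mirrors the description above: close the listener first;
	// the accept loop must not report that as a failure.
	ln.Close()
	if err := <-done; err != nil {
		log.Fatal(err)
	}
}
```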

100528: sql: fix flaky TestSQLStatsCompactor r=j82w a=j82w

The test failure shows more total wide scans
than expected. Change the compact stats job to run
once a year so that it does not run at the same time
as the test.

The interceptor is disabled right after the delete,
reducing the possibility of another operation
causing a conflict.

Epic: none
closes: #99653

Release note: none

100589: allocator: deflake full disk test r=andrewbaptist a=kvoli

In #97409 we introduced cluster settings to control the disk fullness thresholds for rebalancing towards a store and for shedding replicas off of a store. `TestAllocatorFullDisks` assumes the total number of range bytes is at or below the rebalance threshold of the nodes; however, the test was updated to use the shed threshold instead. This caused the test to flake occasionally, as there was more than the expected amount of total range bytes.

This patch changes the ranges-per-node calculation to use the rebalance threshold again, instead of the shed threshold.

```
dev test pkg/kv/kvserver/allocator/allocatorimpl -f TestAllocatorFullDisks -v --stress
...
15714 runs so far, 0 failures, over 39m45s
```

Fixes: #100033

Release note: None
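For intuition about the threshold mix-up in 100589, a toy Go sketch of the sizing arithmetic; the capacity and threshold values below are purely illustrative and are not the test's actual constants:

```go
package main

import "fmt"

// rangeByteBudget is how many bytes of range data a node can hold before its
// disk utilization crosses the given fullness threshold.
func rangeByteBudget(capacityBytes int64, threshold float64) int64 {
	return int64(float64(capacityBytes) * threshold)
}

func main() {
	const capacity = int64(100 << 30) // 100 GiB per node, illustrative
	rebalanceThreshold := 0.90        // stores above this stop receiving rebalances (illustrative)
	shedThreshold := 0.95             // stores above this start shedding replicas (illustrative)

	// Sizing the test's total range bytes against the higher shed threshold
	// admits more data than the rebalance threshold allows stores to absorb,
	// which is the occasional flake the patch removes by going back to the
	// rebalance threshold.
	fmt.Println("budget at rebalance threshold:", rangeByteBudget(capacity, rebalanceThreshold))
	fmt.Println("budget at shed threshold:     ", rangeByteBudget(capacity, shedThreshold))
}
```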

100610: roachtest: set config.Quiet to true r=herkolategan a=srosenberg

After refactoring in [1], the default of config.Quiet was set to false, since the roachprod CLI option is intended to set it to true. This resulted in an unwanted side effect: roachtests now ran with the new default. Consequently, test_runner's log ended up with a bunch of terminal escape codes from the (status) spinner.

This change ensures roachtest explicitly sets config.Quiet to true.

[1] #99133

Epic: none

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
Prior to this change, there were cases where a future
used to wait for a single range feed completion could
be completed multiple times, or a message about range
feed termination could be sent multiple times on a single
mux rangefeed stream.

One of those cases was a check for `ensureClosedTimestampStarted`.
If this method returned an error, we would immediately send
the error on the rpc stream, and then complete the future
with a nil error.

Another instance was when the registry would `DisconnectWithErr` --
in that case, we would first complete the future in this method,
and then complete it again later.

It appears that completing a future multiple times should be
okay; however, it is still a bit worrisome.  The deadlocks observed
were all in the local RPC bypass (`rpc/context.go`), and it's
not a stretch to imagine that as soon as the first error
(e.g. from ensureClosedTimestampStarted) is returned, the
goroutine reading these messages terminates, causing the
subsequent attempt to send the error to deadlock.

Another hypothetical issue is how the mux rangefeed sent
the error when the future completed.  Prior to this change, this
happened inline (via the `WhenReady` closure).  This is dangerous
since the closure may run while important locks (such as raft mu)
are held.  What could happen is that the mux rangefeed encounters
a retryable error.  The future is prepared with the error value,
which causes an error to be sent to the client.  This happens with
some lock held.  The client notices this error and attempts to
restart the rangefeed -- against the same server, and that could
block, at least in theory.  Regardless, performing IO while locks
could potentially be held is not a good idea.  This PR fixes the
problem by shunting the logical rangefeed completion notification
to a dedicated goroutine.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560

Release note: None
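Two of the patterns described in this commit message can be sketched compactly: completing a one-shot future at most once even if completion is attempted multiple times, and delivering the completion notification from a dedicated goroutine rather than inline while locks may be held. The sketch below is a generic illustration under those assumptions, not the actual future/kvcoord implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// errFuture is a one-shot completion future: Complete may be called many
// times (as the commit message notes can happen), but only the first call
// wins.
type errFuture struct {
	once sync.Once
	done chan error
}

func newErrFuture() *errFuture { return &errFuture{done: make(chan error, 1)} }

func (f *errFuture) Complete(err error) {
	f.once.Do(func() { f.done <- err })
}

func main() {
	f := newErrFuture()

	// The completion notification is delivered from a dedicated goroutine, so
	// the code path that completes the future (possibly while holding locks)
	// never performs the potentially blocking send to the client itself.
	notified := make(chan struct{})
	go func() {
		err := <-f.done
		fmt.Println("rangefeed completed:", err)
		close(notified)
	}()

	f.Complete(fmt.Errorf("retryable rangefeed error")) // first completion wins
	f.Complete(nil)                                     // ignored: already completed
	<-notified
}
```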
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 7, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 9, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 12, 2023
craig bot pushed a commit that referenced this issue Apr 13, 2023
100649: kvcoord: Rework error propagation in mux rangefeed r=miretskiy a=miretskiy

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Apr 13, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 13, 2023
@miretskiy (Contributor) commented:

@irfansharif were you able to reproduce this failure? It seems like nothing happened for two weeks. Why is that? Was there something you did on your side (disable metamorphic tests, perhaps)?

At any rate, I would love to know whether this issue reproduces with mux rangefeed given the recent changes.

@github-actions (bot) commented:

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.
