pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test: TestSQLWatcherMultiple failed #98925

Closed
cockroach-teamcity opened this issue Mar 18, 2023 · 4 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). no-test-failure-activity O-robot Originated from a bot. T-kv KV Team X-stale

cockroach-teamcity commented Mar 18, 2023

pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test.TestSQLWatcherMultiple failed with artifacts on master @ ad4463c6f2759b1a729184097b315a6b46f8d922:

I230318 06:04:34.002940 7180 jobs/adopt.go:261  [T1,n1] 962  job 102: resuming execution
E230318 06:04:34.006757 7180 jobs/adopt.go:225  [T1,n1] 963  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.020786 7185 jobs/adopt.go:261  [T1,n1] 964  job 102: resuming execution
E230318 06:04:34.025350 7185 jobs/adopt.go:225  [T1,n1] 965  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.039710 7036 jobs/adopt.go:261  [T1,n1] 966  job 102: resuming execution
E230318 06:04:34.058460 7036 jobs/adopt.go:225  [T1,n1] 967  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.073589 7199 jobs/adopt.go:261  [T1,n1] 968  job 102: resuming execution
E230318 06:04:34.077416 7199 jobs/adopt.go:225  [T1,n1] 969  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.090596 7236 jobs/adopt.go:261  [T1,n1] 970  job 102: resuming execution
E230318 06:04:34.094634 7236 jobs/adopt.go:225  [T1,n1] 971  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.109017 7242 jobs/adopt.go:261  [T1,n1] 972  job 102: resuming execution
E230318 06:04:34.112793 7242 jobs/adopt.go:225  [T1,n1] 973  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.127350 7248 jobs/adopt.go:261  [T1,n1] 974  job 102: resuming execution
E230318 06:04:34.131071 7248 jobs/adopt.go:225  [T1,n1] 975  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.145815 7229 jobs/adopt.go:261  [T1,n1] 976  job 102: resuming execution
E230318 06:04:34.149665 7229 jobs/adopt.go:225  [T1,n1] 977  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.167616 7285 jobs/adopt.go:261  [T1,n1] 978  job 102: resuming execution
E230318 06:04:34.175317 7285 jobs/adopt.go:225  [T1,n1] 979  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.189554 7216 jobs/adopt.go:261  [T1,n1] 980  job 102: resuming execution
E230318 06:04:34.192795 7216 jobs/adopt.go:225  [T1,n1] 981  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.206846 7304 jobs/adopt.go:261  [T1,n1] 982  job 102: resuming execution
E230318 06:04:34.211430 7304 jobs/adopt.go:225  [T1,n1] 983  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.226543 7310 jobs/adopt.go:261  [T1,n1] 984  job 102: resuming execution
E230318 06:04:34.231578 7310 jobs/adopt.go:225  [T1,n1] 985  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230318 06:04:34.245408 7292 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 986  established range feed over system.descriptors starting at time 1679119473.687549027,0
I230318 06:04:34.245464 7292 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 987  established range feed over system.zones starting at time 1679119473.687549027,0
I230318 06:04:34.245507 7292 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 988  established range feed over system.protected_ts_records starting at time 1679119473.687549027,0
I230318 06:04:34.245581 7293 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 989  established range feed over system.descriptors starting at time 1679119473.687549027,0
I230318 06:04:34.245618 7293 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 990  established range feed over system.zones starting at time 1679119473.687549027,0
I230318 06:04:34.245655 7293 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 991  established range feed over system.protected_ts_records starting at time 1679119473.687549027,0
I230318 06:04:34.247033 7133 jobs/adopt.go:261  [T1,n1] 992  job 102: resuming execution
    sqlwatcher_test.go:346: 
        	Error Trace:	pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test/pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/sqlwatcher_test.go:346
        	            				GOROOT/src/runtime/asm_amd64.s:1594
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestSQLWatcherMultiple
I230318 06:04:34.248978 5157 testutils/testcluster/testcluster.go:148  [-] 993  TestCluster quiescing nodes
I230318 06:04:34.249234 6415 jobs/registry.go:1494  [T1,n1] 994  KEY VISUALIZER job 100: stepping through state succeeded
W230318 06:04:34.249319 6384 kv/kvserver/intentresolver/intent_resolver.go:826  [-] 995  failed to gc transaction record: could not GC completed transaction anchored at /Table/3/1/102/2/1: node unavailable; try another peer
I230318 06:04:34.249372 4952 server/start_listen.go:103  [T1,n1] 996  server shutting down: instructing cmux to stop accepting
W230318 06:04:34.249542 5087 sql/sqlliveness/slinstance/slinstance.go:312  [n1] 998  exiting heartbeat loop
I230318 06:04:34.249552 6954 rpc/context.go:2249  [T1,n1,rnode=1,raddr=127.0.0.1:42991,class=default,rpc] 999  connection heartbeat loop ended with err: <nil>
W230318 06:04:34.249570 5087 sql/sqlliveness/slinstance/slinstance.go:297  [n1] 1000  exiting heartbeat loop
E230318 06:04:34.249586 5087 server/server_sql.go:490  [n1] 1001  failed to run update of instance with new session ID: node unavailable; try another peer
I230318 06:04:34.249739 5611 sql/stats/automatic_stats.go:507  [T1,n1] 1002  quiescing auto stats refresher
W230318 06:04:34.249527 5617 jobs/registry.go:1052  [T1,n1] 997  canceling all adopted jobs due to stopper quiescing
E230318 06:04:34.250980 6635 jobs/registry.go:924  [T1,n1] 1003  error getting live session: node unavailable; try another peer
--- FAIL: TestSQLWatcherMultiple (1.71s)
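(The "Not equal: expected: 1 / actual: 0" trace above is testify's standard equality-failure output. A minimal, hypothetical Go sketch of the shape of such a check follows; the names are invented for illustration and are not the actual code at sqlwatcher_test.go:346. Reading the trace that way, the test expected a count of 1 by the time the assertion ran but had observed 0.)

```go
package example

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// TestExpectedVsActual mimics the shape of the failing assertion: an expected
// count of 1 is compared against an observed count that is still 0 when the
// check runs, and testify reports "Not equal: expected: 1, actual: 0".
func TestExpectedVsActual(t *testing.T) {
	var observedEvents []string // hypothetical stand-in for whatever the real test counts
	require.Equal(t, 1, len(observedEvents))
}
```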

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-25594

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Mar 18, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 18, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 18, 2023
@cockroach-teamcity (Member Author) commented:

pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test.TestSQLWatcherMultiple failed with artifacts on master @ 53dbb86acb1d48309530181b94838faf937084d3:

I230319 06:21:35.730407 1511580 3@pebble/event.go:734  [n1,s1,pebble] 758  [JOB 7] WAL created 000006
I230319 06:21:35.734662 1504562 1@server/server_sql.go:1644  [T1,n1] 759  serving sql connections
I230319 06:21:35.736259 1504562 testutils/testcluster/testcluster.go:1352  [-] 760  WaitForFullReplication
I230319 06:21:35.736387 1504562 testutils/testcluster/testcluster.go:1356  [-] 761  WaitForFullReplication took: 335ns
I230319 06:21:35.752216 2026225 jobs/adopt.go:261  [T1,n1] 762  job 102: resuming execution
I230319 06:21:35.772128 2026047 util/log/event_log.go:32  [T1,n1,client=127.0.0.1:33014,user=root] 763 ={"Timestamp":1679206895754185989,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.rangefeed.enabled\" = true","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"kv.rangefeed.enabled","Value":"true"}
I230319 06:21:35.791624 2026047 util/log/event_log.go:32  [T1,n1,client=127.0.0.1:33014,user=root] 764 ={"Timestamp":1679206895773744773,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"kv.closed_timestamp.target_duration\" = '100ms'","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"kv.closed_timestamp.target_duration","Value":"00:00:00.1"}
E230319 06:21:35.805783 2026225 jobs/adopt.go:225  [T1,n1] 765  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:35.809015 2033139 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r3/1:/System/{NodeLive…-tsd}] 766  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230319 06:21:35.826393 2026047 5@util/log/event_log.go:32  [T1,n1,client=127.0.0.1:33014,user=root] 767 ={"Timestamp":1679206895792702326,"EventType":"create_table","Statement":"CREATE TABLE defaultdb.public.t ()","Tag":"CREATE TABLE","User":"root","DescriptorID":104,"TableName":"defaultdb.public.t"}
I230319 06:21:35.871352 2039969 jobs/adopt.go:261  [T1,n1] 768  job 102: resuming execution
E230319 06:21:35.899255 2039969 jobs/adopt.go:225  [T1,n1] 769  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:35.900420 2043248 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 770  established range feed over system.descriptors starting at time 1679206895.792448291,0
I230319 06:21:35.905270 2043248 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 771  established range feed over system.zones starting at time 1679206895.792448291,0
I230319 06:21:35.906206 2043248 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 772  established range feed over system.protected_ts_records starting at time 1679206895.792448291,0
I230319 06:21:35.911513 2044100 13@kv/kvserver/replicate_queue.go:842  [T1,n1,replicate,s1,r34/1:/Table/3{2-3}] 773  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230319 06:21:35.945009 2046031 jobs/adopt.go:261  [T1,n1] 774  job 102: resuming execution
E230319 06:21:35.992199 2046031 jobs/adopt.go:225  [T1,n1] 775  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:36.029074 2050686 jobs/adopt.go:261  [T1,n1] 776  job 102: resuming execution
E230319 06:21:36.064120 2050686 jobs/adopt.go:225  [T1,n1] 777  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:36.095311 2055455 jobs/adopt.go:261  [T1,n1] 778  job 102: resuming execution
E230319 06:21:36.121540 2055455 jobs/adopt.go:225  [T1,n1] 779  could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
I230319 06:21:36.158237 2059792 jobs/adopt.go:261  [T1,n1] 780  job 102: resuming execution
    sqlwatcher_test.go:346: 
        	Error Trace:	pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/spanconfigsqlwatcherccl_test/pkg/ccl/spanconfigccl/spanconfigsqlwatcherccl/sqlwatcher_test.go:346
        	            				GOROOT/src/runtime/asm_amd64.s:1594
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 0
        	Test:       	TestSQLWatcherMultiple
I230319 06:21:36.175683 2061165 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 781  established range feed over system.descriptors starting at time 1679206895.792448291,0
I230319 06:21:36.175819 2061165 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 782  established range feed over system.zones starting at time 1679206895.792448291,0
I230319 06:21:36.175806 2061166 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:245  [-] 783  established range feed over system.descriptors starting at time 1679206895.792448291,0
I230319 06:21:36.176231 2061166 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:304  [-] 784  established range feed over system.zones starting at time 1679206895.792448291,0
I230319 06:21:36.176556 2061165 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 785  established range feed over system.protected_ts_records starting at time 1679206895.792448291,0
I230319 06:21:36.176901 2061166 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:400  [-] 786  established range feed over system.protected_ts_records starting at time 1679206895.792448291,0
I230319 06:21:36.181453 1504562 testutils/testcluster/testcluster.go:148  [-] 787  TestCluster quiescing nodes
I230319 06:21:36.182024 1509233 server/start_listen.go:103  [T1,n1] 788  server shutting down: instructing cmux to stop accepting
W230319 06:21:36.182828 2038349 kv/kvserver/intentresolver/intent_resolver.go:826  [-] 789  failed to gc transaction record: could not GC completed transaction anchored at /Table/3/1/104/2/1: node unavailable; try another peer
I230319 06:21:36.182942 1524017 sql/stats/automatic_stats.go:507  [T1,n1] 790  quiescing auto stats refresher
I230319 06:21:36.184372 1886025 jobs/registry.go:1494  [T1,n1] 791  KEY VISUALIZER job 100: stepping through state succeeded
W230319 06:21:36.183006 1522168 sql/sqlliveness/slinstance/slinstance.go:312  [n1] 792  exiting heartbeat loop
W230319 06:21:36.184752 1522168 sql/sqlliveness/slinstance/slinstance.go:297  [n1] 793  exiting heartbeat loop
E230319 06:21:36.184795 1522168 server/server_sql.go:490  [n1] 794  failed to run update of instance with new session ID: node unavailable; try another peer
I230319 06:21:36.183743 1609055 rpc/context.go:2249  [T1,n1,rnode=1,raddr=127.0.0.1:35305,class=default,rpc] 795  connection heartbeat loop ended with err: <nil>
E230319 06:21:36.186132 1524057 jobs/registry.go:968  [T1,n1] 796  error expiring job sessions: expire-sessions: failed to read query result: query execution canceled
E230319 06:21:36.186791 1524057 jobs/registry.go:976  [T1,n1] 797  failed to serve pause and cancel requests: txn exec: context canceled
W230319 06:21:36.186954 1524057 jobs/registry.go:1052  [T1,n1] 798  canceling all adopted jobs due to stopper quiescing
E230319 06:21:36.187169 1938660 jobs/registry.go:924  [T1,n1] 799  error getting live session: node unavailable; try another peer
--- FAIL: TestSQLWatcherMultiple (6.72s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@irfansharif (Contributor) commented:

12959 runs so far, 0 failures, over 9m20s

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 30, 2023
Make sure mux rangefeed uses correct context if it needs
to restart range feeds.

Fix catchup reservation metric accounting in mux rangefeed.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 31, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 3, 2023
Make sure mux rangefeed uses correct context if it needs
to restart range feeds.

Fix catchup reservation metric accounting in mux rangefeed.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560
Informs cockroachdb#100468

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
Restart ranges on a dedicated goroutine (if needed).
Fix logic bug in stuck range handling.
Increase verbosity of logging to help debug mux rangefeed issues.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560
Informs cockroachdb#100468

Release note: None
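The "restart on a dedicated goroutine" approach mentioned above is a common Go pattern: the code path that detects a failed rangefeed only enqueues a restart request, and a long-lived worker goroutine performs the actual restart. A minimal sketch of that pattern, with invented types and names (this is not the actual kvcoord code):

```go
package main

import (
	"context"
	"errors"
	"log"
)

// restartRequest describes a range whose feed needs to be re-established
// (hypothetical type, for illustration only).
type restartRequest struct {
	rangeID int64
	reason  error
}

// runRestartWorker drains restart requests on a dedicated goroutine so the
// code path that detected the failure never restarts (or blocks) inline.
func runRestartWorker(ctx context.Context, requests <-chan restartRequest,
	restart func(context.Context, restartRequest) error) {
	for {
		select {
		case <-ctx.Done():
			return
		case req := <-requests:
			log.Printf("restarting rangefeed for r%d: %v", req.rangeID, req.reason)
			if err := restart(ctx, req); err != nil {
				log.Printf("restart of r%d failed: %v", req.rangeID, err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	requests := make(chan restartRequest, 16)
	done := make(chan struct{})
	go runRestartWorker(ctx, requests, func(context.Context, restartRequest) error {
		close(done) // a real implementation would re-establish the rangefeed here
		return nil
	})

	// The detection path only enqueues; the worker does the rest.
	requests <- restartRequest{rangeID: 42, reason: errors.New("range feed disconnected")}
	<-done
}
```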
craig bot pushed a commit that referenced this issue Apr 4, 2023
100189:  kvcoord: Restart ranges on a dedicated goroutine. r=miretskiy a=miretskiy

Restart ranges on a dedicated goroutine (if needed).
Fix logic bug in stuck range handling.
Increase verbosity of logging to help debug mux rangefeed issues.

Informs #99560
Informs #99640
Informs #99214
Informs #98925
Informs #99092
Informs #99212
Informs #99910
Informs #99560

Release note: None

100525: rpc: Handle closed error r=erikgrinaker a=andrewbaptist

We close the listener before closing the connection. This can result in a spurious failure due to the Listener also closing our connection.

Epic: none
Fixes: #100391
Fixes: #77754
Informs: #80034

Release note: None
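The shutdown race described in 100525 is the usual "listener closed first" ordering problem: once the listener is closed, errors caused by the close itself should be treated as expected rather than as failures. A minimal Go sketch of the general idea, using the standard library's net.ErrClosed sentinel (an illustration only, not the actual rpc package change):

```go
package main

import (
	"errors"
	"log"
	"net"
)

// acceptLoop accepts connections until the listener is closed. Errors caused
// by the shutdown itself (net.ErrClosed) are treated as a clean exit rather
// than a failure, which is the kind of spurious error the fix tolerates.
func acceptLoop(ln net.Listener, handle func(net.Conn)) error {
	for {
		conn, err := ln.Accept()
		if err != nil {
			if errors.Is(err, net.ErrClosed) {
				return nil // listener closed during shutdown: not an error
			}
			return err
		}
		go handle(conn)
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	done := make(chan error, 1)
	go func() { done <- acceptLoop(ln, func(c net.Conn) { c.Close() }) }()

	// Shutdown order mirrors the description above: close the listener first;
	// the accept loop must not report that as a failure.
	ln.Close()
	if err := <-done; err != nil {
		log.Fatal(err)
	}
}
```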

100528: sql: fix flaky TestSQLStatsCompactor r=j82w a=j82w

The test failure shows more total wide scans
than expected. Change the compact stats job to run
once a year so that it does not run at the same time
as the test.

The interceptor is disabled right after the delete,
reducing the possibility of another operation
causing a conflict.

Epic: none
closes: #99653

Release note: none

100589: allocator: deflake full disk test r=andrewbaptist a=kvoli

In #97409 we introduced cluster settings to control the disk fullness thresholds for rebalancing towards a store and for shedding replicas off of a store. `TestAllocatorFullDisks` assumes the total number of range bytes is at or below the rebalance threshold of the nodes; however, the test was updated to use the shed threshold instead. This caused the test to flake occasionally, as there was more than the expected amount of total range bytes.

This patch changes the ranges-per-node calculation to use the rebalance threshold again, instead of the shed threshold.

```
dev test pkg/kv/kvserver/allocator/allocatorimpl -f TestAllocatorFullDisks -v --stress
...
15714 runs so far, 0 failures, over 39m45s
```

Fixes: #100033

Release note: None
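For intuition about the threshold mix-up in 100589, a toy Go sketch of the sizing arithmetic; the capacity and threshold values below are purely illustrative and are not the test's actual constants:

```go
package main

import "fmt"

// rangeByteBudget is how many bytes of range data a node can hold before its
// disk utilization crosses the given fullness threshold.
func rangeByteBudget(capacityBytes int64, threshold float64) int64 {
	return int64(float64(capacityBytes) * threshold)
}

func main() {
	const capacity = int64(100 << 30) // 100 GiB per node, illustrative
	rebalanceThreshold := 0.90        // stores above this stop receiving rebalances (illustrative)
	shedThreshold := 0.95             // stores above this start shedding replicas (illustrative)

	// Sizing the test's total range bytes against the higher shed threshold
	// admits more data than the rebalance threshold allows stores to absorb,
	// which is the occasional flake the patch removes by going back to the
	// rebalance threshold.
	fmt.Println("budget at rebalance threshold:", rangeByteBudget(capacity, rebalanceThreshold))
	fmt.Println("budget at shed threshold:     ", rangeByteBudget(capacity, shedThreshold))
}
```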

100610: roachtest: set config.Quiet to true r=herkolategan a=srosenberg

After refactoring in [1], the default of config.Quiet was set to false, since the roachprod CLI option is intended to set it to true. This resulted in an unwanted side effect: roachtests now ran with the new default. Consequently, test_runner's log ended up with a bunch of terminal escape codes from the (status) spinner.

This change ensures roachtest explicitly sets config.Quiet to true.

[1] #99133

Epic: none

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
Prior to this change, there were cases where a future
used to wait for a single range feed completion could
be completed multiple times, or a message about range
feed termination could be sent multiple times on a single
mux rangefeed stream.

One of those cases was a check for `ensureClosedTimestampStarted`.
If this method returned an error, we would immediately send
the error on the rpc stream, and then complete the future
with a nil error.

Another instance was when the registry would `DisconnectWithErr` --
in that case, we would first complete the future in this method,
and then complete it again later.

It appears that completing a future multiple times should be
okay; however, it is still a bit worrisome.  The deadlocks observed
were all in the local RPC bypass (`rpc/context.go`), and it's
not a stretch to imagine that as soon as the first error
(e.g. from ensureClosedTimestampStarted) is returned, the
goroutine reading these messages terminates, causing the
subsequent attempt to send the error to deadlock.

Another hypothetical issue is how the mux rangefeed sent
the error when the future completed.  Prior to this change, this
happened inline (via the `WhenReady` closure).  This is dangerous
since the closure may run while important locks (such as raft mu)
are held.  What could happen is that the mux rangefeed encounters
a retryable error.  The future is prepared with the error value,
which causes an error to be sent to the client.  This happens with
some lock held.  The client notices this error and attempts to
restart the rangefeed -- against the same server, and that could
block, at least in theory.  Regardless, performing IO while locks
could potentially be held is not a good idea.  This PR fixes the
problem by shunting the logical rangefeed completion notification
to a dedicated goroutine.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560

Release note: None
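Two of the patterns described in this commit message can be sketched compactly: completing a one-shot future at most once even if completion is attempted multiple times, and delivering the completion notification from a dedicated goroutine rather than inline while locks may be held. The sketch below is a generic illustration under those assumptions, not the actual future/kvcoord implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// errFuture is a one-shot completion future: Complete may be called many
// times (as the commit message notes can happen), but only the first call
// wins.
type errFuture struct {
	once sync.Once
	done chan error
}

func newErrFuture() *errFuture { return &errFuture{done: make(chan error, 1)} }

func (f *errFuture) Complete(err error) {
	f.once.Do(func() { f.done <- err })
}

func main() {
	f := newErrFuture()

	// The completion notification is delivered from a dedicated goroutine, so
	// the code path that completes the future (possibly while holding locks)
	// never performs the potentially blocking send to the client itself.
	notified := make(chan struct{})
	go func() {
		err := <-f.done
		fmt.Println("rangefeed completed:", err)
		close(notified)
	}()

	f.Complete(fmt.Errorf("retryable rangefeed error")) // first completion wins
	f.Complete(nil)                                     // ignored: already completed
	<-notified
}
```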
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 7, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 9, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 12, 2023
craig bot pushed a commit that referenced this issue Apr 13, 2023
100649: kvcoord: Rework error propagation in mux rangefeed r=miretskiy a=miretskiy

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Apr 13, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 13, 2023
@miretskiy (Contributor) commented:

@irfansharif were you able to reproduce this failure? It seems like nothing happened for two weeks. Why is that? Was there something you did on your side (disable metamorphic tests, perhaps)?

At any rate, I would love to know whether this issue reproduces with mux rangefeed given the recent changes.

@github-actions (bot) commented:

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.
