sql: potential deadlock around rangefeeds in TestQueryCache #99640

Closed
cockroach-teamcity opened this issue Mar 27, 2023 · 5 comments
cockroach-teamcity commented Mar 27, 2023

sql.TestQueryCache failed with artifacts on release-23.1 @ 39d947d7ba4a63908039bd03a51e030374ea6675:


goroutine 175144251 lock 0xc000971708
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1615 kvserver.(*Replica).sendLocalRaftMsg ??? <<<<<
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1614 kvserver.(*Replica).sendLocalRaftMsg ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1530 kvserver.(*Replica).sendRaftMessages ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1490 kvserver.(*replicaSyncCallback).OnLogSync ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/logstore/logstore.go:336 logstore.(*nonBlockingSyncWaiterCallback).run ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/logstore/sync_waiter.go:93 logstore.(*SyncWaiterLoop).waitLoop ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/logstore/sync_waiter.go:84 logstore.(*SyncWaiterLoop).Start.func1 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???

goroutine 204735041 lock 0xc0179b9b80
github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:85 go-deadlock.(*Mutex).Lock ??? <<<<<
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:81 kvcoord.(*txnLockGatekeeper).SendLocked ???
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:47 kvcoord.(*txnMetricRecorder).SendLocked ???
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:203 kvcoord.(*txnCommitter).SendLocked ???
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:230 kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts ???
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:154 kvcoord.(*txnSpanRefresher).SendLocked ???
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:290 kvcoord.(*txnPipeliner).SendLocked ???
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:104 kvcoord.(*txnSeqNumAllocator).SendLocked ???
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:246 kvcoord.(*txnHeartbeater).SendLocked ???
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:528 kvcoord.(*TxnCoordSender).Send ???
github.com/cockroachdb/cockroach/pkg/kv/db.go:1001 kv.(*DB).sendUsingSender ???
github.com/cockroachdb/cockroach/pkg/kv/txn.go:1059 kv.(*Txn).Send ???
github.com/cockroachdb/cockroach/pkg/kv/txn.go:684 kv.(*Txn).commit ???
github.com/cockroachdb/cockroach/pkg/kv/txn.go:699 kv.(*Txn).Commit ???
github.com/cockroachdb/cockroach/pkg/kv/txn.go:934 kv.(*Txn).exec ???
github.com/cockroachdb/cockroach/pkg/kv/db.go:966 kv.runTxn ???
github.com/cockroachdb/cockroach/pkg/kv/db.go:926 kv.(*DB).TxnWithAdmissionControl ???
github.com/cockroachdb/cockroach/pkg/kv/db.go:901 kv.(*DB).Txn ???
github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/storage.go:188 lease.storage.acquire ???
github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:543 lease.acquireNodeLease.func1 ???
github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:386 singleflight.(*Group).doCall.func1 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:305 stop.(*Stopper).RunTask ???
github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:385 singleflight.(*Group).doCall ???



=== RUN   TestQueryCache/group/simple-prepare-with-args
=== PAUSE TestQueryCache/group/simple-prepare-with-args
=== CONT  TestQueryCache/group/simple-prepare-with-args
=== RUN   TestQueryCache/group/prepare-hints
=== PAUSE TestQueryCache/group/prepare-hints
=== CONT  TestQueryCache/group/prepare-hints
=== RUN   TestQueryCache/group/simple
=== PAUSE TestQueryCache/group/simple
=== CONT  TestQueryCache/group/simple
=== RUN   TestQueryCache/group/statschange
=== PAUSE TestQueryCache/group/statschange
=== CONT  TestQueryCache/group/statschange

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-25998

Epic CRDB-11732

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Mar 27, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 27, 2023
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Mar 27, 2023
@yuzefovich

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 190368750 lock 0xc02d7f90d8
github.com/cockroachdb/cockroach/pkg/server/node.go:1493 server.(*lockedMuxStream).Send ??? <<<<<
github.com/cockroachdb/cockroach/pkg/server/node.go:1492 server.(*lockedMuxStream).Send ???
github.com/cockroachdb/cockroach/pkg/server/node.go:1475 server.(*setRangeIDEventSink).Send ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:78 kvserver.(*lockedRangefeedStream).Send ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:330 rangefeed.(*registration).outputLoop ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:351 rangefeed.(*registration).runOutputLoop ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/processor.go:321 rangefeed.(*Processor).run.func1 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???

Have been trying to lock it again for more than 5m0s
goroutine 190529199 lock 0xc02d7f90d8
github.com/cockroachdb/cockroach/pkg/server/node.go:1493 server.(*lockedMuxStream).Send ??? <<<<<
github.com/cockroachdb/cockroach/pkg/server/node.go:1492 server.(*lockedMuxStream).Send ???
github.com/cockroachdb/cockroach/pkg/server/node.go:1475 server.(*setRangeIDEventSink).Send ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:78 kvserver.(*lockedRangefeedStream).Send ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:330 rangefeed.(*registration).outputLoop ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:351 rangefeed.(*registration).runOutputLoop ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/processor.go:321 rangefeed.(*Processor).run.func1 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???

Here is what goroutine 190368750 doing now
goroutine 190368750 [select, 5 minutes]:
github.com/cockroachdb/cockroach/pkg/rpc.(*pipe).send(0x4595e20?, {0x5fe3640, 0xc01285bec0}, {0x4b3d7a0?, 0xc025d9a040?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:1414 +0x97
github.com/cockroachdb/cockroach/pkg/rpc.pipeWriter.send(...)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:1452
github.com/cockroachdb/cockroach/pkg/rpc.serverStream.SendMsg(...)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:1540
github.com/cockroachdb/cockroach/pkg/rpc.muxRangeFeedServerAdapter.Send(...)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:1373
github.com/cockroachdb/cockroach/pkg/server.(*lockedMuxStream).Send(0xc02d7f90c8, 0xc006dff040?)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1495 +0xb2
github.com/cockroachdb/cockroach/pkg/server.(*setRangeIDEventSink).Send(0xc02d428cf0, 0xc0131cec00)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1476 +0x93
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*lockedRangefeedStream).Send(0xc0118fa468, 0xc07bc06de4?)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:79 +0xb2
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*registration).outputLoop(0xc00864cb40, {0x5fe3598, 0xc05f189b40})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:328 +0x29e
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*registration).runOutputLoop(0xc00864cb40, {0x5fe3640, 0xc037c95e90}, 0xc068e39620?)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:351 +0xc5
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*Processor).run.func1({0x5fe3640?, 0xc037c95e90?})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/processor.go:320 +0x4c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:461 +0x43b
...

cc @miretskiy
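
For illustration, here is a minimal, self-contained Go sketch of the lock pattern in the report above (hypothetical types, not the CockroachDB code): a mutex-guarded stream whose Send can block indefinitely while holding the mutex, so every other rangefeed registration sharing the stream piles up behind it. In the traces above, the goroutine holding the lock has been stuck in `rpc.(*pipe).send` for more than five minutes, which matches the second case here.

```
// Sketch only: hypothetical types, not the CockroachDB implementation.
package main

import (
	"fmt"
	"sync"
	"time"
)

// lockedStream serializes Send calls on a shared stream, similar in spirit
// to the lockedMuxStream wrapper seen in the stacks above.
type lockedStream struct {
	mu   sync.Mutex
	pipe chan string // stands in for a local RPC pipe with no active reader
}

func (s *lockedStream) Send(ev string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pipe <- ev // blocks while holding s.mu if nobody drains the pipe
}

func main() {
	s := &lockedStream{pipe: make(chan string)} // unbuffered, never read

	go s.Send("event from registration A") // grabs the mutex, then blocks inside Send

	time.Sleep(100 * time.Millisecond)

	done := make(chan struct{})
	go func() {
		s.Send("event from registration B") // blocks waiting for the mutex
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("second Send completed")
	case <-time.After(time.Second):
		fmt.Println("second Send still waiting: mutex held by a blocked Send")
	}
}
```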

@yuzefovich yuzefovich changed the title from "sql: TestQueryCache failed" to "sql: potential deadlock around rangefeeds in TestQueryCache" Mar 27, 2023
@yuzefovich yuzefovich added A-kv-replication Relating to Raft, consensus, and coordination. and removed T-sql-queries SQL Queries Team labels Mar 27, 2023

blathers-crl bot commented Mar 27, 2023

cc @cockroachdb/replication

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 30, 2023
Make sure mux rangefeed uses correct context if it needs
to restart range feeds.

Fix catchup reservation metric accounting in mux rangefeed.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560

Release note: None
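
For illustration, a hedged sketch (hypothetical names, not the actual kvcoord code) of the kind of context issue the commit message above alludes to: restarting a rangefeed with a context that was derived for the failed attempt, and may already be canceled, instead of deriving a fresh one from the long-lived mux stream context.

```
// Sketch only: hypothetical names, assumed bug class, not the kvcoord code.
package main

import (
	"context"
	"fmt"
)

// runSingleRangeFeed pretends to run one per-range feed until it fails.
func runSingleRangeFeed(ctx context.Context, rangeID int) error {
	return fmt.Errorf("r%d: transient error", rangeID)
}

func main() {
	streamCtx, cancelStream := context.WithCancel(context.Background())
	defer cancelStream()

	// Per-attempt context; canceled as soon as the attempt finishes.
	attemptCtx, cancelAttempt := context.WithCancel(streamCtx)
	err := runSingleRangeFeed(attemptCtx, 42)
	cancelAttempt()
	fmt.Println("attempt failed:", err)

	// Buggy restart: reuses the canceled attempt context, so the retry is
	// dead on arrival.
	fmt.Println("restart with attemptCtx:", attemptCtx.Err()) // context canceled

	// Intended restart: derive a fresh context from the long-lived stream
	// context, which is still alive.
	retryCtx, cancelRetry := context.WithCancel(streamCtx)
	defer cancelRetry()
	fmt.Println("restart with retryCtx:", retryCtx.Err()) // <nil>
}
```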
craig bot pushed a commit that referenced this issue Apr 4, 2023
100189:  kvcoord: Restart ranges on a dedicated goroutine. r=miretskiy a=miretskiy

Restart ranges on a dedicated goroutine (if needed).
Fix logic bug in stuck range handling.
Increase verbosity of logging to help debug mux rangefeed issues.

Informs #99560
Informs #99640
Informs #99214
Informs #98925
Informs #99092
Informs #99212
Informs #99910
Informs #99560

Release note: None

100525: rpc: Handle closed error r=erikgrinaker a=andrewbaptist

We close the listener before closing the connection. This can result in a spurious failure due to the Listener also closing our connection.

Epic: none
Fixes: #100391
Fixes: #77754
Informs: #80034

Release note: None

100528: sql: fix flaky TestSQLStatsCompactor r=j82w a=j82w

The test failure shows more total wide scans than expected. Change the compact stats job to run once a year to avoid it running at the same time as the test.

The interceptor is disabled right after the delete, reducing the possibility of another operation causing a conflict.

Epic: none
closes: #99653

Release note: none

100589: allocator: deflake full disk test r=andrewbaptist a=kvoli

In #97409 we introduced cluster settings to control the disk fullness threshold for rebalancing towards a store and shedding replicas off of the store. `TestAllocatorFullDisks` assumes the total number of range bytes is less than or equal to the rebalance threshold of the nodes; however, the test was updated to use the shed threshold instead. This caused the test to flake occasionally, as there was more than the expected amount of total range bytes.

This patch changes the ranges-per-node calculation to use the rebalance threshold again, instead of the shed threshold.

```
dev test pkg/kv/kvserver/allocator/allocatorimpl -f TestAllocatorFullDisks -v --stress
...
15714 runs so far, 0 failures, over 39m45s
```

Fixes: #100033

Release note: None

100610: roachtest: set config.Quiet to true r=herkolategan a=srosenberg

After the refactoring in [1], the default of config.Quiet was set to false, since the roachprod CLI option is intended to set it to true. This resulted in an unwanted side effect: roachtests ran with the new default, and consequently test_runner's log ended up with a bunch of terminal escape codes from the (status) spinner.

This change ensures roachtest explicitly sets config.Quiet to true.

[1] #99133

Epic: none

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
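
For illustration, a minimal sketch (hypothetical structure, not the kvcoord implementation) of the "restart ranges on a dedicated goroutine" idea from 100189 above: the event-processing path only enqueues restart requests, and a separate goroutine performs the potentially blocking restart work.

```
// Sketch only: hypothetical structure, not the kvcoord implementation.
package main

import (
	"fmt"
	"sync"
)

type restartRequest struct{ rangeID int }

func main() {
	restartCh := make(chan restartRequest, 16)
	var wg sync.WaitGroup

	// Dedicated restarter goroutine: the only place where potentially
	// blocking restart work runs.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for req := range restartCh {
			// In real code this would re-establish the rangefeed; here we
			// just log it.
			fmt.Printf("restarting rangefeed for r%d\n", req.rangeID)
		}
	}()

	// Event-processing path: hands restart work off instead of doing it
	// inline, so it cannot get stuck behind a slow or blocked restart.
	for _, rangeID := range []int{1, 2, 3} {
		restartCh <- restartRequest{rangeID: rangeID}
	}
	close(restartCh)
	wg.Wait()
}
```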
@nicktrav

Updating the T- and A- labels to better reflect current ownership.

@nicktrav nicktrav added A-cdc Change Data Capture T-cdc and removed A-kv-replication Relating to Raft, consensus, and coordination. GA-blocker labels Apr 12, 2023

blathers-crl bot commented Apr 12, 2023

cc @cockroachdb/cdc

craig bot pushed a commit that referenced this issue Apr 13, 2023
100649: kvcoord: Rework error propagation in mux rangefeed r=miretskiy a=miretskiy

Prior to this change, there were cases where a future used to wait for a single rangefeed completion could be completed multiple times, or where a message about rangefeed termination could be sent multiple times on a single mux rangefeed stream.

One of those cases was a check for `ensureClosedTimestampStarted`. If this method returned an error, we would immediately send the error on the RPC stream, and then complete the future with a nil error.

Another instance was when the registry would `DisconnectWithErr` -- in that case, we would first complete the future in this method, and then complete it again later.

It appears that completing a future multiple times should be okay; however, it is still a bit worrisome. The deadlocks observed were all in the local RPC bypass (`rpc/context.go`), and it's not a stretch to imagine that as soon as the first error (e.g. from `ensureClosedTimestampStarted`) is returned, the goroutine reading these messages terminates, causing the subsequent attempt to send the error to deadlock.

Another hypothetical issue is how the mux rangefeed sent the error when the future completed. Prior to this change, this happened inline (via a `WhenReady` closure). This is dangerous since the closure may run while important locks (such as raft mu) are held. What could happen is that the mux rangefeed encounters a retryable error; the future is prepared with the error value, which causes an error to be sent to the client while some lock is held. The client notices this error and attempts to restart the rangefeed against the same server, and that could block, at least in theory. Regardless, performing I/O while locks could potentially be held is not a good idea. This PR fixes the problem by shunting the logical rangefeed completion notification to a dedicated goroutine.

Informs #99560
Informs #99640
Informs #99214
Informs #98925
Informs #99092
Informs #99212
Informs #99910
Informs #99560

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
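
For illustration, a simplified sketch (hypothetical API, not the CockroachDB future package) of the fix described in 100649 above: rather than sending the termination error inline from a `WhenReady`-style closure while locks such as raft mu may be held, the completion is handed off to a dedicated goroutine that performs the send outside any lock.

```
// Sketch only: hypothetical API, not the CockroachDB future package.
package main

import (
	"fmt"
	"sync"
)

type completion struct {
	streamID int
	err      error
}

func main() {
	completions := make(chan completion, 16)
	var wg sync.WaitGroup

	// Dedicated goroutine: the only place that talks to the (possibly slow
	// or blocked) client stream.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for c := range completions {
			fmt.Printf("stream %d: sending termination error: %v\n", c.streamID, c.err)
		}
	}()

	// Rangefeed code path: may run under raftMu or other locks, so it only
	// enqueues the notification and returns immediately instead of doing I/O.
	var raftMu sync.Mutex
	raftMu.Lock()
	completions <- completion{streamID: 7, err: fmt.Errorf("retry: no leaseholder")}
	raftMu.Unlock()

	close(completions)
	wg.Wait()
}
```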
@miretskiy

23.1 has mux rangefeed (muxrf) disabled for now; closing this issue (muxrf is still enabled on master).
