sql: potential deadlock around rangefeeds in TestQueryCache #99640
Labels
A-cdc: Change Data Capture
branch-release-23.1: Used to mark GA and release blockers, technical advisories, and bugs for 23.1
C-test-failure: Broken test (automatically or manually discovered).
O-robot: Originated from a bot.
T-cdc
Comments
cockroach-teamcity added the branch-release-23.1, C-test-failure, and O-robot labels on Mar 27, 2023
cc @miretskiy
yuzefovich changed the title from "sql: TestQueryCache failed" to "sql: potential deadlock around rangefeeds in TestQueryCache" on Mar 27, 2023
yuzefovich added the A-kv-replication (Relating to Raft, consensus, and coordination.) label and removed the T-sql-queries (SQL Queries Team) label on Mar 27, 2023
cc @cockroachdb/replication
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Mar 30, 2023:

Make sure mux rangefeed uses correct context if it needs to restart range feeds. Fix catchup reservation metric accounting in mux rangefeed.
Informs cockroachdb#99560, cockroachdb#99640, cockroachdb#99214, cockroachdb#98925, cockroachdb#99092, cockroachdb#99212, cockroachdb#99910.
Release note: None
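The first sentence of that commit message is about context plumbing. Below is a minimal sketch of the pitfall, with hypothetical names (muxClient, startSingleRangeFeed, and runWithRestart are illustrative, not CockroachDB's actual types): a restart has to be derived from the long-lived stream context rather than from the context of the attempt that just failed, which may already be canceled.

```go
// Sketch only: restart attempts are derived from the long-lived stream
// context, so the cancellation of a failed attempt never poisons the restart.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type muxClient struct {
	streamCtx context.Context // long-lived; canceled only when the whole mux stream shuts down
}

// startSingleRangeFeed stands in for starting one per-range feed.
func (c *muxClient) startSingleRangeFeed(ctx context.Context, rangeID int) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(10 * time.Millisecond):
		return errors.New("retryable: range split") // simulate a retryable failure
	}
}

func (c *muxClient) runWithRestart(rangeID int) {
	for {
		// Each attempt gets its own child context derived from the stream
		// context, not from the previous (possibly canceled) attempt.
		attemptCtx, cancel := context.WithCancel(c.streamCtx)
		err := c.startSingleRangeFeed(attemptCtx, rangeID)
		cancel()
		if err == nil || c.streamCtx.Err() != nil {
			return // clean exit, or the whole stream is shutting down
		}
		fmt.Printf("restarting r%d after error: %v\n", rangeID, err)
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	(&muxClient{streamCtx: ctx}).runWithRestart(1)
}
```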
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Mar 31, 2023, with the same commit message as the Mar 30 push.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Apr 3, 2023, with the same commit message plus Informs cockroachdb#100468.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Apr 4, 2023, again with the same commit message (including cockroachdb#100468).
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Apr 4, 2023:

Restart ranges on a dedicated goroutine (if needed). Fix logic bug in stuck range handling. Increase verbosity of logging to help debug mux rangefeed issues.
Informs cockroachdb#99560, cockroachdb#99640, cockroachdb#99214, cockroachdb#98925, cockroachdb#99092, cockroachdb#99212, cockroachdb#99910, cockroachdb#100468.
Release note: None
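The "dedicated goroutine" part of that commit message is, at its core, a hand-off pattern. A rough sketch under that reading, again with hypothetical names rather than the actual kvcoord code: the event-processing loop enqueues restart requests on a channel and a single worker drains them, so the loop itself never blocks on restart work.

```go
// Sketch only: restarts are handed off to a dedicated goroutine instead of
// being performed inline by the event-processing loop.
package main

import (
	"fmt"
	"sync"
	"time"
)

type restartRequest struct{ rangeID int }

func main() {
	restartC := make(chan restartRequest, 16)
	var wg sync.WaitGroup

	// Dedicated restarter goroutine: the only place slow restart work happens.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for req := range restartC {
			time.Sleep(20 * time.Millisecond) // stand-in for re-dialing / catch-up setup
			fmt.Printf("restarted r%d\n", req.rangeID)
		}
	}()

	// Event loop: on a retryable error it enqueues a restart and keeps going,
	// instead of blocking in-line while it may still hold resources.
	for rangeID := 1; rangeID <= 3; rangeID++ {
		select {
		case restartC <- restartRequest{rangeID: rangeID}:
		default:
			fmt.Printf("restart queue full, deferring r%d to a later retry\n", rangeID)
		}
	}
	close(restartC)
	wg.Wait()
}
```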
craig bot pushed a commit that referenced this issue on Apr 4, 2023:

100189: kvcoord: Restart ranges on a dedicated goroutine. r=miretskiy a=miretskiy
Restart ranges on a dedicated goroutine (if needed). Fix logic bug in stuck range handling. Increase verbosity of logging to help debug mux rangefeed issues.
Informs #99560, #99640, #99214, #98925, #99092, #99212, #99910.
Release note: None

100525: rpc: Handle closed error r=erikgrinaker a=andrewbaptist
We close the listener before closing the connection. This can result in a spurious failure because the listener also closes our connection.
Epic: none. Fixes: #100391. Fixes: #77754. Informs: #80034.
Release note: None

100528: sql: fix flaky TestSQLStatsCompactor r=j82w a=j82w
The test failure shows more total wide scans than expected. Change the compact-stats job to run once a year so it does not run at the same time as the test. The interceptor is disabled right after the delete, reducing the possibility of another operation causing a conflict.
Epic: none. Closes: #99653.
Release note: None

100589: allocator: deflake full disk test r=andrewbaptist a=kvoli
In #97409 we introduced cluster settings to control the disk-fullness thresholds for rebalancing towards a store and for shedding replicas off of it. TestAllocatorFullDisks assumes the total number of range bytes is at most the rebalance threshold of the nodes, but the test had been updated to use the shed threshold instead. This caused the test to flake occasionally because there were more total range bytes than expected. This patch changes the ranges-per-node calculation back to the rebalance threshold instead of the shed threshold.
```
dev test pkg/kv/kvserver/allocator/allocatorimpl -f TestAllocatorFullDisks -v --stress
...
15714 runs so far, 0 failures, over 39m45s
```
Fixes: #100033.
Release note: None

100610: roachtest: set config.Quiet to true r=herkolategan a=srosenberg
After the refactoring in [1], the default of config.Quiet was set to false, since the roachprod CLI option is intended to set it to true. This had the unwanted side effect that roachtests ran with the new default, so test_runner's log ended up full of terminal escape codes from the status spinner. This change ensures roachtest explicitly sets config.Quiet to true.
[1] #99133
Epic: none.
Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
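PR 100525 in that merge describes a shutdown race in which closing the listener also tears down the connection and produces a spurious error. The standard-library idiom for tolerating that kind of race is to treat net.ErrClosed as expected during shutdown; the sketch below is illustrative only and is not the CockroachDB rpc code.

```go
// Sketch only: recognize listener-closed errors as a normal part of shutdown
// rather than a test failure.
package main

import (
	"errors"
	"fmt"
	"net"
)

func serve(l net.Listener, done chan<- struct{}) {
	defer close(done)
	for {
		conn, err := l.Accept()
		if err != nil {
			if errors.Is(err, net.ErrClosed) {
				fmt.Println("listener closed; shutting down cleanly")
				return
			}
			fmt.Println("unexpected accept error:", err)
			return
		}
		conn.Close()
	}
}

func main() {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	done := make(chan struct{})
	go serve(l, done)
	l.Close() // shutdown path: closing the listener races with Accept
	<-done
}
```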
This was referenced Apr 4, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Apr 4, 2023, with the same "restart ranges on a dedicated goroutine" commit message as above.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Apr 4, 2023:

Prior to this change, there were cases where a future used to wait for a single rangefeed's completion could be completed multiple times, or where a message about rangefeed termination could be sent multiple times on a single mux rangefeed stream.

One of those cases was the check for `ensureClosedTimestampStarted`. If this method returned an error, we would immediately send the error on the RPC stream and then complete the future with a nil error. Another instance was when the registry would `DisconnectWithErr`: in that case we would first complete the future in this method and then complete it again later. Completing a future multiple times appears to be okay; however, it is still a bit worrisome. The deadlocks observed were all in the local RPC bypass (`rpc/context.go`), and it is not a stretch to imagine that as soon as the first error (e.g. from ensureClosedTimestampStarted) is returned, the goroutine reading these messages terminates, causing the subsequent attempt to send the error to deadlock.

Another hypothetical issue is how the mux rangefeed sent the error when the future completed. Prior to this change, this happened inline (via the `WhenReady` closure). This is dangerous since the closure may run while important locks (such as the raft mu) are held. What could happen is that the mux rangefeed encounters a retryable error; the future is prepared with the error value, which causes an error to be sent to the client while some lock is held; the client notices the error and attempts to restart the rangefeed against the same server, and that could block, at least in theory. Regardless, performing IO while locks could potentially be held is not a good idea.

This PR fixes the problem by shunting the logical rangefeed completion notification to a dedicated goroutine.

Informs cockroachdb#99560, cockroachdb#99640, cockroachdb#99214, cockroachdb#98925, cockroachdb#99092, cockroachdb#99212, cockroachdb#99910.
Release note: None
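Reduced to its essentials, the fix described above has two parts: complete the per-rangefeed future at most once, and deliver the resulting termination message from a dedicated goroutine rather than inline in the completion callback, which may run while locks such as the raft mu are held. The sketch below uses hypothetical types (future, completion) and is not the actual kvcoord implementation.

```go
// Sketch only: a future that resolves at most once, plus a dedicated sender
// goroutine that performs the potentially blocking notification IO.
package main

import (
	"fmt"
	"sync"
)

// future resolves with an error at most once, no matter how many code paths
// try to complete it.
type future struct {
	once sync.Once
	done chan error
}

func newFuture() *future { return &future{done: make(chan error, 1)} }

func (f *future) complete(err error) {
	f.once.Do(func() { f.done <- err })
}

type completion struct {
	rangeID int
	err     error
}

func main() {
	notifyC := make(chan completion, 16)
	var wg sync.WaitGroup

	// Dedicated goroutine: the only place that sends termination events back
	// to the client, so no lock-holding code path does the send itself.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for c := range notifyC {
			fmt.Printf("r%d terminated: %v\n", c.rangeID, c.err)
		}
	}()

	f := newFuture()
	// Two racing completion paths (e.g. ensureClosedTimestampStarted failing
	// and the registry disconnecting); only the first one wins.
	f.complete(fmt.Errorf("closed timestamp not started"))
	f.complete(nil)

	// The completion callback only forwards to the sender goroutine; it does
	// no IO while potentially holding locks.
	err := <-f.done
	notifyC <- completion{rangeID: 1, err: err}

	close(notifyC)
	wg.Wait()
}
```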
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Apr 7, 2023, with the same commit message as above.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Apr 9, 2023, with the same commit message.
nicktrav added the A-cdc (Change Data Capture) and T-cdc labels and removed the A-kv-replication (Relating to Raft, consensus, and coordination.) and GA-blocker labels on Apr 12, 2023
cc @cockroachdb/cdc
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue on Apr 12, 2023, with the same commit message.
craig bot pushed a commit that referenced this issue on Apr 13, 2023:

100649: kvcoord: Rework error propagation in mux rangefeed r=miretskiy a=miretskiy
(Same commit message as above.)
Release note: None
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
blathers-crl bot pushed a commit that referenced this issue on Apr 13, 2023, with the same commit message.
blathers-crl bot pushed another commit that referenced this issue on Apr 13, 2023, also with the same commit message.
23.1 has mux rangefeed disabled for now; closing this issue (mux rangefeed is still enabled on master).
Original failure report from cockroach-teamcity:

sql.TestQueryCache failed with artifacts on release-23.1 @ 39d947d7ba4a63908039bd03a51e030374ea6675:

Parameters: TAGS=bazel,gss,deadlock

Help: See also How To Investigate a Go Test Failure (internal). Same failure on other branches. This test on roachdash | Improve this report!

Jira issue: CRDB-25998
Epic: CRDB-11732
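The TAGS=bazel,gss,deadlock parameter above means the failing run was built with the deadlock build tag, under which, as far as I understand the tree, CockroachDB's syncutil mutexes are backed by github.com/sasha-s/go-deadlock so that lock-order inversions and stuck lock acquisitions are reported. A minimal sketch of that build-tag pattern, assuming the go-deadlock library and not reproducing the actual syncutil source:

```go
//go:build deadlock

// Sketch only: under the deadlock build tag, a package-local Mutex is backed
// by go-deadlock instead of sync.Mutex (a non-tagged counterpart file would
// alias the standard-library types).
package syncwrap

import (
	"time"

	deadlock "github.com/sasha-s/go-deadlock"
)

// Mutex is a drop-in replacement for sync.Mutex with deadlock detection.
type Mutex struct {
	deadlock.Mutex
}

// RWMutex is the read-write variant.
type RWMutex struct {
	deadlock.RWMutex
}

func init() {
	// Report a potential deadlock if acquiring a lock blocks for longer than
	// this; a generous value keeps slow-but-progressing tests quiet.
	deadlock.Opts.DeadlockTimeout = 5 * time.Minute
}
```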