
release-23.1.0: kvcoord: Rework error propagation in mux rangefeed #101406

Merged
Merged 1 commit into release-23.1.0 from blathers/backport-release-23.1.0-100649 on Apr 13, 2023

Conversation

@blathers-crl blathers-crl bot commented Apr 13, 2023

Backport 1/1 commits from #100649 on behalf of @miretskiy.

/cc @cockroachdb/release


Prior to this change, there were cases where a future used to wait for a single rangefeed completion could be completed multiple times, or where a message about rangefeed termination could be sent multiple times on a single mux rangefeed stream.

One of those cases was the check for `ensureClosedTimestampStarted`. If this method returned an error, we would immediately send the error on the RPC stream and then complete the future with a nil error.

Another instance was when the registry would call `DisconnectWithErr` -- in that case, we would first complete the future in this method, and then complete it again later.
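
For illustration, here is a minimal Go sketch of the single-completion discipline this description assumes (hypothetical code, not the actual `util/future` implementation; the `errFuture` type and its `Set`/`Wait` methods are made up): only the first completion value is recorded, and later attempts are no-ops.

```go
package main

import (
	"fmt"
	"sync"
)

// errFuture is a hypothetical single-value future: the first Set wins,
// every later Set is silently ignored.
type errFuture struct {
	once sync.Once
	done chan struct{}
	err  error
}

func newErrFuture() *errFuture {
	return &errFuture{done: make(chan struct{})}
}

// Set records the first completion value; subsequent calls are no-ops.
func (f *errFuture) Set(err error) {
	f.once.Do(func() {
		f.err = err
		close(f.done)
	})
}

// Wait blocks until the future is completed and returns its value.
func (f *errFuture) Wait() error {
	<-f.done
	return f.err
}

func main() {
	f := newErrFuture()
	f.Set(fmt.Errorf("closed timestamp not started")) // first completion wins
	f.Set(nil)                                        // ignored
	fmt.Println(f.Wait())
}
```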

It appears that completing a future multiple times should be okay; however, it is still a bit worrisome. The deadlocks observed were all in the local RPC bypass (`rpc/context.go`), and it's not a stretch to imagine that as soon as the first error (e.g. from `ensureClosedTimestampStarted`) is returned, the goroutine reading these messages terminates, causing the subsequent attempt to send the error to deadlock.
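
A minimal sketch of that suspected failure mode, with a plain unbuffered channel standing in for the local RPC bypass stream (all names here are illustrative, not from the actual code): once the reader stops after the first error, any further unconditional send on the same stream blocks forever.

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	stream := make(chan error) // unbuffered, like a synchronous stream send
	consumerDone := make(chan struct{})

	// Consumer: stops reading as soon as it sees the first error.
	go func() {
		defer close(consumerDone)
		fmt.Println("consumer saw:", <-stream)
	}()

	stream <- errors.New("ensureClosedTimestampStarted failed") // delivered
	<-consumerDone                                              // consumer has exited; nobody reads anymore

	// A second termination message for the same stream now has no reader.
	// In the real system this send would block forever; the select with a
	// default branch is only here so the sketch itself terminates.
	select {
	case stream <- errors.New("duplicate termination message"):
		fmt.Println("unexpectedly delivered")
	default:
		fmt.Println("no reader left: an unconditional send here would deadlock")
	}
}
```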

Another hypothetical issue is how the mux rangefeed sent the error when the future completed. Prior to this change, this happened inline (via the `WhenReady` closure). This is dangerous since the closure may run while important locks (such as the raft mu) are held. What could happen is that the mux rangefeed encounters a retryable error; the future is prepared with the error value, which causes an error to be sent to the client while some lock is held. The client notices this error and attempts to restart the rangefeed against the same server, and that could block, at least in theory. Regardless, performing IO while locks could potentially be held is not a good idea. This PR fixes this problem by shunting the logical rangefeed completion notification to a dedicated goroutine.
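
A sketch of the general shape of that fix (the `rangeFeedCompletion` type, the `notify` helper, and the `muxStream.Send` stand-in are illustrative, not the actual kvcoord API): the completion callback only enqueues an event, and a dedicated goroutine performs the stream I/O, so no send happens while locks are held.

```go
package main

import (
	"errors"
	"fmt"
)

type rangeFeedCompletion struct {
	rangeID int64
	err     error
}

func main() {
	completions := make(chan rangeFeedCompletion, 16)
	done := make(chan struct{})

	// Dedicated goroutine: the only place that touches the client stream,
	// so stream I/O never happens under raft mu or other locks.
	go func() {
		defer close(done)
		for c := range completions {
			// Stand-in for something like muxStream.Send(...).
			fmt.Printf("send termination for r%d: %v\n", c.rangeID, c.err)
		}
	}()

	// Completion callback (e.g. the WhenReady closure) just enqueues.
	notify := func(rangeID int64, err error) {
		completions <- rangeFeedCompletion{rangeID: rangeID, err: err}
	}

	notify(17, errors.New("retryable rangefeed error"))
	notify(42, nil)

	close(completions)
	<-done
}
```

The design point is the same as described above: the `WhenReady`-style callback becomes cheap and lock-safe, while per-stream delivery order is preserved by the single draining goroutine.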

Informs #99560
Informs #99640
Informs #99214
Informs #98925
Informs #99092
Informs #99212
Informs #99910

Release note: None


Release justification: bug fixes to functionality that is disabled by default

@blathers-crl blathers-crl bot requested a review from a team April 13, 2023 00:59
@blathers-crl blathers-crl bot requested a review from a team as a code owner April 13, 2023 00:59
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1.0-100649 branch from 63a2592 to 10ee447 on April 13, 2023 00:59
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Apr 13, 2023
@blathers-crl
Author

blathers-crl bot commented Apr 13, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Member

This change is Reviewable

@miretskiy miretskiy merged commit 2122344 into release-23.1.0 Apr 13, 2023
@miretskiy miretskiy deleted the blathers/backport-release-23.1.0-100649 branch April 13, 2023 10:50