
release-23.1.0: flowinfra: fix a rare bug that could make drain be stuck forever #101884

Merged
merged 2 commits into release-23.1.0 from blathers/backport-release-23.1.0-100761 on Apr 20, 2023

Conversation


@blathers-crl blathers-crl bot commented Apr 19, 2023

Backport 2/2 commits from #100761 on behalf of @yuzefovich.

/cc @cockroachdb/release


This commit fixes a long-standing bug around the flow registry that
could leave the drain loop stuck forever. The drain process works by
draining several components in a loop until each component reports that
it had no remaining work when the drain iteration was initiated.
One of the components to be drained is the flow registry: namely, we
want to make sure that there are no remote flows left on the
node. We track that with a map from the `FlowID` to the `flowEntry`
object.
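
As a rough illustration of this bookkeeping, here is a minimal, self-contained Go sketch. The type and field names (`flowsketch`, `FlowID`, `flowEntry`, `flowRegistry`, `numRemainingFlows`) are simplified stand-ins, not the actual definitions in `pkg/sql/flowinfra`:

```go
package flowsketch

import "sync"

// FlowID stands in for the real flow identifier type.
type FlowID string

// flowEntry is a simplified placeholder for the per-flow bookkeeping object;
// refCount tracks how many parties currently need the entry to stay in the map.
type flowEntry struct {
	refCount int
}

// flowRegistry tracks the remote flows known to this node.
type flowRegistry struct {
	mu    sync.Mutex
	flows map[FlowID]*flowEntry
}

// numRemainingFlows is what a drain iteration would consult: the drain loop
// can only report "no remaining work" once no remote flows stay registered.
func (r *flowRegistry) numRemainingFlows() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.flows)
}
```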

Previously, it was possible for a `flowEntry` to become "stale" and
remain in the map forever. In particular, this was the case when

  • `ConnectInboundStream` was called before the flow was scheduled
  • the gRPC "handshake" failed in `ConnectInboundStream` (most likely due
    to a network fluke)
  • the flow never arrived (perhaps it was canceled before
    `Flow.StartInternal` was called), or it arrived too late, when the
    registry was already marked as "draining".

In such a scenario we would create a `flowEntry` with a ref count of
zero and add it to the map in `ConnectInboundStream`, but no one would
ever remove it. This commit fixes that oversight by adjusting the
ref-counting logic so that we always hold a reference throughout (and
only until the end of) `ConnectInboundStream`.
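
A sketch of the adjusted ref counting, reusing the simplified `flowRegistry`/`flowEntry` types from the sketch above (the `handshake` parameter is a placeholder; the real `ConnectInboundStream` signature, locking, and error handling differ):

```go
// ConnectInboundStream (simplified) now takes a reference on the entry for
// its entire duration and drops it on the way out. If that reference was the
// last one, e.g. because the handshake failed or the flow never arrived,
// the entry is deleted instead of lingering in the map with a zero ref count.
func (r *flowRegistry) ConnectInboundStream(id FlowID, handshake func() error) error {
	r.mu.Lock()
	entry, ok := r.flows[id]
	if !ok {
		if r.flows == nil {
			r.flows = make(map[FlowID]*flowEntry)
		}
		entry = &flowEntry{}
		r.flows[id] = entry
	}
	// Hold a reference before unlocking so the entry cannot disappear while
	// this call is in flight.
	entry.refCount++
	r.mu.Unlock()

	// Always release the reference at the end of the call.
	defer func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		entry.refCount--
		if entry.refCount == 0 {
			delete(r.flows, id)
		}
	}()

	// Placeholder for waiting on the flow and performing the gRPC handshake.
	return handshake()
}
```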

Fixes: #100710.

Release note (bug fix): A rare bug in the shutdown of distributed plans has
been fixed that could previously make the graceful drain of cockroach
nodes retry forever. The bug has been present since before 22.1.
The drain process is affected by this bug if you see messages in the
logs like `drain details: distSQL execution flows:` with a non-zero number
of flows that does not decrease over a long period of time.


Release justification: bug fix.

This commit also fixes a now-stale comment and allocates one map with an
exact size.

Release note: None
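
Purely illustrative of "allocates one map with an exact size" (the actual map and count touched by the commit may differ), again reusing the sketch types from above:

```go
// makeFlowMap sizes the map once via the capacity hint when the number of
// entries is known up front, avoiding incremental growth and rehashing.
func makeFlowMap(numFlows int) map[FlowID]*flowEntry {
	return make(map[FlowID]*flowEntry, numFlows)
}
```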
@blathers-crl blathers-crl bot requested a review from a team April 19, 2023 23:15
@blathers-crl blathers-crl bot requested review from a team as code owners April 19, 2023 23:15
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1.0-100761 branch from 76173fa to b39beaf on April 19, 2023 23:15
@blathers-crl blathers-crl bot requested a review from cucaroach April 19, 2023 23:15
@blathers-crl blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels on Apr 19, 2023

blathers-crl bot commented Apr 19, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot requested a review from msirek April 19, 2023 23:15
@cockroach-teamcity

This change is Reviewable

@yuzefovich

I thought that the bug this PR fixes is pretty rare, but it seems likely that we're hitting it relatively easily on the 23.1 test cluster, so it seems worthy of a backport to 23.1.0 (and is low risk), cc @mgartner


@mgartner mgartner left a comment


LGTM

@yuzefovich yuzefovich merged commit 2319279 into release-23.1.0 Apr 20, 2023
@yuzefovich yuzefovich deleted the blathers/backport-release-23.1.0-100761 branch April 20, 2023 17:14