-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-23.1.0: flowinfra: fix a rare bug that could make drain be stuck forever #101884
release-23.1.0: flowinfra: fix a rare bug that could make drain be stuck forever #101884
Conversation
This commit fixes now stale comment as well as allocates one map precisely. Release note: None
This commit fixes a long-standing bug around the flow registry that could make the drain loop be stuck forever. Drain process works by draining several components in a loop until each component reports that there was no more remaining work when the drain iteration was initiated. One of the components to be drained is the flow registry: namely, we want to make sure that there are no more remote flows present on the node. We track that by having a map from the `FlowID` to the `flowEntry` object. Previously, it was possible for a `flowEntry` to become "stale" and remain in the map forever. In particular, this was the case when - `ConnectInboundStream` was called before the flow was scheduled - the gRPC "handshake" failed in `ConnectInboundStream` (most likely due to a network fluke) - the flow never arrived (perhaps it was canceled before `Flow.StartInternal` is called), or it arrived too late when the registry was marked as "draining". With such a scenario we would create a `flowEntry` with ref count of zero and add it to the map in `ConnectInboundStream`, but no one would ever remove it. This commit fixes this oversight by adjusting the ref counting logic a bit so that we always hold a reference throughout (and only until the end of) `ConnectInboundStream`. Release note (bug fix): A rare bug with distributed plans shutdown has been fixed that previously could make the graceful drain of cockroach nodes be retrying forever. The bug has been present since before 22.1. The drain process is affected by this bug if you see messages in the logs like `drain details: distSQL execution flows:` with non-zero number of flows that isn't going down over long period of time.
76173fa
to
b39beaf
Compare
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
|
I thought that the bug this PR fixes is pretty rare, but it seems likely that we're hitting it relatively easily on the 23.1 test cluster, so it seems worthy of a backport to 23.1.0 (and is of less risk), cc @mgartner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Backport 2/2 commits from #100761 on behalf of @yuzefovich.
/cc @cockroachdb/release
This commit fixes a long-standing bug around the flow registry that
could make the drain loop be stuck forever. Drain process works by
draining several components in a loop until each component reports that
there was no more remaining work when the drain iteration was initiated.
One of the components to be drained is the flow registry: namely, we
want to make sure that there are no more remote flows present on the
node. We track that by having a map from the
FlowID
to theflowEntry
object.
Previously, it was possible for a
flowEntry
to become "stale" andremain in the map forever. In particular, this was the case when
ConnectInboundStream
was called before the flow was scheduledConnectInboundStream
(most likely dueto a network fluke)
Flow.StartInternal
is called), or it arrived too late when theregistry was marked as "draining".
With such a scenario we would create a
flowEntry
with ref count ofzero and add it to the map in
ConnectInboundStream
, but no one wouldever remove it. This commit fixes this oversight by adjusting the ref
counting logic a bit so that we always hold a reference throughout (and
only until the end of)
ConnectInboundStream
.Fixes: #100710.
Release note (bug fix): A rare bug with distributed plans shutdown has
been fixed that previously could make the graceful drain of cockroach
nodes be retrying forever. The bug has been present since before 22.1.
The drain process is affected by this bug if you see messages in the
logs like
drain details: distSQL execution flows:
with non-zero numberof flows that isn't going down over long period of time.
Release justification: bug fix.