
release-23.1.0: flowinfra: fix a rare bug that could make drain be stuck forever #101884

Merged
merged 2 commits into release-23.1.0 from blathers/backport-release-23.1.0-100761 on Apr 20, 2023

Conversation


@blathers-crl blathers-crl bot commented Apr 19, 2023

Backport 2/2 commits from #100761 on behalf of @yuzefovich.

/cc @cockroachdb/release


This commit fixes a long-standing bug around the flow registry that
could leave the drain loop stuck forever. The drain process works by
draining several components in a loop until each component reports that
it had no remaining work when the drain iteration was initiated.
One of the components to be drained is the flow registry: namely, we
want to make sure that there are no remote flows left on the
node. We track that with a map from the `FlowID` to the `flowEntry`
object.
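
As a rough illustration of this bookkeeping, here is a minimal, self-contained Go sketch. The type and field names (`flowsketch`, `FlowID`, `flowEntry`, `flowRegistry`, `numRemainingFlows`) are simplified stand-ins, not the actual definitions in `pkg/sql/flowinfra`:

```go
package flowsketch

import "sync"

// FlowID stands in for the real flow identifier type.
type FlowID string

// flowEntry is a simplified placeholder for the per-flow bookkeeping object;
// refCount tracks how many parties currently need the entry to stay in the map.
type flowEntry struct {
	refCount int
}

// flowRegistry tracks the remote flows known to this node.
type flowRegistry struct {
	mu    sync.Mutex
	flows map[FlowID]*flowEntry
}

// numRemainingFlows is what a drain iteration would consult: the drain loop
// can only report "no remaining work" once no remote flows stay registered.
func (r *flowRegistry) numRemainingFlows() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.flows)
}
```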

Previously, it was possible for a `flowEntry` to become "stale" and
remain in the map forever. In particular, this was the case when

  • `ConnectInboundStream` was called before the flow was scheduled
  • the gRPC "handshake" failed in `ConnectInboundStream` (most likely due
    to a network fluke)
  • the flow never arrived (perhaps it was canceled before
    `Flow.StartInternal` was called), or it arrived too late, when the
    registry was already marked as "draining".

In such a scenario we would create a `flowEntry` with a ref count of
zero and add it to the map in `ConnectInboundStream`, but no one would
ever remove it. This commit fixes that oversight by adjusting the
ref-counting logic so that we always hold a reference throughout (and
only until the end of) `ConnectInboundStream`.
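
A sketch of the adjusted ref counting, reusing the simplified `flowRegistry`/`flowEntry` types from the sketch above (the `handshake` parameter is a placeholder; the real `ConnectInboundStream` signature, locking, and error handling differ):

```go
// ConnectInboundStream (simplified) now takes a reference on the entry for
// its entire duration and drops it on the way out. If that reference was the
// last one, e.g. because the handshake failed or the flow never arrived,
// the entry is deleted instead of lingering in the map with a zero ref count.
func (r *flowRegistry) ConnectInboundStream(id FlowID, handshake func() error) error {
	r.mu.Lock()
	entry, ok := r.flows[id]
	if !ok {
		if r.flows == nil {
			r.flows = make(map[FlowID]*flowEntry)
		}
		entry = &flowEntry{}
		r.flows[id] = entry
	}
	// Hold a reference before unlocking so the entry cannot disappear while
	// this call is in flight.
	entry.refCount++
	r.mu.Unlock()

	// Always release the reference at the end of the call.
	defer func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		entry.refCount--
		if entry.refCount == 0 {
			delete(r.flows, id)
		}
	}()

	// Placeholder for waiting on the flow and performing the gRPC handshake.
	return handshake()
}
```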

Fixes: #100710.

Release note (bug fix): A rare bug in the shutdown of distributed plans has
been fixed that could previously make the graceful drain of cockroach
nodes retry forever. The bug has been present since before 22.1.
The drain process is affected by this bug if you see messages in the
logs like `drain details: distSQL execution flows:` with a non-zero number
of flows that does not decrease over a long period of time.


Release justification: bug fix.

This commit also fixes a now-stale comment and allocates one map with an
exact size.

Release note: None
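
Purely illustrative of "allocates one map with an exact size" (the actual map and count touched by the commit may differ), again reusing the sketch types from above:

```go
// makeFlowMap sizes the map once via the capacity hint when the number of
// entries is known up front, avoiding incremental growth and rehashing.
func makeFlowMap(numFlows int) map[FlowID]*flowEntry {
	return make(map[FlowID]*flowEntry, numFlows)
}
```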
@blathers-crl blathers-crl bot requested a review from a team April 19, 2023 23:15
@blathers-crl blathers-crl bot requested review from a team as code owners April 19, 2023 23:15
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1.0-100761 branch from 76173fa to b39beaf on April 19, 2023 23:15
@blathers-crl blathers-crl bot requested a review from cucaroach April 19, 2023 23:15
@blathers-crl blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels on Apr 19, 2023

blathers-crl bot commented Apr 19, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot requested a review from msirek April 19, 2023 23:15
@cockroach-teamcity

This change is Reviewable

@yuzefovich

I thought that the bug this PR fixes is pretty rare, but it seems likely that we're hitting it relatively easily on the 23.1 test cluster, so it seems worthy of a backport to 23.1.0 (and is low risk), cc @mgartner


@mgartner mgartner left a comment


LGTM

@yuzefovich yuzefovich merged commit 2319279 into release-23.1.0 Apr 20, 2023
@yuzefovich yuzefovich deleted the blathers/backport-release-23.1.0-100761 branch April 20, 2023 17:14