
colrpc: propagate the flow cancellation as ungraceful for FlowStream RPC #73887

Merged: 1 commit merged into cockroachdb:master on Dec 16, 2021

Conversation

@yuzefovich (Member) commented Dec 16, 2021

This commit fixes an oversight in the cancellation protocol of the
vectorized inbox/outbox communication. Previously, when the flow context
of the inbox host was canceled (indicating that the whole query should
be canceled), we would propagate it as a graceful completion of the
`FlowStream` RPC, which would result in the outbox cancelling only its
own subtree on the remote node. However, we ought to propagate such a
cancellation as an ungraceful RPC completion so that the outbox also
cancels the flow context of its own host.

In some rare cases the old behavior could leave flows stuck forever
(until the node was restarted) because they would block producing data
after their consumer had already exited.

The behavior in this fix is what we already have in place for the
row-by-row engine (see `processInboundStreamHelper` in
`flowinfra/inbound.go`).

Fixes: https://github.com/cockroachlabs/support/issues/1326.
Fixes: #72445.

Release note (bug fix): A bug with the ungraceful shutdown of
distributed queries that could occur in rare cases has been fixed.
"Ungraceful" here means a shutdown caused by `statement_timeout` (most
likely) or by a node crash.
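
A minimal sketch of the graceful vs. ungraceful distinction described
above, assuming a simplified server-streaming handler; the names
`runInbox`, `flowCtx`, and `drained` are hypothetical stand-ins and do
not mirror the actual colrpc types:

```go
// Illustrative sketch (not the actual colrpc code) of how an inbox-side
// FlowStream handler can report graceful vs. ungraceful completion: a gRPC
// server handler that returns nil completes the stream gracefully, while
// returning an error (here, the flow context's cancellation error) surfaces
// on the remote outbox as an ungraceful completion.
package main

import (
	"context"
	"fmt"
	"time"
)

// runInbox stands in for the server-side FlowStream handler on the inbox
// host. It waits until either the inbox has consumed all of the data
// (graceful) or the flow context of its host is canceled, meaning the
// whole query should stop.
func runInbox(flowCtx context.Context, drained <-chan struct{}) error {
	select {
	case <-drained:
		// Returning nil completes the RPC gracefully; the outbox then only
		// shuts down its own subtree on the remote node.
		return nil
	case <-flowCtx.Done():
		// Returning the cancellation error completes the RPC ungracefully;
		// the outbox should react by canceling the flow context of its own
		// host, not just its own subtree.
		return flowCtx.Err()
	}
}

func main() {
	flowCtx, cancel := context.WithCancel(context.Background())
	drained := make(chan struct{})

	// Simulate the whole query being canceled before the inbox finishes.
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel()
	}()

	if err := runInbox(flowCtx, drained); err != nil {
		// The outbox side would observe a non-EOF stream error here and
		// cancel its host's flow context.
		fmt.Println("ungraceful completion:", err)
	} else {
		fmt.Println("graceful completion")
	}
}
```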

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich (Member, Author)

The most recent changes for the cancellation protocol have been in #63772. I think that PR fixed one problem, and I'm hoping that this PR fixes the protocol for good.

I still don't understand why in https://github.com/cockroachlabs/support/issues/1326 some goroutines ended up getting stuck, but this fix makes sense to me, and I don't see how to get to the very bottom of that behavior given how rarely the problem occurs.
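
To illustrate the failure mode being discussed, here is a generic Go
sketch (not the actual outbox code; `produce` and the channel layout are
purely hypothetical) of a producer goroutine that blocks forever once its
consumer has exited, unless the flow's cancellation is propagated to it:

```go
// Generic sketch of the hang described in the PR: a producer blocked on
// sending to a consumer that has already gone away. Without observing a
// cancellation signal, the send blocks forever and the goroutine leaks;
// with the flow context wired in, cancellation unblocks it.
package main

import (
	"context"
	"fmt"
)

// produce pushes rows to out until ctx is canceled. If the consumer stops
// reading and ctx is never canceled (the old, graceful completion path),
// the send on out blocks forever.
func produce(ctx context.Context, out chan<- int) error {
	for i := 0; ; i++ {
		select {
		case out <- i:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan int) // unbuffered: the producer blocks once the consumer stops

	done := make(chan error, 1)
	go func() { done <- produce(ctx, out) }()

	<-out    // consume one row, then "exit" like a consumer whose query ended
	cancel() // propagating the cancellation is what unblocks the producer

	fmt.Println("producer finished with:", <-done)
}
```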

@rytaft (Collaborator) left a comment

:lgtm: Awesome work tracking this down!

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)

@jordanlewis (Member) left a comment


:lgtm_strong:

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @yuzefovich)

@yuzefovich (Member, Author)

TFTRs!

bors r+

@craig (craig bot, Contributor) commented Dec 16, 2021

Build failed (retrying...):

@craig (craig bot, Contributor) commented Dec 16, 2021

Build succeeded:

craig bot merged commit c0fe9cf into cockroachdb:master on Dec 16, 2021
@blathers-crl (bot) commented Dec 16, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 2203bae to blathers/backport-release-21.1-73887: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.1.x failed. See errors above.


error creating merge commit from 2203bae to blathers/backport-release-21.2-73887: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Successfully merging this pull request may close these issues.

sql: goroutines seemingly not cancelling
4 participants