colrpc: propagate the flow cancellation as ungraceful for FlowStream RPC #73887
Conversation
This commit fixes an oversight in the cancellation protocol of the vectorized inbox/outbox communication. Previously, when the flow context of the inbox host was canceled (indicating that the whole query should be canceled), we would propagate it as a graceful completion of the `FlowStream` RPC, which would result in the outbox cancelling only its own subtree on the remote node. However, what we ought to do is propagate such cancellation as an ungraceful RPC completion so that the outbox also cancels the flow context of its own host.

In some rare cases the old behavior could result in some flows being stuck forever (until a node is restarted) because they would get blocked on producing data when their consumer had already exited.

The behavior in this fix is what we already have in place for the row-by-row engine (see `processInboundStreamHelper` in `flowinfra/inbound.go`).

Release note (bug fix): A bug that in some rare cases caused distributed queries to get stuck after an ungraceful shutdown has been fixed. "Ungraceful" here means due to a `statement_timeout` (most likely) or a node crash.
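To make the graceful/ungraceful distinction concrete, here is a minimal Go sketch of the inbox-side choice. The names (`inbox`, `handleStream`, `drainSignal`) are hypothetical stand-ins for illustration, not the actual `colrpc` API:

```go
package colrpcsketch

import "context"

// inbox is a hypothetical stand-in for the vectorized Inbox.
type inbox struct {
	// drainSignal is closed when the consumer gracefully drains the stream.
	drainSignal chan struct{}
}

// handleStream sketches the server side of the FlowStream RPC. The fix is
// in the first case: returning the context error completes the RPC
// ungracefully, so the outbox cancels the flow context of its own host.
// Returning nil there (the old behavior) read as a graceful completion,
// and the outbox tore down only its own subtree on the remote node.
func (i *inbox) handleStream(flowCtx context.Context) error {
	select {
	case <-flowCtx.Done():
		// The inbox host's flow context was canceled: the whole query
		// should be canceled, so propagate an ungraceful completion.
		return flowCtx.Err()
	case <-i.drainSignal:
		// Graceful completion: the consumer no longer needs data, but
		// the rest of the remote flow may finish on its own.
		return nil
	}
}
```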
The most recent changes to the cancellation protocol were in #63772. I think that PR fixed one problem, and I'm hoping this PR fixes the protocol for good. I still don't understand why in https://github.com/cockroachlabs/support/issues/1326 some goroutines ended up getting stuck, but this fix makes sense to me, and I don't know how to get to the very bottom of the behavior in that issue given the very rare occurrences of the problem.
Awesome work tracking this down!
Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @yuzefovich)
TFTRs! bors r+
Build failed (retrying...)
Build succeeded
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from 2203bae to blathers/backport-release-21.1-73887: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []. You may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.1.x failed. See errors above.

error creating merge commit from 2203bae to blathers/backport-release-21.2-73887: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []. You may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
This commit fixes an oversight in the cancellation protocol of the vectorized inbox/outbox communication. Previously, when the flow context of the inbox host was canceled (indicating that the whole query should be canceled), we would propagate it as a graceful completion of the `FlowStream` RPC, which would result in the outbox cancelling only its own subtree on the remote node. However, what we ought to do is propagate such cancellation as an ungraceful RPC completion so that the outbox also cancels the flow context of its own host.

In some rare cases the old behavior could result in some flows being stuck forever (until a node is restarted) because they would get blocked on producing data when their consumer had already exited.

The behavior in this fix is what we already have in place for the row-by-row engine (see `processInboundStreamHelper` in `flowinfra/inbound.go`).

Fixes: https://github.com/cockroachlabs/support/issues/1326.
Fixes: #72445.

Release note (bug fix): A bug that in some rare cases caused distributed queries to get stuck after an ungraceful shutdown has been fixed. "Ungraceful" here means due to a `statement_timeout` (most likely) or a node crash.
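For the other side of the protocol, a hedged sketch of how the outbox's reaction to the two completion statuses could look, based only on the behavior described above. The names (`outbox`, `onStreamCompleted`, the two cancel functions) are illustrative, not the real implementation:

```go
package colrpcsketch

import "context"

// outbox is a hypothetical stand-in for the vectorized Outbox.
type outbox struct {
	// flowCtxCancel cancels the flow context of the outbox's host,
	// tearing down the entire flow running on this node.
	flowCtxCancel context.CancelFunc
	// outboxCtxCancel cancels only this outbox's own subtree.
	outboxCtxCancel context.CancelFunc
}

// onStreamCompleted reacts to the terminal status of the FlowStream RPC.
func (o *outbox) onStreamCompleted(err error) {
	if err != nil {
		// Ungraceful completion: the whole query is being canceled, so
		// the flow context of this host must be canceled too.
		o.flowCtxCancel()
		return
	}
	// Graceful completion: only this outbox's subtree needs to stop;
	// the rest of the flow on this host keeps running.
	o.outboxCtxCancel()
}
```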