Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP on fixing the cancellation #74121

Closed
wants to merge 3 commits into from

Conversation

yuzefovich
Copy link
Member

No description provided.

In `flowCtxCancel` scenario there are two possible errors that might be
returned to the reader depending on the exact sequence of events:
- `QueryCanceledError` is used when the flow ctx cancellation is
observed before the stream arrived
- wrapped `context.Canceled` error is used when the inbox handler
goroutine notices the cancellation first and ungracefully shuts down
the stream.

Previously, we assumed that only the latter could occur, and we allow
for either.

Release note: None
@yuzefovich yuzefovich added the do-not-merge bors won't merge a PR with this label. label Dec 21, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

This commit fixes a long standing bug in the distributed vectorized
query shutdown where in case of a graceful completion of the flow on one
node, we might get an error on another node resulting in the ungraceful
termination of the query. This was caused by the fact that on remote
nodes the last outbox to exit would cancel the flow context; however,
when instantiating `FlowStream` RPC the outboxes used a child context of
the flow context, so that "graceful" cancellation of the flow context
would cause the inbox to get an ungraceful termination of the gRPC
stream. As a result, the whole query could get "context canceled" error.

I believe this bug was introduced by me over two years ago because
I didn't fully understand how the shutdown should work, and in
particular I was missing that when an inbox observes the flow context
cancellation, it should terminate the `FlowStream` RPC ungracefully in
order to propagate the ungracefullness to the other side of the stream.
This shortcoming was fixed in the previous commit.

The shutdown protocol is now as follows:
- on the gateway node we keep on canceling the flow context at the very
end of the query execution. It doesn't matter whether the query resulted
in an error or not, and doing so allows us to ensure that everything
exits on the gateway node. This behavior is already present.
- due to the fix in a previous commit, that flow context cancellation
terminates ungracefully all still open gRPC streams for `FlowStream` RPC
for which the gateway node is the inbox host.
- the outboxes on the remote nodes get the ungraceful termination and
cancel the flow context of their hosts. This, in turn, would trigger
propagation of the ungraceful termination on other gRPC streams, etc.

Release note (bug fix): Previously, CockroachDB could return a spurious
"context canceled" error for a query that actually succeeded in
extremely rare cases, and this has now been fixed.
@yuzefovich
Copy link
Member Author

I've stressed distsql_union logic test file with over 13k successful runs and no failures.

@yuzefovich yuzefovich closed this Dec 22, 2021
@yuzefovich yuzefovich deleted the debug-ctx-v2 branch December 22, 2021 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge bors won't merge a PR with this label.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants