Commit
This commit fixes a long-standing bug in the distributed vectorized query shutdown: when the flow completed gracefully on one node, another node could get an error, resulting in the ungraceful termination of the query. This was caused by the fact that on remote nodes the last outbox to exit would cancel the flow context; however, when instantiating the `FlowStream` RPC the outboxes used a child context of the flow context, so a "graceful" cancellation of the flow context would cause the inbox to observe an ungraceful termination of the gRPC stream. As a result, the whole query could get a "context canceled" error.

I believe this bug was introduced by me over two years ago because I didn't fully understand how the shutdown should work. In particular, I was missing that when an inbox observes the flow context cancellation, it should terminate the `FlowStream` RPC ungracefully in order to propagate the ungracefulness to the other side of the stream. This shortcoming was fixed in the previous commit.

Another possible bug was caused by the outbox canceling its own context in case of a graceful shutdown. As mentioned above, the `FlowStream` RPC was issued using the outbox context, so there was a possibility of a race between the `CloseSend` call being delivered to the inbox (graceful termination) and the context of the RPC being canceled (ungraceful termination).

Both of these problems are now fixed, and the shutdown protocol is now as follows:
- on the gateway node we keep on canceling the flow context at the very end of the query execution. It doesn't matter whether the query resulted in an error or not, and doing so allows us to ensure that everything exits on the gateway node. This behavior is already present.
- due to the fix in a previous commit, that flow context cancellation ungracefully terminates all still-open gRPC streams for the `FlowStream` RPC for which the gateway node is the inbox host.
- the outboxes on the remote nodes get the ungraceful termination and cancel the flow context of their hosts. This, in turn, triggers propagation of the ungraceful termination on other gRPC streams, etc.
- whenever an outbox exits gracefully, it cancels its own context, but the gRPC stream uses the flow context, so the stream is still alive. I debated a bit whether we want to keep this outbox context cancellation in case of a graceful completion and decided to keep it to minimize the scope of changes.

Release note (bug fix): Previously, in extremely rare cases, CockroachDB could return a spurious "context canceled" error for a query that had actually succeeded. This is now fixed.