Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colexec: fix recent regression with cancellation of inboxes #79716

Merged
merged 1 commit into from
Apr 11, 2022

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Apr 9, 2022

Recent change 773d9ca fixed the way
inboxes handle regular query errors so that now the gRPC streams are not
broken whenever a query error is encountered. However, that change
introduced a regression - it is now possible that the inbox handler
goroutine (the one instantiated to handle FlowStream gRPC call) never
exits when the inbox is an input to the parallel unordered synchronizer.

In particular, the following sequence of events can happen:

  1. the reader goroutine of the inbox receives an error from the
    corresponding outbox
  2. this error is propagated to one of the input goroutines of the
    unordered synchronizer via a panic. Notably, this is considered
    a "graceful" termination from the perspective of the gRPC stream
    handling, so the handler goroutine is not notified of this error, and
    the inbox is not closed. It is expected that the inbox will be drained
    which will close the handler goroutine.
  3. however, the synchronizer input goroutine currently simply receives
    the error, propagates it to the coordinator goroutine, and exits,
    without performing the draining.

Thus, we get into such a state that the inbox is never drained, so the
handler goroutine will stay alive forever. In particular, this will
block Flow.Wait calls, and Flow.Cleanup will never be called. This
could leak, for example, to leaking file descriptors used by the
temporary disk storage in the vectorized engine.

This fix is quite simple - instead of exiting in step 3, the
synchronizer's input goroutine should just proceed to draining, and only
exit once draining is performed. I believe this was always the
intention, but it didn't really matter until the fix in
773d9ca because draining of the inbox
was a noop since the gRPC stream was prematurely broken.

Fixes: #79469.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich yuzefovich marked this pull request as ready for review April 11, 2022 17:23
@yuzefovich yuzefovich requested review from michae2, cucaroach and a team April 11, 2022 17:23
Copy link
Contributor

@cucaroach cucaroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2)

Recent change 773d9ca fixed the way
inboxes handle regular query errors so that now the gRPC streams are not
broken whenever a query error is encountered. However, that change
introduced a regression - it is now possible that the inbox handler
goroutine (the one instantiated to handle FlowStream gRPC call) never
exits when the inbox is an input to the parallel unordered synchronizer.

In particular, the following sequence of events can happen:
1. the reader goroutine of the inbox receives an error from the
corresponding outbox
2. this error is propagated to one of the input goroutines of the
unordered synchronizer via a panic. Notably, this is considered
a "graceful" termination from the perspective of the gRPC stream
handling, so the handler goroutine is not notified of this error, and
the inbox is not closed. It is expected that the inbox will be drained
which will close the handler goroutine.
3. however, the synchronizer input goroutine currently simply receives
the error, propagates it to the coordinator goroutine, and exits,
without performing the draining.

Thus, we get into such a state that the inbox is never drained, so the
handler goroutine will stay alive forever. In particular, this will
block `Flow.Wait` calls, and `Flow.Cleanup` will never be called. This
could leak, for example, to leaking file descriptors used by the
temporary disk storage in the vectorized engine.

This fix is quite simple - instead of exiting in step 3, the
synchronizer's input goroutine should just proceed to draining, and only
exit once draining is performed. I believe this was always the
intention, but it didn't really matter until the fix in
773d9ca because draining of the inbox
was a noop since the gRPC stream was prematurely broken.

Release note: None
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed a typo in the comment.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @michae2)

@craig
Copy link
Contributor

craig bot commented Apr 11, 2022

Build succeeded:

@craig craig bot merged commit e4a9f66 into cockroachdb:master Apr 11, 2022
@yuzefovich yuzefovich deleted the cancellation-fix branch April 11, 2022 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: tpch_concurrency failed
4 participants