colexec: fix recent regression with cancellation of inboxes #79716

yuzefovich · 2022-04-09T05:59:36Z

Recent change 773d9ca fixed the way
inboxes handle regular query errors so that now the gRPC streams are not
broken whenever a query error is encountered. However, that change
introduced a regression - it is now possible that the inbox handler
goroutine (the one instantiated to handle FlowStream gRPC call) never
exits when the inbox is an input to the parallel unordered synchronizer.

In particular, the following sequence of events can happen:

1. the reader goroutine of the inbox receives an error from the
corresponding outbox
2. this error is propagated to one of the input goroutines of the
unordered synchronizer via a panic. Notably, this is considered
a "graceful" termination from the perspective of the gRPC stream
handling, so the handler goroutine is not notified of this error, and
the inbox is not closed. It is expected that the inbox will be drained
which will close the handler goroutine.
3. however, the synchronizer input goroutine currently simply receives
the error, propagates it to the coordinator goroutine, and exits,
without performing the draining.

Thus, we get into such a state that the inbox is never drained, so the
handler goroutine will stay alive forever. In particular, this will
block Flow.Wait calls, and Flow.Cleanup will never be called. This
could lead, for example, to leaking file descriptors used by the
temporary disk storage in the vectorized engine.
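
The hang described above can be reproduced in miniature. This is a minimal sketch and not the actual colexec code: the `inbox` type, `handle`, `buggyInput`, and `handlerLeaks` are hypothetical names, the error travels over a channel rather than via a panic, and a `sync.WaitGroup` stands in for `Flow.Wait`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// inbox is a hypothetical stand-in for the real inbox. Its handler goroutine
// exits only once the inbox has been drained.
type inbox struct {
	drained chan struct{}
}

// handle models the FlowStream handler goroutine: it blocks until draining.
func (i *inbox) handle(wg *sync.WaitGroup) {
	defer wg.Done()
	<-i.drained
}

// buggyInput models the synchronizer input goroutine before the fix: it
// forwards the error to the coordinator and exits without draining the inbox.
func buggyInput(_ *inbox, errCh chan<- error) {
	errCh <- errors.New("query error from outbox")
}

// handlerLeaks reports whether a Flow.Wait-like call is still blocked after a
// short timeout (an arbitrary value chosen for this sketch).
func handlerLeaks() bool {
	in := &inbox{drained: make(chan struct{})}
	var wg sync.WaitGroup
	wg.Add(1)
	go in.handle(&wg)

	errCh := make(chan error, 1)
	go buggyInput(in, errCh)
	<-errCh // the coordinator receives the error

	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()
	select {
	case <-done:
		return false
	case <-time.After(100 * time.Millisecond):
		return true // nobody drained the inbox, so the handler is stuck
	}
}

func main() {
	fmt.Println("handler goroutine leaked:", handlerLeaks())
}
```

Because nothing ever closes `drained`, the handler goroutine in this model never returns, which mirrors why `Flow.Wait` would block.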

This fix is quite simple - instead of exiting in step 3, the
synchronizer's input goroutine should just proceed to draining, and only
exit once draining is performed. I believe this was always the
intention, but it didn't really matter until the fix in
773d9ca because draining of the inbox
was a noop since the gRPC stream was prematurely broken.
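
The shape of the fix can be sketched as follows. This is an illustrative model under assumptions, not the code from this PR: `inbox`, `drain`, `fixedInput`, and `runFlow` are hypothetical names, the error is passed over a channel instead of a panic, and a `sync.WaitGroup` plays the role of `Flow.Wait`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// inbox is a hypothetical stand-in for the real inbox. Its handler goroutine
// exits only once the inbox has been drained.
type inbox struct {
	drainOnce sync.Once
	drained   chan struct{}
}

// drain marks the inbox as drained; it is safe to call more than once.
func (i *inbox) drain() {
	i.drainOnce.Do(func() { close(i.drained) })
}

// handle models the FlowStream handler goroutine: it blocks until draining.
func (i *inbox) handle(wg *sync.WaitGroup) {
	defer wg.Done()
	<-i.drained
}

// fixedInput models the synchronizer input goroutine after the fix: it still
// forwards the error to the coordinator, but then proceeds to draining and
// exits only once draining is done.
func fixedInput(i *inbox, errCh chan<- error, wg *sync.WaitGroup) {
	defer wg.Done()
	errCh <- errors.New("query error from outbox")
	i.drain()
}

// runFlow starts both goroutines, waits for them (the Flow.Wait analogue),
// and returns the query error seen by the coordinator.
func runFlow() error {
	in := &inbox{drained: make(chan struct{})}
	var wg sync.WaitGroup
	wg.Add(2)
	go in.handle(&wg)

	errCh := make(chan error, 1)
	go fixedInput(in, errCh, &wg)
	err := <-errCh

	wg.Wait() // returns because draining released the handler goroutine
	return err
}

func main() {
	fmt.Println("flow finished with:", runFlow())
}
```

In this model the query error still reaches the coordinator, and the wait returns because the input goroutine drains before exiting.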

Fixes: #79469.

Release note: None

cucaroach

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2)

yuzefovich

Fixed a typo in the comment.

TFTR!

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @michae2)

craig · 2022-04-11T20:21:29Z

Build succeeded:

GitHub CI (Cockroach)

yuzefovich mentioned this pull request Apr 9, 2022

roachtest: tpch_concurrency failed #79469

Closed

yuzefovich force-pushed the cancellation-fix branch from 343776a to bb91ffe Compare April 11, 2022 17:22

yuzefovich marked this pull request as ready for review April 11, 2022 17:23

yuzefovich requested review from michae2, cucaroach and a team April 11, 2022 17:23

cucaroach approved these changes Apr 11, 2022

yuzefovich force-pushed the cancellation-fix branch from bb91ffe to ecd6403 Compare April 11, 2022 18:25

yuzefovich commented Apr 11, 2022

michae2 approved these changes Apr 11, 2022

craig bot merged commit e4a9f66 into cockroachdb:master Apr 11, 2022

yuzefovich deleted the cancellation-fix branch April 11, 2022 20:24

yuzefovich mentioned this pull request Apr 13, 2022

roachtest: tpch_concurrency failed #79870

Closed
