flowinfra: cancel remote flows when node is drained #82752

yuzefovich · 2022-06-10T21:10:25Z

This commit fixes an oversight in the draining process of the DistSQL
flows. Previously, it was possible for some flows to keep on running
even after server.shutdown.query_wait has passed (which acts as
a grace period to allow queries to complete). This only affects the
distributed queries since local queries are already canceled when the
connections to the node being drained are interrupted.

This commit makes it so that the flow registry actively cancels all
still running flows after the query wait grace period. This is done by
canceling the context of the flow. As a result, distributed queries that
have flows on the node being drained will now result in an error
(previously, they could stall the draining process until they
completed).

Additionally, this commit fixes an oversight introduced in
5ff1974 so that all flows (except for
fully-local queries) get registered with the flow registry. This matters
for remote flows that don't have any inbound connections (e.g. a
SELECT count(*) query or a CDC flow), which would previously bypass
the flow registry altogether.

In order to have an escape hatch in case the new behavior becomes
problematic, a new private cluster setting is introduced that can
disable the new behavior of canceling the still running flows.

Fixes: #82765.

Release note (bug fix): When a CockroachDB node is being drained, all
queries that are still running on that node are now forcefully canceled
after waiting for the server.shutdown.query_wait period.

yuzefovich · 2022-06-10T21:11:49Z

cc @miretskiy - curious whether you could cherry-pick this PR and try it with your test (without the change to the context in the change aggregator).

cucaroach

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @yuzefovich)

pkg/sql/distsql/server.go line 158 at r1 (raw file):

	"determines whether the queries that are still running on the node that "+
		"is being drained after waiting for 'server.shutdown.query_wait' are "+
		"forcefully canceled",

I think this might read better:
determines whether queries that are still running on a node being drained are forcefully canceled after waiting the 'server.shutdown.query_wait' period.

yuzefovich

Added a release note.

TFTR!

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @mgartner)

pkg/sql/distsql/server.go line 158 at r1 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

I think this might read better:
determines whether queries that are still running on a node being drained are forcefully canceled after waiting the 'server.shutdown.query_wait' period.

I agree, done.

yuzefovich · 2022-08-04T21:58:56Z

Need to increase the max number of settings.

yuzefovich · 2022-08-04T21:59:01Z

bors r-

yuzefovich · 2022-08-04T21:59:28Z

bors r+

craig · 2022-08-04T22:02:16Z

Canceled.

craig · 2022-08-04T22:30:58Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2022-08-04T23:51:42Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2022-08-05T00:55:32Z

Build succeeded:

Bazel Essential CI (Cockroach)

yuzefovich force-pushed the flow-drain branch from 1b4ec77 to 5487e84 Compare June 10, 2022 21:10

yuzefovich force-pushed the flow-drain branch from 5487e84 to d581e5e Compare June 10, 2022 22:30

yuzefovich force-pushed the flow-drain branch 2 times, most recently from dacf0e6 to 06d4c89 Compare August 3, 2022 01:47

yuzefovich mentioned this pull request Aug 3, 2022

flowinfra: preserve flowRetryableError correctly across network #85500

Merged

yuzefovich force-pushed the flow-drain branch 3 times, most recently from 163b0ef to 7ae1e25 Compare August 3, 2022 21:01

yuzefovich marked this pull request as ready for review August 3, 2022 21:01

yuzefovich requested a review from a team as a code owner August 3, 2022 21:01

yuzefovich requested review from mgartner and cucaroach August 3, 2022 21:01

cucaroach approved these changes Aug 4, 2022

yuzefovich force-pushed the flow-drain branch 2 times, most recently from e6ef8d7 to 71c78d1 Compare August 4, 2022 20:57

yuzefovich force-pushed the flow-drain branch from 71c78d1 to b27844f Compare August 4, 2022 21:59

craig bot merged commit 2087103 into cockroachdb:master Aug 5, 2022

yuzefovich deleted the flow-drain branch August 5, 2022 00:57

yuzefovich mentioned this pull request Sep 19, 2022

release-22.1: flowinfra: cancel remote flows when node is drained #88150

Merged
