Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flowinfra: cancel remote flows when node is drained #82752

Merged
merged 1 commit into from
Aug 5, 2022

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Jun 10, 2022

This commit fixes an oversight in the draining process of the DistSQL
flows. Previously, it was possible for some flows to keep on running
even after server.shutdown.query_wait has passed (which acts as
a grace period to allow queries to complete). This only affects the
distributed queries since local queries are already canceled when the
connections to the node being drained are interrupted.

This commit makes it so that the flow registry actively cancels all
still running flows after the query wait grace period. This is done by
canceling the context of the flow. As a result, distributed queries that
have flows on the node being drained now will result in an error
(previously, they could stall the draining process until they would
complete).

Additionally, this commit fixes an oversight introduced in
5ff1974 so that all flows (except for
fully-local queries) get registered with the flow registry. This matters
for remote flows that don't have any inbound connections (e.g.
SELECT count(*) query or a CDC flow) which would previously by-pass
the flow registry altogether.

In order to have an escape hatch in case the new behavior becomes
problematic, a new private cluster setting is introduced that can
disable the new behavior of canceling the still running flows.

Fixes: #82765.

Release note (bug fix): When a CockroachDB node is being, all queries
that are still running on that node are now forcefully canceled after
waiting the server.shutdown.query_wait period.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich
Copy link
Member Author

cc @miretskiy - curious if you cherry-pick this PR to try with your test (without the change to the context in the change aggregator).

@yuzefovich yuzefovich force-pushed the flow-drain branch 2 times, most recently from dacf0e6 to 06d4c89 Compare August 3, 2022 01:47
@yuzefovich yuzefovich force-pushed the flow-drain branch 3 times, most recently from 163b0ef to 7ae1e25 Compare August 3, 2022 21:01
@yuzefovich yuzefovich marked this pull request as ready for review August 3, 2022 21:01
@yuzefovich yuzefovich requested a review from a team as a code owner August 3, 2022 21:01
Copy link
Contributor

@cucaroach cucaroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @yuzefovich)


pkg/sql/distsql/server.go line 158 at r1 (raw file):

	"determines whether the queries that are still running on the node that "+
		"is being drained after waiting for 'server.shutdown.query_wait' are "+
		"forcefully canceled",

I think this might read better:
determines whether queries that are still running on a node being drained are forcefully canceled after waiting the 'server.shutdown.query_wait' period.

@yuzefovich yuzefovich force-pushed the flow-drain branch 2 times, most recently from e6ef8d7 to 71c78d1 Compare August 4, 2022 20:57
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a release note.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @mgartner)


pkg/sql/distsql/server.go line 158 at r1 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

I think this might read better:
determines whether queries that are still running on a node being drained are forcefully canceled after waiting the 'server.shutdown.query_wait' period.

I agree, done.

This commit fixes an oversight in the draining process of the DistSQL
flows. Previously, it was possible for some flows to keep on running
even after `server.shutdown.query_wait` has passed (which acts as
a grace period to allow queries to complete). This only affects the
distributed queries since local queries are already canceled when the
connections to the node being drained are interrupted.

This commit makes it so that the flow registry actively cancels all
still running flows after the query wait grace period. This is done by
canceling the context of the flow. As a result, distributed queries that
have flows on the node being drained now will result in an error
(previously, they could stall the draining process until they would
complete).

Additionally, this commit fixes an oversight introduced in
5ff1974 so that all flows (except for
fully-local queries) get registered with the flow registry. This matters
for remote flows that don't have any inbound connections (e.g.
`SELECT count(*)` query or a CDC flow) which would previously by-pass
the flow registry altogether.

In order to have an escape hatch in case the new behavior becomes
problematic, a new private cluster setting is introduced that can
disable the new behavior of canceling the still running flows.

Release note (bug fix): When a CockroachDB node is being, all queries
that are still running on that node are now forcefully canceled after
waiting the `server.shutdown.query_wait` period.
@yuzefovich
Copy link
Member Author

Need to increase the max number of settings.

@yuzefovich
Copy link
Member Author

bors r-

@yuzefovich
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 4, 2022

Canceled.

@craig
Copy link
Contributor

craig bot commented Aug 4, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Aug 4, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Aug 5, 2022

Build succeeded:

@craig craig bot merged commit 2087103 into cockroachdb:master Aug 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

flowinfra: Do not inhibit server shutdown
3 participants