cli: compute --drain-wait based on cluster setting values #98390

rafiss · 2023-03-10T18:37:21Z

This PR includes three commits related to draining:

cli: compute --drain-wait based on cluster setting values

Release note (cli change): The --drain-wait argument for the drain command will be automatically increased if the command detects that it is smaller than the sum of server.shutdown.drain_wait, server.shutdown.connection_wait, twice server.shutdown.query_wait, and server.shutdown.lease_transfer_wait.

This recommendation was already documented, but now the advice will be applied automatically.

sql: fix check for closing connExecutor during draining

This fixes a minor bug in which the connection would not get closed at
the earliest possible time during server shutdown.

The connection is supposed to be closed as soon as we handle a Sync
message when the conn_executor is in the draining state and not in a
transaction. Since the transaction state was checked before state
transitions occurred, this would cause the connection to remain open for
an extra bit of time.

roachtest: enhance drain test

The test now does much more:

- Check that --drain-wait is automatically increased if it is set lower
  than the cluster settings require.
- Check that the /health?ready=1 endpoint fails during the drain_wait
  period.
- Check for the proper error message during the connection_wait phase.
- Check for the proper error message when trying to begin a new
  query/transaction during the query_wait phase.
- Check that an open transaction is allowed to continue during the
  query_wait phase.
- Check for the proper error message when a query is canceled during
  shutdown.

blathers-crl · 2023-03-10T18:37:26Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

rafiss · 2023-03-10T18:38:21Z

I'm investigating where this should be tested. It looks like roachtest/tests/drain.go might be best.

ZhouXing19

I reviewed the first and third commits. For the second commit, I'm not familiar with txn state transitions, so I would like to have others finalize the review.

Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @knz, @rafiss, and @srosenberg)

-- commits line 5 at r1:
According to the code, if the --drain-wait is zero, we don't auto-recompute the wait time. Is this intentional? If so, maybe here it can be updated to "if the command detects that it's of positive duration but smaller than the sum ..."

pkg/cli/rpc_node_shutdown.go line 95 at r1 (raw file):

			"cluster settings require a value of at least %s; using the larger value",
			drainCtx.drainWait, minWait)
		drainCtx.drainWait = minWait + 10*time.Second

I'm not sure why we have these additional 10 seconds here. Also, should we instead log using the larger value {minWait + 10 * time.Second}?

pkg/cmd/roachtest/tests/drain.go line 258 at r2 (raw file):

		}
		if err == nil {
			return errors.New("expected pg_sleep query to fail")

Is this code reachable? If err == nil, then testutils.IsError(err, "(query execution canceled|server is shutting down|connection reset by peer|unexpected EOF)") is false, and it will just reach return nil.

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @knz, @srosenberg, and @ZhouXing19)

-- commits line 5 at r1:

Previously, ZhouXing19 (Jane Xing) wrote…

According to the code, if the --drain-wait is zero, we don't auto-recompute the wait time. Is this intentional? If so, maybe here it can be updated to "if the command detects that it's of positive duration but smaller than the sum ..."

yes, i should document that. 0 just means "no timeout"

pkg/cli/rpc_node_shutdown.go line 95 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

I'm not sure why we have these additional 10 seconds here. Also, should we instead log using the larger value {minWait + 10 * time.Second}?

it's basically just a little bit of buffer. yeah i will fix that log message

pkg/cmd/roachtest/tests/drain.go line 258 at r2 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Is this code reachable? If err == nil, then testutils.IsError(err, "(query execution canceled|server is shutting down|connection reset by peer|unexpected EOF)") is false, and it will just reach return nil.

whoops, i shouldn't have the !. nice catch

Release note (cli change): The --drain-wait argument for the `drain` command will be automatically increased if the command detects that it is smaller than the sum of server.shutdown.drain_wait, server.shutdown.connection_wait, twice server.shutdown.query_wait, and server.shutdown.lease_transfer_wait. This recommendation was already documented, but now the advice will be applied automatically. If the --drain-wait argument is 0, then no timeout is used and no adjustment is made.

rafiss · 2023-03-13T22:16:13Z

commit #2 i can explain: it just moves that check to the end of the function, which is after the state transitions occur:

cockroach/pkg/sql/conn_executor.go

Line 2185 in 2d63378

// If an event was generated, feed it to the state machine.

i think you're able to review it :)

knz

Like Jane, i looked at 1st and 3rd commit, and the 2nd one got me a bit confused. Maybe more words to explain what is going on in the commit message would help.

Otherwise i really like this!

Reviewed 5 of 5 files at r3, 1 of 1 files at r4, 4 of 4 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @rafiss, @srosenberg, and @ZhouXing19)

pkg/sql/conn_executor.go line 2131 at r4 (raw file):

			// a Sync is processed.
			defer func() {
				if ex.idleConn() {

Where is the res.Close call moved to?

This fixes a minor bug in which the connection would not get closed at the earliest possible time during server shutdown.

The connection is supposed to be closed as soon as we handle a Sync message when the conn_executor is in the draining state and not in a transaction. Since the transaction state was checked before state transitions occurred, this would cause the connection to remain open for an extra bit of time.

This was particularly a problem because the Sync message is also the command that auto-commits an implicit transaction. So before this commit, it was actually impossible for the check to work as it was supposed to. Now we check the txn state after state transitions occur.

Release note: None
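The ordering issue described in this commit message can be shown with a toy model. Everything here (`toyConn`, `handleSync`, the two check functions) is hypothetical and far simpler than the real connExecutor state machine; it only shows why reading the transaction state before the Sync transition can never observe "not in a transaction" for an implicit transaction.

```go
package main

import "fmt"

// txnState is a toy model of transaction state, not the real state machine.
type txnState int

const (
	noTxn txnState = iota
	openTxn
)

// toyConn models just enough of a connection to show the ordering issue.
type toyConn struct {
	state    txnState
	draining bool
}

// handleSync applies the Sync message, which auto-commits an implicit transaction.
func (c *toyConn) handleSync() { c.state = noTxn }

// closeCheckedBefore reads the transaction state before the transition (old order).
func (c *toyConn) closeCheckedBefore() bool {
	shouldClose := c.draining && c.state == noTxn
	c.handleSync()
	return shouldClose
}

// closeCheckedAfter applies the transition first, then reads the state (new order).
func (c *toyConn) closeCheckedAfter() bool {
	c.handleSync()
	return c.draining && c.state == noTxn
}

func main() {
	before := &toyConn{state: openTxn, draining: true}
	after := &toyConn{state: openTxn, draining: true}
	fmt.Println(before.closeCheckedBefore(), after.closeCheckedAfter()) // prints: false true
}
```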

The test now does much more:

- Checks that --drain-wait is automatically increased if it is set lower than the cluster settings require.
- Check that the /health?ready=1 endpoint fails during the drain_wait period.
- Check for the proper error message during the connection_wait phase.
- Check for the proper error message when trying to begin a new query/transaction during the query_wait phase.
- Check that an open transaction is allowed to continue during the query_wait phase.
- Check for the proper error message when a query is canceled during shutdown.

Release note: None

rafiss

i've added details to the commit message that should add more clarity.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @knz, @srosenberg, and @ZhouXing19)

pkg/sql/conn_executor.go line 2131 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Where is the res.Close call moved to?

now we use the Close call that occurs during the normal message handling / state transitions, here:

cockroach/pkg/sql/conn_executor.go

Line 2238 in 2d63378

res.Close(ctx, stateToTxnStatusIndicator(ex.machine.CurState()))

knz

thanks indeed this helps.

Reviewed 4 of 4 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @srosenberg, and @ZhouXing19)

rafiss · 2023-03-13T23:20:15Z

tftr!

bors r+

craig · 2023-03-13T23:20:18Z

👎 Rejected by code reviews

rafiss · 2023-03-14T01:36:09Z

bors r+

.

craig · 2023-03-14T02:42:14Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

rafiss · 2023-03-14T04:23:38Z

bors r+

craig · 2023-03-14T04:23:40Z

Already running a review

craig · 2023-03-14T04:49:22Z

Build failed:

Bazel Essential CI (Cockroach)

rafiss · 2023-03-14T12:40:30Z

bors r+

craig · 2023-03-14T13:10:57Z

Build succeeded:

Bazel Essential CI (Cockroach)

blathers-crl · 2023-03-14T13:11:13Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 18eac87 to blathers/backport-release-22.1-98390: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.

error creating merge commit from d6cdd94 to blathers/backport-release-22.2-98390: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


rafiss requested a review from knz March 10, 2023 18:37

rafiss requested a review from a team as a code owner March 10, 2023 18:37

rafiss requested a review from a team as a code owner March 10, 2023 20:19

rafiss requested review from herkolategan and srosenberg and removed request for a team March 10, 2023 20:19

rafiss force-pushed the auto-compute-drain-wait branch 4 times, most recently from 09da9af to a1f019a Compare March 13, 2023 05:07

rafiss requested a review from a team March 13, 2023 05:07

rafiss requested a review from a team as a code owner March 13, 2023 05:07

rafiss force-pushed the auto-compute-drain-wait branch 3 times, most recently from f06bd91 to 50ad228 Compare March 13, 2023 16:21

rafiss requested a review from ZhouXing19 March 13, 2023 16:29

rafiss force-pushed the auto-compute-drain-wait branch from 50ad228 to d2f3830 Compare March 13, 2023 16:37

rafiss added backport-22.1.x labels Mar 13, 2023

ZhouXing19 previously requested changes Mar 13, 2023

rafiss force-pushed the auto-compute-drain-wait branch from d2f3830 to 3274081 Compare March 13, 2023 21:59

rafiss requested a review from ZhouXing19 March 13, 2023 21:59

knz reviewed Mar 13, 2023

rafiss force-pushed the auto-compute-drain-wait branch from 3274081 to d6cdd94 Compare March 13, 2023 23:00

knz approved these changes Mar 13, 2023

craig bot merged commit 19e5845 into cockroachdb:master Mar 14, 2023

This was referenced Mar 14, 2023

release-22.2: cli: compute --drain-wait based on cluster setting values #98577 (Merged)

release-22.1: cli: compute --drain-wait based on cluster setting values #98578 (Merged)

rafiss deleted the auto-compute-drain-wait branch March 14, 2023 22:15

cockroach-teamcity mentioned this pull request Mar 15, 2023

PR #98390 - cli: compute --drain-wait based on cluster setting values cockroachdb/docs#16494 (Open)
