
cli,server: make quit/drain more friendly to clusters with many ranges #45149

Merged — 13 commits merged into cockroachdb:master from 20200217-drain on Apr 7, 2020

Conversation

knz
Contributor

@knz knz commented Feb 17, 2020

Fixes #43901.
Fixes #43015.
Fixes #46388.

Release justification: fixes for high-priority or high-severity bugs in existing functionality

Table of contents:

Overview of this PR

  1. the first 4 commits remove arbitrary limits that prevented graceful shutdown on clusters with many ranges, and make the remaining limits configurable:

    Additionally, the second of these commits adds:

    • a new command cockroach node drain, which creates new operational flexibility: drain the server but keep the process alive. Going forward, the docs should recommend using only 'node drain' and then stopping the server via a process manager / orchestration (a sketch of this workflow follows the overview below).
    • a new option --drain-wait, valid for both quit and node drain, which makes the client-side timeout configurable.

    (5 release notes, see below)

  2. the two commits after that are code movements to make the subsequent refactoring easier:

    • server: move the drain code to its own file
    • server: split the body of doDrain() into separate functions

    (no release note)

  3. the commit after that fixes #46388 (server: simplify/rationalize the Drain API) by removing unneeded and untested flexibility in the drain RPC:

    • server: simplify/rationalize the Drain API

    (no release note)

  4. the commit after that makes cockroach node drain report the progress of the drain on the terminal, so that an operator can tell when it is safe to stop the process. It also makes cockroach quit print a warning if it detects that some state on the server was not drained successfully prior to shutdown.

    • server, *: provide details about the progress of a drain

    (1 release note, see below)

  5. The commit after that adds a test that verifies that the first two commits from step (1), combined with the facility in (4), make "large" clusters drain gracefully and in predictable ways.

(The reviewers are invited to examine the individual commit messages for more details, rationales and examples.)
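
As a hedged illustration of the workflow that item (1) enables — drain the node without stopping it, then stop the process separately — here is a minimal Go sketch. Only `cockroach node drain` and `--drain-wait` come from this PR; the binary location, the `--insecure` connection flag, and the systemctl unit name are assumptions for the example, not part of the change.

```go
// Minimal sketch (not from this PR): drain first, then stop the process
// separately through a service manager. Flags other than --drain-wait and
// the systemctl unit name are illustrative assumptions.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Step 1: drain the node but keep the process alive.
	drain := exec.Command("cockroach", "node", "drain", "--drain-wait=10m", "--insecure")
	drain.Stdout, drain.Stderr = os.Stdout, os.Stderr
	if err := drain.Run(); err != nil {
		log.Fatalf("drain did not complete cleanly: %v", err)
	}

	// Step 2: only once the drain reports completion, stop the server
	// process via the service manager (hypothetical unit name).
	if err := exec.Command("systemctl", "stop", "cockroach").Run(); err != nil {
		log.Fatalf("stopping the service failed: %v", err)
	}
}
```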

Shutdown sequence

Summary of the shutdown sequence upon cockroach quit - this needs to be documented:

  1. sleep (do nothing) for server.shutdown.drain_wait (def 0s)
  2. wait for disconnect of SQL clients or server.shutdown.query_wait timeout (def 10s)
  3. wait for distSQL flows to stop, with a minimum of 1s and a maximum of server.shutdown.query_wait (def 10s).
    Note: this is the same cluster setting, but a different/successive wait.
  4. actively push range leases away, (NEW) for a maximum duration of server.shutdown.lease_transfer_wait (def 5s).
    (Previously: for a maximum of 10000 attempts or a hard limit of 5 seconds. The number of attempts is not limited any more and the timeout has become configurable.)
  5. repeat steps 1, 4 until all leases are transferred
  6. stop the process.
    With cockroach node drain, the node process remains running so it can be shut down cleanly via orchestration or a service manager.

SEPARATELY, the commands cockroach quit and cockroach node drain both wait at most --drain-wait (default 1 minute) for steps 1-5 to complete.
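
To make the stacking concrete, here is a small Go sketch that simply adds up the per-phase budgets quoted above, using the defaults from this description; the helper and its output format are illustrative only, not code from the PR.

```go
// Illustrative arithmetic only, using the default budgets quoted above.
package main

import (
	"fmt"
	"time"
)

func main() {
	drainWait := 0 * time.Second         // server.shutdown.drain_wait (step 1)
	queryWait := 10 * time.Second        // server.shutdown.query_wait (steps 2 and 3)
	leaseTransferWait := 5 * time.Second // server.shutdown.lease_transfer_wait (step 4, per round)
	clientDrainWait := 1 * time.Minute   // --drain-wait on quit / node drain (client side)

	onePass := drainWait + queryWait + queryWait + leaseTransferWait
	fmt.Printf("one pass of server-side phases: %s; client gives up after %s\n",
		onePass, clientDrainWait)
}
```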

What it means for "large" deployments

A "large" deployment with many clients and/or many ranges should increase multiple parameters: server.shutdown.query_wait, server.shutdown.lease_transfer_wait and also run cockroach node drain (of possibly quit) with a larger --drain-wait timeout value.

Follow-up work

I have another PR in the works to change the default duration (#46396).

However, I wanted a change that would also work for 19.1/19.2 clusters, because we will want users/customers on these versions to be willing to consider an upgrade to 20.1, and they need a way to gracefully roll over their nodes without the hard timeouts. So this PR provides that, with fewer UX goodies but the option to backport.

A follow-up action to this code PR is a docs PR that documents how the various timeout parameters stack on top of each other. I have reverse-engineered the code into a timeline (see individual commit messages) and this could be rendered in a more beautiful diagram.

List of release notes

Release note (general change): Prior to this patch, the phase of
server shutdown responsible for range lease transfers to other nodes
had a hard timeout of 5 seconds. This patch makes this timeout
configurable via the new cluster setting
server.shutdown.lease_transfer_wait.

Release note (backward incompatible change): Prior to this patch, the
phase of server shutdown responsible for range lease transfers to
other nodes would give up after 10000 attempts of transferring replica
leases away, regardless of the value of
server.shutdown.lease_transfer_wait. The limit of 10000 attempts has
been removed, so that now only the maximum duration
server.shutdown.lease_transfer_wait applies.

Release note (backward-incompatible change): The textual error and
warning messages displayed by cockroach quit under various
circumstances have been updated. Meanwhile, the message "ok" remains
as the indicator that the operation has likely succeeded.

Release note (cli change): The time that cockroach quit waits
client-side for the node to drain (remove existing clients and push
range leases away) is now configurable via the command-line flag
--drain-wait. Note that server-side timeouts also apply separately;
check the server.shutdown.* cluster settings for details.

Release note (cli change): It is now possible to drain a node without
shutting down the process, using cockroach node drain. This makes it
easier to integrate with service managers and orchestration: it now
becomes safe to issue cockroach node drain and then separately stop
the service via a process manager or orchestrator. Without this new
mode, there is a risk of misconfiguring the service manager so that it
auto-restarts the node after it shuts down via quit, in a way that is
surprising or unwanted. The new command node drain also
recognizes the new --drain-wait flag.

Release note (cli change): The commands cockroach quit and
cockroach node drain now report a "work remaining" metric on their
standard error stream. The value reduces until it reaches 0 to
indicate that the graceful shutdown has completed server-side. An
operator can now rely on cockroach node drain to obtain confidence
of a graceful shutdown prior to terminating the server process.

Release note (backward-incompatible change): cockroach quit now
prints out progress details on its standard error stream, even when
--logtostderr is not specified. Previously, nothing was printed on
standard error. Scripts that wish to ignore this output can redirect
the stderr stream.

@knz knz requested a review from tbg February 17, 2020 22:01
@knz knz requested a review from a team as a code owner February 17, 2020 22:01
@cockroach-teamcity
Member

This change is Reviewable

@knz
Contributor Author

knz commented Feb 18, 2020

cc @ajwerner if you could help with the first pass of reviews that would be swell.

Contributor

@ajwerner ajwerner left a comment


Reviewed 2 of 2 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cli/start.go, line 1416 at r2 (raw file):

	// via --drain-wait.
	var drainTimeoutCh <-chan time.Time
	if quitCtx.drainWait == 0 {

Should this be > 0?


pkg/cli/start.go, line 1419 at r2 (raw file):

		drainTimeoutCh = time.After(quitCtx.drainWait)
	} else {
		drainTimeoutCh = make(chan time.Time)

nit: a nil channel blocks forever on recv, no need to allocate this
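
For context, here is a minimal self-contained sketch of the pattern these two comments point at (not the exact code merged in this PR; waitForDrain and drainDoneCh are hypothetical names): only arm the timeout when the wait is positive, and otherwise leave the channel nil so that receiving from it blocks forever.

```go
// Sketch of the timeout pattern under discussion, assuming hypothetical
// names; drainDoneCh stands in for the real drain-completion signal.
package main

import (
	"fmt"
	"time"
)

func waitForDrain(drainWait time.Duration, drainDoneCh <-chan struct{}) error {
	var drainTimeoutCh <-chan time.Time // nil channel: recv blocks forever, i.e. no timeout
	if drainWait > 0 {
		drainTimeoutCh = time.After(drainWait)
	}
	select {
	case <-drainTimeoutCh:
		return fmt.Errorf("drain did not complete within %s", drainWait)
	case <-drainDoneCh:
		return nil
	}
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(10 * time.Millisecond); close(done) }()
	fmt.Println(waitForDrain(time.Second, done)) // prints <nil>
}
```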

Member

@tbg tbg left a comment


The default remains at 1 minute.

Is this a good default? I think we want to have a large default, if not outright unlimited, the thinking being that some kind of supervisor should be in charge of enforcing a timeout. But maybe we should discuss and make that change separately.

Code looks good mod existing comments, just needs tests.

Reviewed 2 of 2 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/cli/start.go, line 1275 at r2 (raw file):

connections, and finally pushes range leases onto other nodes,
subject to various timeout parameters configurable via
cluster settings. After the first stage completes,

I'm on the fence about --only-drain. How is a node that has drained portrayed in the admin ui? I'm worried that it will look mostly live, but it is not accepting any leases, and so operators using --drain-wait might get themselves into some amount of trouble if they forget to then actually terminate the node.


pkg/cli/start.go, line 1396 at r2 (raw file):

		for {
			if _, err = stream.Recv(); err != nil {
				if grpcutil.IsClosedConnection(err) {

Isn't it a problem, if the server crashes while undraining? This looks like it's OK with the server shutting down but we're not. We're undraining.


pkg/cli/start.go, line 1406 at r2 (raw file):

	// If --decommission was passed, perform the decommission as first
	// step.
	if quitCtx.serverDecommission {

Reminds me that we (should) want to deprecate --decommission. But it's probably not worth it.

@knz knz force-pushed the 20200217-drain branch from 40cf4b2 to c8681ed Compare March 20, 2020 19:01
Contributor Author

@knz knz left a comment


I just rebased this and addressed the most egregious correctness problems. I'm now going to iterate on this once more regarding the default delay.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @tbg)


pkg/cli/start.go, line 1396 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't it a problem, if the server crashes while undraining? This looks like it's OK with the server shutting down but we're not. We're undraining.

Oh yes you're right. We're not expecting a conn close here.


pkg/cli/start.go, line 1406 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Reminds me that we (should) want to deprecate --decommission. But it's probably not worth it.

It's deprecated now.


pkg/cli/start.go, line 1416 at r2 (raw file):

Previously, ajwerner wrote…

Should this be > 0?

oops. yes.


pkg/cli/start.go, line 1419 at r2 (raw file):

Previously, ajwerner wrote…

nit: a nil channel blocks forever on recv, no need to allocate this

Thanks.

Contributor

@ajwerner ajwerner left a comment


Sorry I let this slip under my radar for so long. Code :lgtm:. I sympathize with Tobi's concern about our UX around observing the draining state.

Seems that the acceptance tests need some updating.

Reviewed 2 of 9 files at r3, 5 of 6 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)


pkg/cli/start.go, line 1275 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm on the fence about --only-drain. How is a node that has drained portrayed in the admin ui? I'm worried that it will look mostly live, but it is not accepting any leases, and so operators using --drain-wait might get themselves into some amount of trouble if they forget to then actually terminate the node.

This seems like a reasonable concern. We could add a new node state but it feels like a pretty heavy lift. All that being said, it feels like a draining state even without only-drain would be valuable as draining may take a long time.

I'd be okay with a node in only-drain waiting to be quit explicitly appearing the same in the admin UI as a node which is in the process of draining.


pkg/cli/start.go, line 1500 at r4 (raw file):

	}

	if !quitCtx.onlyDrain {

Should we log.Info that we've drained the node but are not exiting in an else block?
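
A tiny sketch of the suggested shape (the standard-library log package stands in here for whatever logger the CLI actually uses; onlyDrain and stopServer are hypothetical stand-ins, not the PR's code):

```go
// Sketch: log when the node has been drained but the process is deliberately
// left running (only-drain mode), instead of falling through silently.
package main

import "log"

func finishQuit(onlyDrain bool, stopServer func() error) error {
	if !onlyDrain {
		return stopServer()
	}
	log.Print("node drained; process left running, stop it via your service manager")
	return nil
}

func main() {
	_ = finishQuit(true, func() error { return nil })
}
```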

@knz knz force-pushed the 20200217-drain branch from c8681ed to 229f14b Compare March 20, 2020 23:09
Contributor Author

@knz knz left a comment


The default remains at 1 minute.

Is this a good default? I think we want to have a large default [...] maybe we should discuss and make that change separately.

I wanted this PR to be back-portable to 19.x for our customers, where we can't make such a backward-incompatible change. But I can see this default changing for 20.x. Will send a separate PR.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)


pkg/cli/start.go, line 1275 at r2 (raw file):

Previously, ajwerner wrote…

This seems like a reasonable concern. We could add a new node state but it feels like a pretty heavy lift. All that being said, it feels like a draining state even without only-drain would be valuable as draining may take a long time.

I'd be okay with a node in only-drain waiting to be quit explicitly appearing the same in the admin UI as a node which is in the process of draining.

The reason for --only-drain is that we need a mode for preparing a node to be shut down by a process/service manager. If we did not have that, then cockroach quit would drain the node and then terminate the process, and the service manager would auto-restart it and make the node live again. That's counterproductive. My vision is that we introduce --only-drain in v20.1, and make it the default in 20.2 or later (with the shutdown bit becoming opt-in instead).

Adding this explanation to the release note.

I originally motivated --undrain as a way to cancel a mistaken cockroach quit and bring a node back up. However, further testing reveals this was a bad idea: at that point, several background tasks (SQL stats collection, event log reader, etc.) were stalled, and letting them resume creates a burst of activity and a catch-up that probably violates the periodicity assumptions of these services. Until we can make --undrain cancel all this backed-up work, I don't think we should bring a node that is shutting down back from the brink of extinction. So I'm removing that code. (But --only-drain remains, for the reason above.)


pkg/cli/start.go, line 1500 at r4 (raw file):

Previously, ajwerner wrote…

Should we log.Info that we've drained the node but are not exiting in an else block?

this is the client code. By default the info logs are not displayed. But I see what you mean.
Added logging here and also server-side.

@knz
Contributor Author

knz commented Mar 20, 2020

Ok so the functionality here looks good but I have zero idea how to unit test this properly.

@nvanbenschoten had suggested that I look at some existing tests that check that leaseholders get pushed away upon server shutdown. However, I've been searching a bit and I cannot find such tests.

In fact, I cannot find any existing test for the Drain() API or any of the per-component SetDraining methods that it uses. It looks like I'd need to program that testing from scratch.

@knz
Contributor Author

knz commented Mar 20, 2020

I could create an ad-hoc test for --only-drain that checks that the server does not in fact shut down, by asserting that it can continue to receive RPCs after it has finished processing the drain without shutting down. But that's not going to test the new timeout behavior. The full testing is subject to my last comment above.

@knz
Contributor Author

knz commented Mar 20, 2020

also FWIW:

Seems that the acceptance tests need some updating.

No they don't. The changes here are behavior-neutral in the default configuration. But I had made a mistake in the code and CI caught it :) - it is now corrected.

@knz knz force-pushed the 20200217-drain branch from 229f14b to 4b305cb Compare March 21, 2020 15:39
@knz knz added the docs-todo label Mar 21, 2020
@knz knz force-pushed the 20200217-drain branch from 4b305cb to 0185042 Compare March 21, 2020 16:05
@jordanlewis jordanlewis mentioned this pull request Mar 21, 2020
83 tasks
@knz knz force-pushed the 20200217-drain branch 2 times, most recently from e3a176c to 1013730 Compare March 30, 2020 15:08
@tbg tbg self-requested a review March 30, 2020 15:13
@knz knz force-pushed the 20200217-drain branch 2 times, most recently from 043b355 to c68ad11 Compare March 30, 2020 18:12
@knz
Contributor Author

knz commented Apr 7, 2020

I added a commit at the end with a new roachtest (quit-all-nodes) that asserts that the first 3 nodes out of 5 can quit gracefully, and that the 2 remaining nodes time out on a graceful drain.

@knz
Contributor Author

knz commented Apr 7, 2020

I have rebased and squashed all the fixups. Planning to merge this after Teamcity thinks it's good.

@knz
Contributor Author

knz commented Apr 7, 2020

yay it's green

bors r=tbg,ajwerner

@tbg
Member

tbg commented Apr 7, 2020 via email

@craig
Contributor

craig bot commented Apr 7, 2020

Build succeeded

@craig craig bot merged commit 63552e2 into cockroachdb:master Apr 7, 2020
@knz knz deleted the 20200217-drain branch April 7, 2020 16:02
@nvanbenschoten
Member

@knz do you mind reminding me what the plan is for backporting this PR to the release-20.1 branch? We currently have three roachtests that are regularly failing on that branch (#47254, #47189, #47328). If we're not intending to backport this soon, we should temporarily bump their MinVersion.

@knz
Contributor Author

knz commented Apr 16, 2020 via email

@nvanbenschoten
Member

do we need to do more?

If we're planning on backporting shortly, then no, this is good enough. Thanks for the update.
