
cli,server: make quit/drain more friendly to clusters with many ranges #45149

Merged — 13 commits merged into cockroachdb:master from 20200217-drain on Apr 7, 2020

Conversation

knz
Contributor

@knz knz commented Feb 17, 2020

Fixes #43901.
Fixes #43015.
Fixes #46388.

Release justification: fixes for high-priority or high-severity bugs in existing functionality

Table of contents:

Overview of this PR

  1. the first 4 commits remove arbitrary limits that prevented graceful shutdown on clusters with many ranges, and make the remaining limits configurable:

    Additionally, the second of these commits adds:

    • a new command cockroach node drain, which creates new operational flexibility: drain the server but keep the process alive. Going forward, the docs should recommend using only 'node drain' and then stopping the server via a process manager / orchestration (a sketch of this workflow follows the overview below).
    • a new option --drain-wait, valid for both quit and node drain, which makes the client-side timeout configurable.

    (5 release notes, see below)

  2. the two commits after that are code movements to make the subsequent refactoring easier:

    • server: move the drain code to its own file
    • server: split the body of doDrain() into separate functions

    (no release note)

  3. the commit after that fixes #46388 (server: simplify/rationalize the Drain API) by removing unneeded and untested flexibility in the drain RPC:

    • server: simplify/rationalize the Drain API

    (no release note)

  4. the commit after that makes cockroach node drain report the progress of the drain on the terminal, so that an operator can tell when it is safe to stop the process. It also makes cockroach quit print a warning if it detects that some state on the server was not drained successfully prior to shutdown.

    • server, *: provide details about the progress of a drain

    (1 release note, see below)

  5. The commit after that adds a test that verifies that the first two commits from step (1), combined with the facility in (4), make "large" clusters drain gracefully and in predictable ways.

(The reviewers are invited to examine the individual commit messages for more details, rationales and examples.)
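
As a hedged illustration of the workflow that item (1) enables — drain the node without stopping it, then stop the process separately — here is a minimal Go sketch. Only `cockroach node drain` and `--drain-wait` come from this PR; the binary location, the `--insecure` connection flag, and the systemctl unit name are assumptions for the example, not part of the change.

```go
// Minimal sketch (not from this PR): drain first, then stop the process
// separately through a service manager. Flags other than --drain-wait and
// the systemctl unit name are illustrative assumptions.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Step 1: drain the node but keep the process alive.
	drain := exec.Command("cockroach", "node", "drain", "--drain-wait=10m", "--insecure")
	drain.Stdout, drain.Stderr = os.Stdout, os.Stderr
	if err := drain.Run(); err != nil {
		log.Fatalf("drain did not complete cleanly: %v", err)
	}

	// Step 2: only once the drain reports completion, stop the server
	// process via the service manager (hypothetical unit name).
	if err := exec.Command("systemctl", "stop", "cockroach").Run(); err != nil {
		log.Fatalf("stopping the service failed: %v", err)
	}
}
```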

Shutdown sequence

Summary of the shutdown sequence upon cockroach quit - this needs to be documented:

  1. sleep (do nothing) for server.shutdown.drain_wait (def 0s)
  2. wait for disconnect of SQL clients or server.shutdown.query_wait timeout (def 10s)
  3. wait for distSQL flows to stop, with a minimum of 1s and a maximum of server.shutdown.query_wait (def 10s).
    Note: this is the same cluster setting, but a different/successive wait.
  4. actively push range leases away, (NEW) for a maximum duration of server.shutdown.lease_transfer_wait (def 5s).
    (Previously: for a maximum of 10000 attempts or a hard limit of 5 seconds. The number of attempts is not limited any more and the timeout has become configurable.)
  5. repeat steps 1, 4 until all leases are transferred
  6. stop the process.
    With cockroach node drain, the node process remains running so it can be shut down cleanly via orchestration or a service manager.

SEPARATELY, the commands cockroach quit and cockroach node drain both wait at most --drain-wait (default 1 minute) for steps 1-5 to complete.
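
To make the stacking concrete, here is a small Go sketch that simply adds up the per-phase budgets quoted above, using the defaults from this description; the helper and its output format are illustrative only, not code from the PR.

```go
// Illustrative arithmetic only, using the default budgets quoted above.
package main

import (
	"fmt"
	"time"
)

func main() {
	drainWait := 0 * time.Second         // server.shutdown.drain_wait (step 1)
	queryWait := 10 * time.Second        // server.shutdown.query_wait (steps 2 and 3)
	leaseTransferWait := 5 * time.Second // server.shutdown.lease_transfer_wait (step 4, per round)
	clientDrainWait := 1 * time.Minute   // --drain-wait on quit / node drain (client side)

	onePass := drainWait + queryWait + queryWait + leaseTransferWait
	fmt.Printf("one pass of server-side phases: %s; client gives up after %s\n",
		onePass, clientDrainWait)
}
```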

What it means for "large" deployments

A "large" deployment with many clients and/or many ranges should increase multiple parameters: server.shutdown.query_wait, server.shutdown.lease_transfer_wait and also run cockroach node drain (of possibly quit) with a larger --drain-wait timeout value.

Follow-up work

I have another PR in the works to change the default duration (#46396).

However, I wanted a change that would also work for 19.1/19.2 clusters, because we will want users/customers on these versions to be willing to consider an upgrade to 20.1, and they need a way to gracefully roll over their nodes without the hard timeouts. So this PR provides that, with fewer UX goodies but the option to backport.

A follow-up action to this code PR is a docs PR that documents how the various timeout parameters stack on top of each other. I have reverse-engineered the code into a timeline (see individual commit messages) and this could be rendered in a more beautiful diagram.

List of release notes

Release note (general change): Prior to this patch, the phase of
server shutdown responsible for range lease transfers to other nodes
had a hard timeout of 5 seconds. This patch makes this timeout
configurable via the new cluster setting
server.shutdown.lease_transfer_wait.

Release note (backward incompatible change): Prior to this patch, the
phase of server shutdown responsible for range lease transfers to
other nodes would give up after 10000 attempts of transferring replica
leases away, regardless of the value of
server.shutdown.lease_transfer_wait. The limit of 10000 attempts has
been removed, so that now only the maximum duration
server.shutdown.lease_transfer_wait applies.

Release note (backward-incompatible change): The textual error and
warning messages displayed by cockroach quit under various
circumstances have been updated. Meanwhile, the message "ok" remains
as the indicator that the operation has likely succeeded.

Release note (cli change): The time that cockroach quit waits
client-side for the node to drain (remove existing clients and push
range leases away) is now configurable via the command-line flag
--drain-wait. Note that server-side timeouts also apply separately;
check the server.shutdown.* cluster settings for details.

Release note (cli change): It is now possible to drain a node without
shutting down the process, using cockroach node drain. This makes it
easier to integrate with service managers and orchestration: it now
becomes safe to issue cockroach node drain and then separately stop
the service via a process manager or orchestrator. Without this new
mode, there is a risk of misconfiguring the service manager so that it
auto-restarts the node after it shuts down via quit, in a way that is
surprising or unwanted. The new command node drain also
recognizes the new --drain-wait flag.

Release note (cli change): The commands cockroach quit and
cockroach node drain now report a "work remaining" metric on their
standard error stream. The value reduces until it reaches 0 to
indicate that the graceful shutdown has completed server-side. An
operator can now rely on cockroach node drain to obtain confidence
of a graceful shutdown prior to terminating the server process.

Release note (backward-incompatible change): cockroach quit now
prints out progress details on its standard error stream, even when
--logtostderr is not specified. Previously, nothing was printed on
standard error. Scripts that wish to ignore this output can redirect
the stderr stream.

@knz knz requested a review from tbg February 17, 2020 22:01
@knz knz requested a review from a team as a code owner February 17, 2020 22:01
@cockroach-teamcity
Member

This change is Reviewable

@knz
Contributor Author

knz commented Feb 18, 2020

cc @ajwerner if you could help with the first pass of reviews that would be swell.

Contributor

@ajwerner ajwerner left a comment


Reviewed 2 of 2 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cli/start.go, line 1416 at r2 (raw file):

	// via --drain-wait.
	var drainTimeoutCh <-chan time.Time
	if quitCtx.drainWait == 0 {

Should this be > 0?


pkg/cli/start.go, line 1419 at r2 (raw file):

		drainTimeoutCh = time.After(quitCtx.drainWait)
	} else {
		drainTimeoutCh = make(chan time.Time)

nit: a nil channel blocks forever on recv, no need to allocate this
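
For context, here is a minimal self-contained sketch of the pattern these two comments point at (not the exact code merged in this PR; waitForDrain and drainDoneCh are hypothetical names): only arm the timeout when the wait is positive, and otherwise leave the channel nil so that receiving from it blocks forever.

```go
// Sketch of the timeout pattern under discussion, assuming hypothetical
// names; drainDoneCh stands in for the real drain-completion signal.
package main

import (
	"fmt"
	"time"
)

func waitForDrain(drainWait time.Duration, drainDoneCh <-chan struct{}) error {
	var drainTimeoutCh <-chan time.Time // nil channel: recv blocks forever, i.e. no timeout
	if drainWait > 0 {
		drainTimeoutCh = time.After(drainWait)
	}
	select {
	case <-drainTimeoutCh:
		return fmt.Errorf("drain did not complete within %s", drainWait)
	case <-drainDoneCh:
		return nil
	}
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(10 * time.Millisecond); close(done) }()
	fmt.Println(waitForDrain(time.Second, done)) // prints <nil>
}
```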

Member

@tbg tbg left a comment


The default remains at 1 minute.

Is this a good default? I think we want to have a large default, if not outright unlimited, the thinking being that some kind of supervisor should be in charge of enforcing a timeout. But maybe we should discuss and make that change separately.

Code looks good mod existing comments, just needs tests.

Reviewed 2 of 2 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/cli/start.go, line 1275 at r2 (raw file):

connections, and finally pushes range leases onto other nodes,
subject to various timeout parameters configurable via
cluster settings. After the first stage completes,

I'm on the fence about --only-drain. How is a node that has drained portrayed in the admin ui? I'm worried that it will look mostly live, but it is not accepting any leases, and so operators using --drain-wait might get themselves into some amount of trouble if they forget to then actually terminate the node.


pkg/cli/start.go, line 1396 at r2 (raw file):

		for {
			if _, err = stream.Recv(); err != nil {
				if grpcutil.IsClosedConnection(err) {

Isn't it a problem, if the server crashes while undraining? This looks like it's OK with the server shutting down but we're not. We're undraining.


pkg/cli/start.go, line 1406 at r2 (raw file):

	// If --decommission was passed, perform the decommission as first
	// step.
	if quitCtx.serverDecommission {

Reminds me that we (should) want to deprecate --decommission. But it's probably not worth it.

@knz knz force-pushed the 20200217-drain branch from 40cf4b2 to c8681ed Compare March 20, 2020 19:01
Contributor Author

@knz knz left a comment


I just rebased this and addressed the most egregious correctness problems. I'm now going to iterate on this once more regarding the default delay.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @tbg)


pkg/cli/start.go, line 1396 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't it a problem, if the server crashes while undraining? This looks like it's OK with the server shutting down but we're not. We're undraining.

Oh yes you're right. We're not expecting a conn close here.


pkg/cli/start.go, line 1406 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Reminds me that we (should) want to deprecate --decommission. But it's probably not worth it.

It's deprecated now.


pkg/cli/start.go, line 1416 at r2 (raw file):

Previously, ajwerner wrote…

Should this be > 0?

oops. yes.


pkg/cli/start.go, line 1419 at r2 (raw file):

Previously, ajwerner wrote…

nit: a nil channel blocks forever on recv, no need to allocate this

Thanks.

Contributor

@ajwerner ajwerner left a comment


Sorry I let this slip under my radar for so long. Code :lgtm:. I sympathize with Tobi's concern about our UX around observing the draining state.

Seems that the acceptance tests need some updating.

Reviewed 2 of 9 files at r3, 5 of 6 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)


pkg/cli/start.go, line 1275 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm on the fence about --only-drain. How is a node that has drained portrayed in the admin ui? I'm worried that it will look mostly live, but it is not accepting any leases, and so operators using --drain-wait might get themselves into some amount of trouble if they forget to then actually terminate the node.

This seems like a reasonable concern. We could add a new node state but it feels like a pretty heavy lift. All that being said, it feels like a draining state even without only-drain would be valuable as draining may take a long time.

I'd be okay with a node in only-drain waiting to be quit explicitly appearing the same in the admin UI as a node which is in the process of draining.


pkg/cli/start.go, line 1500 at r4 (raw file):

	}

	if !quitCtx.onlyDrain {

Should we log.Info that we've drained the node but are not exiting in an else block?
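
A tiny sketch of the suggested shape (the standard-library log package stands in here for whatever logger the CLI actually uses; onlyDrain and stopServer are hypothetical stand-ins, not the PR's code):

```go
// Sketch: log when the node has been drained but the process is deliberately
// left running (only-drain mode), instead of falling through silently.
package main

import "log"

func finishQuit(onlyDrain bool, stopServer func() error) error {
	if !onlyDrain {
		return stopServer()
	}
	log.Print("node drained; process left running, stop it via your service manager")
	return nil
}

func main() {
	_ = finishQuit(true, func() error { return nil })
}
```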

@knz knz force-pushed the 20200217-drain branch from c8681ed to 229f14b Compare March 20, 2020 23:09
Contributor Author

@knz knz left a comment


The default remains at 1 minute.

Is this a good default? I think we want to have a large default [...] maybe we should discuss and make that change separately.

I wanted this PR to be back-portable to 19.x for our customers, where we can't make such a backward-incompatible change. But I can see this default changing for 20.x. Will send a separate PR.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)


pkg/cli/start.go, line 1275 at r2 (raw file):

Previously, ajwerner wrote…

This seems like a reasonable concern. We could add a new node state but it feels like a pretty heavy lift. All that being said, it feels like a draining state even without only-drain would be valuable as draining may take a long time.

I'd be okay with a node in only-drain waiting to be quit explicitly appearing the same in the admin UI as a node which is in the process of draining.

The reason for --only-drain is that we need a mode for preparing a node to be shut down by a process/service manager. If we did not have that, then cockroach quit would drain the node and then terminate the process, and the service manager would auto-restart it and make the node live again. That's counterproductive. My vision is that we introduce --only-drain in v20.1, and make it the default in 20.2 or later (with the shutdown bit becoming opt-in instead).

Adding this explanation to the release note.

I originally motivated --undrain as a way to cancel a mistaken cockroach quit and bring a node back up. However, further testing reveals this was a bad idea: at that point, several background tasks (SQL stats collection, event log reader, etc.) were stalled, and letting them resume creates a burst of activity and a catch-up that probably violates the periodicity assumptions of these services. Until we can make --undrain cancel all this backed-up work, I don't think we should bring a node that is shutting down back from the brink of extinction. So I'm removing that code. (But --only-drain remains, for the reason above.)


pkg/cli/start.go, line 1500 at r4 (raw file):

Previously, ajwerner wrote…

Should we log.Info that we've drained the node but are not exiting in an else block?

this is the client code. By default the info logs are not displayed. But I see what you mean.
Added logging here and also server-side.

@knz
Contributor Author

knz commented Mar 20, 2020

Ok so the functionality here looks good but I have zero idea how to unit test this properly.

@nvanbenschoten had suggested that I look at some existing tests that check that leaseholders get pushed away upon server shutdown. However, I've been searching a bit and I cannot find such tests.

In fact, I cannot find any existing test for the Drain() API or any of the per-component SetDraining methods that it uses. It looks like I'd need to program that testing from scratch.

@knz
Contributor Author

knz commented Mar 20, 2020

I could create an ad-hoc test for --only-drain that checks that the server does not in fact shut down, by asserting that it can continue to receive RPCs after it has finished processing the drain without shutting down. But that's not going to test the new timeout behavior. The full testing is subject to my last comment above.

@knz
Contributor Author

knz commented Mar 20, 2020

also FWIW:

Seems that the acceptance tests need some updating.

No they don't. The changes here are behavior-neutral in the default configuration. But I had made a mistake in the code and CI caught it :) - it is now corrected.

@knz knz force-pushed the 20200217-drain branch from 229f14b to 4b305cb Compare March 21, 2020 15:39
@knz knz added the docs-todo label Mar 21, 2020
@knz knz force-pushed the 20200217-drain branch from 4b305cb to 0185042 Compare March 21, 2020 16:05
@jordanlewis jordanlewis mentioned this pull request Mar 21, 2020
83 tasks
@knz knz force-pushed the 20200217-drain branch 2 times, most recently from e3a176c to 1013730 Compare March 30, 2020 15:08
@tbg tbg self-requested a review March 30, 2020 15:13
@knz knz force-pushed the 20200217-drain branch 2 times, most recently from 043b355 to c68ad11 Compare March 30, 2020 18:12
@knz
Contributor Author

knz commented Apr 7, 2020

I added a commit at the end with a new roachtest (quit-all-nodes) that asserts that the first 3 nodes out of 5 can quit gracefully, and that the 2 remaining nodes time out on a graceful drain.

@knz
Contributor Author

knz commented Apr 7, 2020

I have rebased and squashed all the fixups. Planning to merge this after Teamcity thinks it's good.

@knz
Contributor Author

knz commented Apr 7, 2020

yay it's green

bors r=tbg,ajwerner

@tbg
Member

tbg commented Apr 7, 2020 via email

@craig
Contributor

craig bot commented Apr 7, 2020

Build succeeded

@craig craig bot merged commit 63552e2 into cockroachdb:master Apr 7, 2020
@knz knz deleted the 20200217-drain branch April 7, 2020 16:02
@nvanbenschoten
Member

@knz do you mind reminding me what the plan is for backporting this PR to the release-20.1 branch? We currently have three roachtests that are regularly failing on that branch (#47254, #47189, #47328). If we're not intending to backport this soon, we should temporarily bump their MinVersion.

@knz
Contributor Author

knz commented Apr 16, 2020 via email

@nvanbenschoten
Member

do we need to do more?

If we're planning on backporting shortly, then no, this is good enough. Thanks for the update.
