
release-20.1: misc drain/quit improvements #47692

Merged
merged 13 commits on May 2, 2020

Conversation

knz
Contributor

@knz knz commented Apr 20, 2020

cc @cockroachdb/release


Fixes #47189
Fixes #47692

Planning to hold off on merging this until the final release SHA is selected.

@knz knz requested a review from tbg April 20, 2020 11:24
@knz knz requested a review from a team as a code owner April 20, 2020 11:24
@cockroach-teamcity
Member

This change is Reviewable

knz added 13 commits May 2, 2020 15:16
Release justification: fixes for high-priority or high-severity bugs in existing functionality

This is one step in a series of changes to make `cockroach quit`
friendlier to clusters containing many ranges.

For reference, the timeline of a drain is shown below. This patch makes
the delay in (D) configurable, where it was not configurable
previously. A separate commit will look into the hard delay at (E).

This patch uses a cluster setting instead of a more clever approach
based on extra parameters passed via the client request, or some other
cleverness based on the number of active ranges, so as to keep the
patch lightweight and readily backportable. (Customers want this in
19.1/19.2 to enable a no-downtime upgrade to 20.1.)

```
adminServer server pgwire distsql storage
  -
  |
  ----------->o                           server.Undrain()
  ----------->-                           server.Drain() -> doDrain()
  .           |
  .           -                           CLIENT drain starts
  .           -                           grpc.setMode(draining)
  .           .
  .           =  (A)                      sleep server.shutdown.drain_wait (def 0s)
  .           .
  .           ------>-                    start draining pgwire clients
  .           .      |                    wait for drain of clients or server.shutdown.query_wait timeout (def 10s)
  .           .      .  (B)
  .           .      -                    cancel remaining SQL sessions
  .           -<------
  .           |
  .           ------------>-              ds.setDraining()
  .           .            |              start draining distsql flows
  .           .            .              wait for flows to drain, with minimum 1s and max server.shutdown.query_wait (def 10s)
  .           .            .  (C)
  .           .            -              cancel remaining flows
  .           -<------------
  .           |
  .           -                           drain SQL table leases (performs KV requests)
  .           |
  .           -                           LEASES drain starts
  .           |
  .           ---------------------->o    nodeLiveness.SetDraining()
  .           ---------------------->-    for each Store s, SetDraining()
  .           .                      |
  .           .                      -    attempt transferring 100 leases
  .           .                      |
  .           .                      -    with exponential backoff (20ms->1s) repeat attempt to transfer 100 leases
  .           .                      .(D) try for at most server.shutdown.lease_transfer_wait (def 5s) <- NOW CONFIGURABLE
  .           .                      .
  .           -<----------------------    stop attempting lease transfers
  -<-----------
  -                                       stop here if client did not request Shutdown
  |
  -                                       grpc.Stop, stopper.Stop()
  .                                       start of 10 second timeout (non-configurable)
  . (E)
  -                                       os.Exit after timeout expires
```
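
To make step (D) concrete, here is a minimal Go sketch of that retry loop, using only the standard library; `transferLeases` and `drainLeases` are hypothetical stand-ins for illustration, not CockroachDB's actual identifiers:

```
package drain

import (
	"context"
	"fmt"
	"time"
)

// transferLeases is a hypothetical stand-in for the per-store batch
// transfer; it reports how many leases are still held locally.
func transferLeases(ctx context.Context, batchSize int) (remaining int) {
	// ... attempt to move up to batchSize leases to other nodes ...
	return 0
}

// drainLeases mirrors step (D): repeat 100-lease transfer attempts with
// exponential backoff (20ms -> 1s), bounded only by the configurable
// server.shutdown.lease_transfer_wait duration.
func drainLeases(ctx context.Context, leaseTransferWait time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, leaseTransferWait)
	defer cancel()

	backoff := 20 * time.Millisecond
	for {
		if remaining := transferLeases(ctx, 100); remaining == 0 {
			return nil // all leases transferred away
		}
		select {
		case <-ctx.Done():
			// Time budget exhausted: stop attempting transfers. There is
			// no attempt-count limit, only this duration.
			return fmt.Errorf("lease transfers incomplete after %s", leaseTransferWait)
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > time.Second {
			backoff = time.Second
		}
	}
}
```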

Release note (general change): Prior to this patch, the phase of
server shutdown responsible for range lease transfers to other nodes
had a hard timeout of 5 seconds. This patch makes this timeout
configurable via the new cluster setting
`server.shutdown.lease_transfer_wait`.

Release note (backward incompatible change): Prior to this patch, the
phase of server shutdown responsible for range lease transfers to
other nodes would give up after 10000 attempts to transfer replica
leases away, regardless of the value of
`server.shutdown.lease_transfer_wait`. The limit of 10000 attempts has
been removed, so that now only the maximum duration
`server.shutdown.lease_transfer_wait` applies.
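
For illustration, a sketch of how the new duration setting might be registered; the `settings.RegisterDurationSetting` helper and its signature here are assumptions modeled on CockroachDB's settings package and may differ in detail from the actual tree:

```
package server

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// leaseTransferWait is the new cluster setting. The registration helper
// and its signature are assumptions; details may differ per release.
var leaseTransferWait = settings.RegisterDurationSetting(
	"server.shutdown.lease_transfer_wait",
	"the amount of time a server waits to transfer range leases away "+
		"before proceeding with the rest of the shutdown process",
	5*time.Second, // the previous hard-coded timeout, now just the default
)
```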
Release justification: fixes for high-priority or high-severity bugs in existing functionality

This is another step in a series of changes to make `cockroach quit`
friendlier to clusters containing many ranges, and to enable
graceful shutdowns when a separate service manager is responsible
for stopping processes.

This patch focuses on the client behavior, without considering the
behavior server-side.

For reference, the timeline of a `quit` call is shown below.

This patch introduces two things:

- it makes the delay in step (X) configurable with `--drain-wait`,
  where it was not configurable previously. The default remains at 1
  minute.

- it makes it possible to drain without quitting with a new client
  command `cockroach node drain`, so that a DBA can look at the admin UI
  and satisfy themselves that the rest of the cluster has picked up the
  load.

This patch reuses the existing Drain() RPC parameters, so that a
`cockroach quit` or `cockroach node drain` client with this patch can
be used with a server running a previous version. This aims to
facilitate upgrades to the latest version on large clusters.

```
quit
 |
 -       start decommission...
 .       wait for decommission to complete (no timeout)
 .
 -       end of decommission.
 |
 -       start draining (no shutdown request)...
 .  (X)  wait for drain to complete or timeout specified with --drain-wait (NEW! def. 1m)
 .
 -       timeout expired or drain completed.
 |
 -       if 'cockroach node drain' (NEW!) was specified, stop here.
 |
 -       send shutdown request  (with no drain request)
 -       if the 1st shutdown fails, try again.
 |
 -       end.
```

Additionally, the patch simplifies the drain/shutdown code as follows:

- it now uses context.WithTimeout / contextutil.RunWithTimeout instead of
  its own bespoke goroutine / time.After / select dance; see the sketch
  after this list. (The previous code was implemented before gRPC
  supported context cancellation.)

- it separates the code that sends a drain request from the code that
  sends a shutdown request, thereby dramatically reducing the number of
  conditionals.

- it provides more friendly error messages in case of timeouts and
  when attempting operations when the cluster has not been initialized
  yet.
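
As referenced in the list above, a minimal sketch of the `contextutil.RunWithTimeout` pattern; `doDrainRPC` is a hypothetical stand-in, and the exact `RunWithTimeout` signature is assumed from the `pkg/util/contextutil` package:

```
package cli

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/contextutil"
)

// doDrainRPC is a hypothetical stand-in for the code that sends one
// Drain request over gRPC.
func doDrainRPC(ctx context.Context) error { return nil }

// drainWithTimeout shows the simplified pattern: the bespoke
// goroutine/time.After/select dance is replaced by RunWithTimeout,
// which cancels the RPC's context once the --drain-wait budget expires.
func drainWithTimeout(ctx context.Context, drainWait time.Duration) error {
	return contextutil.RunWithTimeout(ctx, "drain", drainWait,
		func(ctx context.Context) error {
			return doDrainRPC(ctx)
		})
}
```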

Release note (backward-incompatible change): The textual error and
warning messages displayed by `cockroach quit` under various
circumstances have been updated. Meanwhile, the message "`ok`" remains
as the indicator that the operation has likely succeeded.

Release note (cli change): The time that `cockroach quit` waits
client-side for the node to drain (remove existing clients and push
range leases away) is now configurable via the command-line flag
`--drain-wait`. Note that separate server-side timeouts also apply;
check the `server.shutdown.*` cluster settings for details.

Release note (cli change): It is now possible to drain a node without
shutting down the process, using `cockroach node drain`. This makes it
easier to integrate with service managers and orchestration: it now
becomes safe to issue `cockroach node drain` and then separately stop
the service via a process manager or orchestrator. Without this new
mode, there is a risk of misconfiguring the service manager to
auto-restart the node after it shuts down via `quit`, in a way that's
surprising or unwanted. The new command `node drain` also
recognizes the new `--drain-wait` flag.
Release justification: low risk, high benefit changes to existing functionality

Release note: None
Prior to this patch, the Drain RPC provided a lot of untested
flexibility to its clients: fields on the DrainRequest message
admitted many possible values, of which only a few ever made
sense. This flexibility was not tested and I have reason to
suspect that many combinations of parameters would not
work. See cockroachdb#46388 for
details and a discussion.

This patch removes the untested / unused flexibility and
simplifies the code accordingly:

- for requests:

  - the `Off` field is removed. It had never been used anyway.

  - the `On` field was only ever used with the values `nil` (probe only =
    RPC health check) and `[0, 1]`. It is now renamed to
    `DeprecatedProbeIndicator` and the server now asserts that a non-`nil`
    input is exactly `[0, 1]`. It is still recognized for
    compatibility with `cockroach quit` from previous versions.

  - a new bool field `DoDrain` replaces `On` to indicate whether the
    request should do a graceful drain or not.

- for responses:

  - the `On` field was only ever used with the values `nil` (no
    drain in progress) and `[0, 1]` (drain currently in progress).
    The field is renamed `DeprecatedDrainStatus` for compatibility
    with hypothetical clients from previous versions (AFAIK,
    no client ever checked this field; maybe we could remove it
    altogether).

  - a new bool `IsDraining` reports the current draining status
    of the server.

This patch also introduces a server-side unit test for the Drain RPC,
which was missing previously.
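
To make the compatibility rule concrete, here is a hypothetical Go sketch of how the server might interpret the revised request fields; the trimmed `DrainRequest` struct and the `interpretDrainRequest` helper are illustrations, not the actual generated proto code:

```
package server

import "fmt"

// DrainRequest is a trimmed, hypothetical rendering of the proto
// message after this patch; only the fields discussed above are shown.
type DrainRequest struct {
	// Legacy "On" field: nil (probe only) or exactly [0, 1].
	DeprecatedProbeIndicator []int32
	// New: whether the request should perform a graceful drain.
	DoDrain bool
}

// interpretDrainRequest sketches the server-side compatibility rule: a
// non-nil probe indicator must be exactly [0, 1] (as sent by older
// `cockroach quit` clients) and implies a drain; otherwise DoDrain
// decides.
func interpretDrainRequest(req *DrainRequest) (doDrain bool, err error) {
	if ind := req.DeprecatedProbeIndicator; ind != nil {
		if len(ind) != 2 || ind[0] != 0 || ind[1] != 1 {
			return false, fmt.Errorf("unsupported drain probe indicator: %v", ind)
		}
		return true, nil
	}
	return req.DoDrain, nil
}
```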

Release justification: low risk, high benefit changes to existing functionality

Release note: None
Prior to this patch, it was impossible for the client of the Drain RPC
to know how much was left to do after a Drain completes. In particular,
it was impossible to determine whether a timeout was encountered and
some load could not be shed.

To improve upon this situation, this patch introduces a new
"remaining indicator" field in DrainResponse. This value remains
non-zero whenever the drain still has work to do or has performed
work; its magnitude indicates "how much work" is being performed
during the drain and is specified to decrease across successive
drain invocations. When it reaches 0, the drain is known to have
completed fully.

Example output:

Before:

```
  $ cockroach quit
  ok
```

After:

```
  $ cockroach quit
  node is draining; remaining: 12
  node is draining; remaining: 0 (complete)
  ok
```

Or alternatively:

```
  $ ./cockroach node drain
  node is draining; remaining: 12
  node is draining; remaining: 0 (complete)
  ok
```

As shown in these examples, `node drain` and `quit` now iterate
through drain requests until the remaining indicator reaches zero. The
total time spent across iterations remains bounded by `--drain-wait`
in aggregate (0 = no limit).
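
A minimal Go sketch of that client-side loop, assuming a hypothetical `sendDrain` helper for one Drain RPC round trip:

```
package cli

import (
	"context"
	"fmt"
	"os"
	"time"
)

// sendDrain is a hypothetical stand-in for one Drain RPC round trip;
// the response carries the remaining-work indicator.
func sendDrain(ctx context.Context) (remaining uint64, err error) {
	return 0, nil
}

// drainUntilDone iterates drain requests until the indicator reaches
// zero, with the aggregate time bounded by --drain-wait (0 = no limit).
func drainUntilDone(ctx context.Context, drainWait time.Duration) error {
	if drainWait > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, drainWait)
		defer cancel()
	}
	for {
		remaining, err := sendDrain(ctx)
		if err != nil {
			return err
		}
		fmt.Fprintf(os.Stderr, "node is draining; remaining: %d", remaining)
		if remaining == 0 {
			fmt.Fprintln(os.Stderr, " (complete)")
			return nil
		}
		fmt.Fprintln(os.Stderr)
	}
}
```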

Details are also provided in the log, which can be enabled with an
explicit opt-in `--logtostderr=INFO` or one of the other logging flags.
For example:

```
$ ./cockroach quit --logtostderr=INFO
I200331 23:02:05.146857 84 cli/start.go:1434  drain details: SQL clients: 1, table leases: 1, liveness record: 1, range leases: 24
node is shutting down; remaining: 27
node is shutting down; remaining: 0 (complete)
I200331 23:02:05.148888 1 util/stop/stopper.go:539  quiescing
ok
```

The unit test for this ascertains that:

- a first drain request on a test cluster returns a non-zero progress
  indicator. (There's always at least one unit of work - in the
  typical case that's the liveness record.)

- a second drain request returns a zero progress indicator.
  (Under normal circumstances the timeouts are never reached for an
  empty test cluster, so the graceful drain should always complete
  within the first request.)
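
A hypothetical shape for that test; the `testServer` stub here encodes the expected responses only to keep the sketch self-contained, whereas the real test drives an actual test server through the Drain RPC:

```
package server

import "testing"

// testServer and its helpers are stand-ins that mimic the behavior the
// test asserts.
type testServer struct{ drains int }

func startTestServer(t *testing.T) *testServer { return &testServer{} }
func (ts *testServer) Stop()                   {}

// Drain performs one drain round and returns the remaining indicator:
// the first call has at least one unit of work (typically the liveness
// record); subsequent calls report 0.
func (ts *testServer) Drain(t *testing.T) uint64 {
	ts.drains++
	if ts.drains == 1 {
		return 1
	}
	return 0
}

func TestDrainProgress(t *testing.T) {
	ts := startTestServer(t)
	defer ts.Stop()

	if remaining := ts.Drain(t); remaining == 0 {
		t.Fatal("expected non-zero remaining indicator on first drain")
	}
	if remaining := ts.Drain(t); remaining != 0 {
		t.Fatalf("expected zero remaining indicator, got %d", remaining)
	}
}
```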

Release justification: low risk, high benefit changes to existing functionality

Release note (cli change): The commands `cockroach quit` and
`cockroach node drain` now report a "work remaining" metric on their
standard error stream. The value decreases until it reaches 0, which
indicates that the graceful shutdown has completed server-side. An
operator can now rely on `cockroach node drain` to gain confidence in
a graceful shutdown prior to terminating the server process.

Release note (backward-incompatible change): `cockroach quit` now
prints out progress details on its standard error stream, even when
`--logtostderr` is not specified. Previously, nothing was printed on
standard error. Scripts that wish to ignore this output can redirect
the stderr stream.
Recommended by @tbg.

Release note: None
This is intended to aid troubleshooting faulty drains.

Release note: None
Prior to this patch, when the server closed the connection at the end
of the `Drain` RPC, the gRPC client in `quit` would get confused and
report either of the following errors:

```
ERROR: hard shutdown failed: rpc error: code = Unimplemented desc = grpc: Decompressor is not installed for grpc-encoding "snappy"
```

or

```
ERROR: hard shutdown failed: rpc error: code = Internal desc = grpc: compressed flag set with identity or empty encoding
```

This patch special-cases these two errors.
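
For illustration, the special-casing could look roughly like the following sketch; matching on gRPC status code plus message substring is an assumption about how the check might be written:

```
package cli

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isExpectedDuringShutdown reports whether err is one of the two
// spurious gRPC errors quoted above, produced when the server hangs up
// at the end of the Drain RPC.
func isExpectedDuringShutdown(err error) bool {
	s, ok := status.FromError(err)
	if !ok {
		return false
	}
	switch s.Code() {
	case codes.Unimplemented:
		return strings.Contains(s.Message(), "Decompressor is not installed")
	case codes.Internal:
		return strings.Contains(s.Message(), "compressed flag set with identity or empty encoding")
	}
	return false
}
```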

Release note: None
Release note (cli change): The default value of the parameter
`--drain-wait` for `cockroach quit` has been increased from 1 minute
to 10 minutes, to give more time for nodes with thousands of ranges to
migrate their leases away.
The call to `errors.Cause()` can return `nil` if the error was not
wrapped. The error always happens to be wrapped (by accident) in the
20.1 tree, but that's not guaranteed/enforced.
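
A defensive sketch of the resulting pattern, written against a pkg/errors-style `Cause`; the actual tree may use a different errors package:

```
package cli

import "github.com/pkg/errors"

// unwrapCause returns the root cause when one exists, falling back to
// the original error: errors.Cause() can return nil when there is no
// wrapped cause, so its result must not be used blindly.
func unwrapCause(err error) error {
	if cause := errors.Cause(err); cause != nil {
		return cause
	}
	return err
}
```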

Release note: None