-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-20.1: misc drain/quit improvements #47692
Merged
knz
merged 13 commits into
cockroachdb:release-20.1
from
knz:20200420-backport20.1-45149-46396
May 2, 2020
Merged
release-20.1: misc drain/quit improvements #47692
knz
merged 13 commits into
cockroachdb:release-20.1
from
knz:20200420-backport20.1-45149-46396
May 2, 2020
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
tbg
approved these changes
Apr 20, 2020
Release justification: fixes for high-priority or high-severity bugs in existing functionality This is one step in a series of changes to make `cockroach quit` more friendly with clusters containing many ranges. For reference, here's the timeline of a drain below. This patch makes the delay in (D) configurable, where it was not configurable previously. A separate commit will look into the hard delay at (E). This patch uses a cluster setting instead of a more clever approach based on extra parameters passed via the client request, or some other cleverness based on the number of ranges active, so as to make the patch lightweight and readily backportable. (Customers want this in 19.1/19.2 to enable no-downtime upgrade to 20.1). ``` adminServer server pgwire distsql storage - | ----------->o server.Undrain() ----------->- server.Drain() -> doDrain() . | . - CLIENT drain starts . - grpc.setMode(draining) . . . = (A) sleep server.shutdown.drain_wait (def 0s) . . . ------>- start draining pgwire clients . . | wait for drain of clients or server.shutdown.query_wait timeout (def 10s) . . . (B) . . - cancel remaining SQL sessions . -<------ . | . ------------>- ds.setDraining() . . | start draining distsql flows . . . wait for flows to drain, with minimum 1s andm max server.shutdown.query_wait (def 10s) . . . (C) . . - cancel remaining flows . -<------------ . | . - drain SQL table leases (performs KV requests) . | . - LEASES drain starts . | . ---------------------->o nodeLiveness.SetDraining() . ---------------------->- for each Store s, SetDraining() . . | . . - attempt transfering 100 leases . . | . . - with exponential backoff (20ms->1s) repeat attempt to transfer 100 leases . . .(D) try for at most server.shutdown.lease_transfer_wait (def 5s) <- NOW CONFIGURABLE . . . . -<---------------------- stop attempting lease transfers -<----------- - stop here if client did not request Shutdown | - grpc.Stop, stopper.Stop() . start of 10 second timeout (non-configurable) . (E) - os.Exit after timeout expires ``` Release note (general change): Prior to this patch, the phase of server shutdown responsible for range lease transfers to other nodes had a hard timeout of 5 seconds. This patch makes this timeout configurable via the new cluster setting `server.shutdown.lease_transfer_wait`. Release note (backward incompatible change): Prior to this patch, the phase of server shutdown responsible for range lease transfers to other nodes would give up after 10000 attempts of transferring replica leases away, regardless of the value of `server.shutdown.lease_transfer_wait`. The limit of 10000 attempts has been removed, so that now only the maximum duration `server.shutdown.lease_transfer_wait` applies.
Release note: None
Release note: None
Release justification: fixes for high-priority or high-severity bugs in existing functionality This is another step in a series of changes to make `cockroach quit` more friendly with clusters containing many ranges, and enabling graceful shutdowns when a separate service manager is responsible for stopping processes. This patch focuses on the client behavior, without considering the behavior server-side. For reference, here's the timeline of a `quit` call below. This patch introduces two things: - it makes the delay in step (X) configurable with `--drain-wait`, where it was not configurable previously. The default remains at 1 minute. - it makes it possible to drain without quitting with a new client command `cockroach node drain`, so that a DBA can look at the admin UI and satisfy themselves that the rest of the cluster has picked up the load. This patch reuses the existing Drain() RPC parameters, so that a `cockroach quit` or `cockroach node drain` client with this patch can be used with a server running a previous version. This is aiming to facilitate upgrades to the latest version on large clusters. ``` quit | - start decommission... . wait for decommission to complete (no timeout) . - end of decommission. | - start draining (no shutdown request)... . (X) wait for drain to complete or timeout specified with --drain-wait (NEW! def. 1m) . - timeout expired or drain completed. | - if 'cockroach node drain' (NEW!) was specified, stop here. | - send shutdown request (with no drain request) - if 1st shutdown fail, try again. | - end. ``` Additionally, the patch simplifies the drain/shutdown code as follows: - it now uses context.WithTimeout / contextutil.RunWithTimeout instead of its own bespoke goroutine / time.After / select dance. (The previous code was implemented before gprc supported context cancellation.) - it separates the code that sends a drain request from the code that sends a shutdown request, thereby dramatically reducing the number of conditionals. - it provides more friendly error messages in case of timeouts and when attempting operations when the cluster has not been initialized yet. Release note (backward-incompatible change): The textual error and warning messages displayed by `cockroach quit` under various circumstances have been updated. Meanwhile, the message "`ok`" remains as indicator that the operation has likely succeeded. Release note (cli change): The time that `cockroach quit` waits client-side for the node to drain (remove existing clients and push range leases away) is now configurable via the command-line flag `--drain-wait`. Note that separate server-side timeouts also apply separately, check the `server.shutdown.*` cluster settings for details. Release note (cli change): It is now possible to drain a node without shutting down the process, using `cockroach node drain`. This makes it easier to integrate with service managers and orchestration: it now becomes safe to issue `cockroach node drain` and then separately stop the service via a process manager or orchestrator. Without this new mode, there is a risk to misconfigure the service manager to auto-restart the node after it shuts down via `quit`, in a way that's surprising or unwanted. The new command `node drain` also recognizes the new `--drain-wait` flag.
Release note: None
Release justification: low risk, high benefit changes to existing functionality Release note: None
Prior to this patch, the Drain RPC was providing much untested flexibility to clients of the RPC: fields on the DrainRequest message that had a lot of possible values, but where only a few of the possible values ever made sense. This flexibility was not tested and I have reasons to suspect that many combinations of parameters would not work. See cockroachdb#46388 for details and a discussion. This patch removes the untested / unused flexibility and simplifies the code accordingly: - for requests: - the `Off` field is removed. It had never been used anyway. - the `On` field was only used ever with values `nil` (probe only = RPC health check) and `[0, 1]`. it is now renamed to `DeprecatedProbeIndicator` and the server now asserts that a non-`nil` input is exactly `[0, 1]`. It is still recognized for compatibility with `cockroach quit` from previous versions. - a new bool field `DoDrain` replaces `On` to indicate whether the request should do a graceful drain or not. - for responses: - the `On` field was only used ever with values `nil` (no drain in progress) and `[0, 1]` (drain currently in progress). The field is renamed `DeprecatedDrainStatus` for compatibility with hypothetical clients from previous versions (AFAIK, no client ever checked that field - maybe we could remove it altogether). - a new bool `IsDraining` reports the current draining status of the server. This patch also introduces a server-side unit test for the Drain RPC, which was missing previously. Release justification: low risk, high benefit changes to existing functionality Release note: None
Prior to this patch, it was impossible for the client of the Drain RPC to know how much was left to do after a Drain completes. In particular, it was impossible to determine whether a timeout was encountered and some load could not be shed. To improve upon this situation, this patch introduces a new field "remaining indicator" in DrainResponse. This value remains non-zero whenever the Drain has work to do or has performed work; its magnitude indicates "how much work" is being performed during the drain and is specified to decrease in successive drain invocations. When it reaches 0, the drain is known to have completed fully. Example output: Before: ``` $ cockroach quit ok ``` After: ``` $ cockroach quit node is draining; remaining: 12 node is draining; remaining: 0 (complete) ok ``` Or alternatively: ``` $ ./cockroach node drain node is draining; remaining: 12 node is draining; remaining: 0 (complete) ok ``` As shown in these examples, `node drain` and `quit` now iterate through drain requests until the remaining indicator reaches zero. The total number of iterations remains controlled by `--drain-wait` in aggregate (0 = no limit). Details are also provided in the log, which can be enabled with an opt-in explicit `--logtostderr=INFO` or some of the other log flags. For example: ``` $ ./cockroach quit --logtostderr=INFO I200331 23:02:05.146857 84 cli/start.go:1434 drain details: SQL clients: 1, table leases: 1, liveness record: 1, range leases: 24 node is shutting down; remaining: 27 node is shutting down; remaining: 0 (complete) I200331 23:02:05.148888 1 util/stop/stopper.go:539 quiescing ok ``` The unit test for this ascertains that: - a first drain request on a test cluster returns a non-zero progress indicator. (There's always at least one unit of work - in the typical case that's the liveness record.) - a second drain request returns a zero progress indicator. (Because under normal circumstances, the timeouts are never reached for an empty test clusters so the graceful drain should always be able to complete in the first request.) Release justification: low risk, high benefit changes to existing functionality Release note (cli change): The commands `cockroach quit` and `cockroach node drain` now reports a "work remaining" metric on their standard error stream. The value reduces until it reaches 0 to indicate that the graceful shutdown has completed server-side. An operator can now rely on `cockroach node drain` to obtain confidence of a graceful shutdown prior to terminating the server process. Release note (backward-incompatible change): `cockroach quit` now prints out progress details on its standard error stream, even when `--logtostderr` is not specified. Previously, nothing was printed on standard error. Scripts who wish to ignore this output can redirect the stderr stream.
Recommended by @tbg. Release note: None
This is intended to aid troubleshooting faulty drains. Release note: None
Prior to this patch, when the server closed the connection at the end of the `Drain` RPC the gRPC client in `quit` would get confused with either of the following: ``` ERROR: hard shutdown failed: rpc error: code = Unimplemented desc = grpc: Decompressor is not installed for grpc-encoding "snappy" ``` or ``` ERROR: hard shutdown failed: rpc error: code = Internal desc = grpc: compressed flag set with identity or empty encoding ``` This patch special-cases these two errors. Release note: None
Release note (cli change): The default value of the paramater `--drain-wait` for `cockroach quit` has been increased from 1 minute to 10 minutes, to give more time for nodes with thousands of range to migrate their leases away.
The call to `errors.Cause()` can return `nil` if the error was not wrapped. It's wrapped always (by accident) in the 20.1 tree, but that's not guaranteed/enforced. Release note: None
knz
force-pushed
the
20200420-backport20.1-45149-46396
branch
from
May 2, 2020 13:16
7c48230
to
3c5a671
Compare
This was referenced May 27, 2020
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
cc @cockroachdb/release
Fixes #47189
Fixes #47692
Planning to hold off on merging this until the final release SHA is selected.