
release-20.1: misc drain/quit improvements #47692

Merged
merged 13 commits on May 2, 2020

Conversation

knz
Contributor

@knz knz commented Apr 20, 2020

cc @cockroachdb/release


Fixes #47189
Fixes #47692

Planning to hold off on merging this until the final release SHA is selected.

@knz knz requested a review from tbg April 20, 2020 11:24
@knz knz requested a review from a team as a code owner April 20, 2020 11:24
@cockroach-teamcity
Member

This change is Reviewable

knz added 13 commits May 2, 2020 15:16
Release justification: fixes for high-priority or high-severity bugs in existing functionality

This is one step in a series of changes to make `cockroach quit`
friendlier to clusters containing many ranges.

For reference, the timeline of a drain is shown below. This patch makes
the delay in (D) configurable, where it was not configurable
previously. A separate commit will look into the hard delay at (E).

This patch uses a cluster setting instead of a more clever approach
based on extra parameters passed via the client request, or some other
cleverness based on the number of active ranges, so as to keep the
patch lightweight and readily backportable. (Customers want this in
19.1/19.2 to enable a no-downtime upgrade to 20.1.)

```
adminServer server pgwire distsql storage
  -
  |
  ----------->o                           server.Undrain()
  ----------->-                           server.Drain() -> doDrain()
  .           |
  .           -                           CLIENT drain starts
  .           -                           grpc.setMode(draining)
  .           .
  .           =  (A)                      sleep server.shutdown.drain_wait (def 0s)
  .           .
  .           ------>-                    start draining pgwire clients
  .           .      |                    wait for drain of clients or server.shutdown.query_wait timeout (def 10s)
  .           .      .  (B)
  .           .      -                    cancel remaining SQL sessions
  .           -<------
  .           |
  .           ------------>-              ds.setDraining()
  .           .            |              start draining distsql flows
  .           .            .              wait for flows to drain, with minimum 1s and max server.shutdown.query_wait (def 10s)
  .           .            .  (C)
  .           .            -              cancel remaining flows
  .           -<------------
  .           |
  .           -                           drain SQL table leases (performs KV requests)
  .           |
  .           -                           LEASES drain starts
  .           |
  .           ---------------------->o    nodeLiveness.SetDraining()
  .           ---------------------->-    for each Store s, SetDraining()
  .           .                      |
  .           .                      -    attempt transferring 100 leases
  .           .                      |
  .           .                      -    with exponential backoff (20ms->1s) repeat attempt to transfer 100 leases
  .           .                      .(D) try for at most server.shutdown.lease_transfer_wait (def 5s) <- NOW CONFIGURABLE
  .           .                      .
  .           -<----------------------    stop attempting lease transfers
  -<-----------
  -                                       stop here if client did not request Shutdown
  |
  -                                       grpc.Stop, stopper.Stop()
  .                                       start of 10 second timeout (non-configurable)
  . (E)
  -                                       os.Exit after timeout expires
```
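
To make step (D) concrete, here is a minimal Go sketch of that retry loop, using only the standard library; `transferLeases` and `drainLeases` are hypothetical stand-ins for illustration, not CockroachDB's actual identifiers:

```
package drain

import (
	"context"
	"fmt"
	"time"
)

// transferLeases is a hypothetical stand-in for the per-store batch
// transfer; it reports how many leases are still held locally.
func transferLeases(ctx context.Context, batchSize int) (remaining int) {
	// ... attempt to move up to batchSize leases to other nodes ...
	return 0
}

// drainLeases mirrors step (D): repeat 100-lease transfer attempts with
// exponential backoff (20ms -> 1s), bounded only by the configurable
// server.shutdown.lease_transfer_wait duration.
func drainLeases(ctx context.Context, leaseTransferWait time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, leaseTransferWait)
	defer cancel()

	backoff := 20 * time.Millisecond
	for {
		if remaining := transferLeases(ctx, 100); remaining == 0 {
			return nil // all leases transferred away
		}
		select {
		case <-ctx.Done():
			// Time budget exhausted: stop attempting transfers. There is
			// no attempt-count limit, only this duration.
			return fmt.Errorf("lease transfers incomplete after %s", leaseTransferWait)
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > time.Second {
			backoff = time.Second
		}
	}
}
```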

Release note (general change): Prior to this patch, the phase of
server shutdown responsible for range lease transfers to other nodes
had a hard timeout of 5 seconds. This patch makes this timeout
configurable via the new cluster setting
`server.shutdown.lease_transfer_wait`.

Release note (backward incompatible change): Prior to this patch, the
phase of server shutdown responsible for range lease transfers to
other nodes would give up after 10000 attempts to transfer replica
leases away, regardless of the value of
`server.shutdown.lease_transfer_wait`. The limit of 10000 attempts has
been removed, so that now only the maximum duration
`server.shutdown.lease_transfer_wait` applies.
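
For illustration, a sketch of how the new duration setting might be registered; the `settings.RegisterDurationSetting` helper and its signature here are assumptions modeled on CockroachDB's settings package and may differ in detail from the actual tree:

```
package server

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// leaseTransferWait is the new cluster setting. The registration helper
// and its signature are assumptions; details may differ per release.
var leaseTransferWait = settings.RegisterDurationSetting(
	"server.shutdown.lease_transfer_wait",
	"the amount of time a server waits to transfer range leases away "+
		"before proceeding with the rest of the shutdown process",
	5*time.Second, // the previous hard-coded timeout, now just the default
)
```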
Release justification: fixes for high-priority or high-severity bugs in existing functionality

This is another step in a series of changes to make `cockroach quit`
friendlier to clusters containing many ranges, and to enable
graceful shutdowns when a separate service manager is responsible
for stopping processes.

This patch focuses on the client behavior, without considering the
behavior server-side.

For reference, the timeline of a `quit` call is shown below.

This patch introduces two things:

- it makes the delay in step (X) configurable with `--drain-wait`,
  where it was not configurable previously. The default remains at 1
  minute.

- it makes it possible to drain without quitting with a new client
  command `cockroach node drain`, so that a DBA can look at the admin UI
  and satisfy themselves that the rest of the cluster has picked up the
  load.

This patch reuses the existing Drain() RPC parameters, so that a
`cockroach quit` or `cockroach node drain` client with this patch can
be used with a server running a previous version. This aims to
facilitate upgrades to the latest version on large clusters.

```
quit
 |
 -       start decommission...
 .       wait for decommission to complete (no timeout)
 .
 -       end of decommission.
 |
 -       start draining (no shutdown request)...
 .  (X)  wait for drain to complete or timeout specified with --drain-wait (NEW! def. 1m)
 .
 -       timeout expired or drain completed.
 |
 -       if 'cockroach node drain' (NEW!) was specified, stop here.
 |
 -       send shutdown request  (with no drain request)
 -       if the 1st shutdown fails, try again.
 |
 -       end.
```

Additionally, the patch simplifies the drain/shutdown code as follows:

- it now uses context.WithTimeout / contextutil.RunWithTimeout instead of
  its own bespoke goroutine / time.After / select dance; see the sketch
  after this list. (The previous code was implemented before gRPC
  supported context cancellation.)

- it separates the code that sends a drain request from the code that
  sends a shutdown request, thereby dramatically reducing the number of
  conditionals.

- it provides more friendly error messages in case of timeouts and
  when attempting operations when the cluster has not been initialized
  yet.
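
As referenced in the list above, a minimal sketch of the `contextutil.RunWithTimeout` pattern; `doDrainRPC` is a hypothetical stand-in, and the exact `RunWithTimeout` signature is assumed from the `pkg/util/contextutil` package:

```
package cli

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/contextutil"
)

// doDrainRPC is a hypothetical stand-in for the code that sends one
// Drain request over gRPC.
func doDrainRPC(ctx context.Context) error { return nil }

// drainWithTimeout shows the simplified pattern: the bespoke
// goroutine/time.After/select dance is replaced by RunWithTimeout,
// which cancels the RPC's context once the --drain-wait budget expires.
func drainWithTimeout(ctx context.Context, drainWait time.Duration) error {
	return contextutil.RunWithTimeout(ctx, "drain", drainWait,
		func(ctx context.Context) error {
			return doDrainRPC(ctx)
		})
}
```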

Release note (backward-incompatible change): The textual error and
warning messages displayed by `cockroach quit` under various
circumstances have been updated. Meanwhile, the message "`ok`" remains
as the indicator that the operation has likely succeeded.

Release note (cli change): The time that `cockroach quit` waits
client-side for the node to drain (remove existing clients and push
range leases away) is now configurable via the command-line flag
`--drain-wait`. Note that separate server-side timeouts also apply;
check the `server.shutdown.*` cluster settings for details.

Release note (cli change): It is now possible to drain a node without
shutting down the process, using `cockroach node drain`. This makes it
easier to integrate with service managers and orchestration: it now
becomes safe to issue `cockroach node drain` and then separately stop
the service via a process manager or orchestrator. Without this new
mode, there is a risk of misconfiguring the service manager to
auto-restart the node after it shuts down via `quit`, in a way that's
surprising or unwanted. The new command `node drain` also
recognizes the new `--drain-wait` flag.
Release justification: low risk, high benefit changes to existing functionality

Release note: None
Prior to this patch, the Drain RPC provided a lot of untested
flexibility to its clients: fields on the DrainRequest message
admitted many possible values, of which only a few ever made
sense. This flexibility was not tested and I have reason to
suspect that many combinations of parameters would not
work. See cockroachdb#46388 for
details and a discussion.

This patch removes the untested / unused flexibility and
simplifies the code accordingly:

- for requests:

  - the `Off` field is removed. It had never been used anyway.

  - the `On` field was only ever used with the values `nil` (probe only =
    RPC health check) and `[0, 1]`. It is now renamed to
    `DeprecatedProbeIndicator` and the server now asserts that a non-`nil`
    input is exactly `[0, 1]`. It is still recognized for
    compatibility with `cockroach quit` from previous versions.

  - a new bool field `DoDrain` replaces `On` to indicate whether the
    request should do a graceful drain or not.

- for responses:

  - the `On` field was only ever used with the values `nil` (no
    drain in progress) and `[0, 1]` (drain currently in progress).
    The field is renamed `DeprecatedDrainStatus` for compatibility
    with hypothetical clients from previous versions (AFAIK,
    no client ever checked this field; maybe we could remove it
    altogether).

  - a new bool `IsDraining` reports the current draining status
    of the server.

This patch also introduces a server-side unit test for the Drain RPC,
which was missing previously.
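
To make the compatibility rule concrete, here is a hypothetical Go sketch of how the server might interpret the revised request fields; the trimmed `DrainRequest` struct and the `interpretDrainRequest` helper are illustrations, not the actual generated proto code:

```
package server

import "fmt"

// DrainRequest is a trimmed, hypothetical rendering of the proto
// message after this patch; only the fields discussed above are shown.
type DrainRequest struct {
	// Legacy "On" field: nil (probe only) or exactly [0, 1].
	DeprecatedProbeIndicator []int32
	// New: whether the request should perform a graceful drain.
	DoDrain bool
}

// interpretDrainRequest sketches the server-side compatibility rule: a
// non-nil probe indicator must be exactly [0, 1] (as sent by older
// `cockroach quit` clients) and implies a drain; otherwise DoDrain
// decides.
func interpretDrainRequest(req *DrainRequest) (doDrain bool, err error) {
	if ind := req.DeprecatedProbeIndicator; ind != nil {
		if len(ind) != 2 || ind[0] != 0 || ind[1] != 1 {
			return false, fmt.Errorf("unsupported drain probe indicator: %v", ind)
		}
		return true, nil
	}
	return req.DoDrain, nil
}
```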

Release justification: low risk, high benefit changes to existing functionality

Release note: None
Prior to this patch, it was impossible for the client of the Drain RPC
to know how much was left to do after a Drain completes. In particular,
it was impossible to determine whether a timeout was encountered and
some load could not be shed.

To improve upon this situation, this patch introduces a new
"remaining indicator" field in DrainResponse. This value remains
non-zero whenever the drain still has work to do or has performed
work; its magnitude indicates "how much work" is being performed
during the drain and is specified to decrease across successive
drain invocations. When it reaches 0, the drain is known to have
completed fully.

Example output:

Before:

```
  $ cockroach quit
  ok
```

After:

```
  $ cockroach quit
  node is draining; remaining: 12
  node is draining; remaining: 0 (complete)
  ok
```

Or alternatively:

```
  $ ./cockroach node drain
  node is draining; remaining: 12
  node is draining; remaining: 0 (complete)
  ok
```

As shown in these examples, `node drain` and `quit` now iterate
through drain requests until the remaining indicator reaches zero. The
total time spent across iterations remains bounded by `--drain-wait`
in aggregate (0 = no limit).
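
A minimal Go sketch of that client-side loop, assuming a hypothetical `sendDrain` helper for one Drain RPC round trip:

```
package cli

import (
	"context"
	"fmt"
	"os"
	"time"
)

// sendDrain is a hypothetical stand-in for one Drain RPC round trip;
// the response carries the remaining-work indicator.
func sendDrain(ctx context.Context) (remaining uint64, err error) {
	return 0, nil
}

// drainUntilDone iterates drain requests until the indicator reaches
// zero, with the aggregate time bounded by --drain-wait (0 = no limit).
func drainUntilDone(ctx context.Context, drainWait time.Duration) error {
	if drainWait > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, drainWait)
		defer cancel()
	}
	for {
		remaining, err := sendDrain(ctx)
		if err != nil {
			return err
		}
		fmt.Fprintf(os.Stderr, "node is draining; remaining: %d", remaining)
		if remaining == 0 {
			fmt.Fprintln(os.Stderr, " (complete)")
			return nil
		}
		fmt.Fprintln(os.Stderr)
	}
}
```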

Details are also provided in the log, which can be enabled with an
explicit opt-in `--logtostderr=INFO` or one of the other logging flags.
For example:

```
$ ./cockroach quit --logtostderr=INFO
I200331 23:02:05.146857 84 cli/start.go:1434  drain details: SQL clients: 1, table leases: 1, liveness record: 1, range leases: 24
node is shutting down; remaining: 27
node is shutting down; remaining: 0 (complete)
I200331 23:02:05.148888 1 util/stop/stopper.go:539  quiescing
ok
```

The unit test for this ascertains that:

- a first drain request on a test cluster returns a non-zero progress
  indicator. (There's always at least one unit of work - in the
  typical case that's the liveness record.)

- a second drain request returns a zero progress indicator.
  (Under normal circumstances the timeouts are never reached for an
  empty test cluster, so the graceful drain should always complete
  within the first request.)
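
A hypothetical shape for that test; the `testServer` stub here encodes the expected responses only to keep the sketch self-contained, whereas the real test drives an actual test server through the Drain RPC:

```
package server

import "testing"

// testServer and its helpers are stand-ins that mimic the behavior the
// test asserts.
type testServer struct{ drains int }

func startTestServer(t *testing.T) *testServer { return &testServer{} }
func (ts *testServer) Stop()                   {}

// Drain performs one drain round and returns the remaining indicator:
// the first call has at least one unit of work (typically the liveness
// record); subsequent calls report 0.
func (ts *testServer) Drain(t *testing.T) uint64 {
	ts.drains++
	if ts.drains == 1 {
		return 1
	}
	return 0
}

func TestDrainProgress(t *testing.T) {
	ts := startTestServer(t)
	defer ts.Stop()

	if remaining := ts.Drain(t); remaining == 0 {
		t.Fatal("expected non-zero remaining indicator on first drain")
	}
	if remaining := ts.Drain(t); remaining != 0 {
		t.Fatalf("expected zero remaining indicator, got %d", remaining)
	}
}
```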

Release justification: low risk, high benefit changes to existing functionality

Release note (cli change): The commands `cockroach quit` and
`cockroach node drain` now report a "work remaining" metric on their
standard error stream. The value decreases until it reaches 0, which
indicates that the graceful shutdown has completed server-side. An
operator can now rely on `cockroach node drain` to gain confidence in
a graceful shutdown prior to terminating the server process.

Release note (backward-incompatible change): `cockroach quit` now
prints out progress details on its standard error stream, even when
`--logtostderr` is not specified. Previously, nothing was printed on
standard error. Scripts that wish to ignore this output can redirect
the stderr stream.
Recommended by @tbg.

Release note: None
This is intended to aid troubleshooting faulty drains.

Release note: None
Prior to this patch, when the server closed the connection at the end
of the `Drain` RPC, the gRPC client in `quit` would get confused and
report either of the following errors:

```
ERROR: hard shutdown failed: rpc error: code = Unimplemented desc = grpc: Decompressor is not installed for grpc-encoding "snappy"
```

or

```
ERROR: hard shutdown failed: rpc error: code = Internal desc = grpc: compressed flag set with identity or empty encoding
```

This patch special-cases these two errors.
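
For illustration, the special-casing could look roughly like the following sketch; matching on gRPC status code plus message substring is an assumption about how the check might be written:

```
package cli

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isExpectedDuringShutdown reports whether err is one of the two
// spurious gRPC errors quoted above, produced when the server hangs up
// at the end of the Drain RPC.
func isExpectedDuringShutdown(err error) bool {
	s, ok := status.FromError(err)
	if !ok {
		return false
	}
	switch s.Code() {
	case codes.Unimplemented:
		return strings.Contains(s.Message(), "Decompressor is not installed")
	case codes.Internal:
		return strings.Contains(s.Message(), "compressed flag set with identity or empty encoding")
	}
	return false
}
```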

Release note: None
Release note (cli change): The default value of the parameter
`--drain-wait` for `cockroach quit` has been increased from 1 minute
to 10 minutes, to give more time for nodes with thousands of ranges to
migrate their leases away.
The call to `errors.Cause()` can return `nil` if the error was not
wrapped. The error always happens to be wrapped (by accident) in the
20.1 tree, but that's not guaranteed/enforced.
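
A defensive sketch of the resulting pattern, written against a pkg/errors-style `Cause`; the actual tree may use a different errors package:

```
package cli

import "github.com/pkg/errors"

// unwrapCause returns the root cause when one exists, falling back to
// the original error: errors.Cause() can return nil when there is no
// wrapped cause, so its result must not be used blindly.
func unwrapCause(err error) error {
	if cause := errors.Cause(err); cause != nil {
		return cause
	}
	return err
}
```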

Release note: None