Skip to content

Commit

Permalink
demo: wait for stopper to stop on shutdown
Browse files Browse the repository at this point in the history
Previously, attempting to quickly shutdown and restart a node using
the demo `\demo shutdown` and `\demo restart` command would result in
an error:

    ERROR: internal server error: failed to create engines: resource
    temporarily unavailable

This was revealed in the multi-tenant tests but is not multi-tenant
specific and happens in non-mt demo clusters as well.

The `\demo shutdown` command sends a Drain request with `Shutdown:
true`. However, as part of the shutdown, the gRPC server is shut down,
which will return an error to the client (which is ignored) before the
entire server is shutdown.

There is a TODO in the code from 2019 asking why we don't shut down
gRPC later in the process.

This is a demo cluster specific fix in which we wait on the stopper
client side since we happen to have a reference to it. This is
intended for backport so the disabled test can be re-enabled on the
release branches.

It's likely that we can do a larger fix in which we leave the drain
server running for as long as possible. But, the blast radius of such
a change would not be suitable for backport given the low-impact of
the problem.

Fixes: #110748

Release note: None
  • Loading branch information
stevendanna committed Oct 27, 2023
1 parent e041077 commit a51410a
Show file tree
Hide file tree
Showing 3 changed files with 14 additions and 4 deletions.
7 changes: 7 additions & 0 deletions pkg/acceptance/generated_cli_test.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

7 changes: 7 additions & 0 deletions pkg/cli/democluster/demo_cluster.go
Original file line number Diff line number Diff line change
Expand Up @@ -1060,6 +1060,13 @@ func (c *transientCluster) DrainAndShutdown(ctx context.Context, nodeID int32) e
if err := c.drainAndShutdown(ctx, c.servers[serverIdx].adminClient); err != nil {
return err
}

select {
case <-c.servers[serverIdx].Stopper().IsStopped():
case <-time.After(10 * time.Second):
return errors.Errorf("server stopper not stopped after 10 seconds")
}

c.servers[serverIdx].TestServerInterface = nil
c.servers[serverIdx].adminClient = nil
if c.demoCtx.Multitenant {
Expand Down
Original file line number Diff line number Diff line change
@@ -1,9 +1,5 @@
#! /usr/bin/env expect -f

# This test is skipped -- its filename lets it hide from the selector in
# TestDockerCLI. Unskip it by renaming after fixing
# https://github.com/cockroachdb/cockroach/issues/110748.

source [file join [file dirname $argv0] common.tcl]

spawn $argv demo --no-line-editor --empty --nodes 3 --multitenant --log-dir=logs
Expand Down

0 comments on commit a51410a

Please sign in to comment.