cli: cockroach quit should only succeed as long as ranges will not become unavailable #43412

Closed
rkruze opened this issue Dec 20, 2019 · 5 comments

rkruze commented Dec 20, 2019

Is your feature request related to a problem? Please describe.
Currently, users have to check several different places to make sure that taking a node down in a CockroachDB cluster will not create unavailable ranges. Specifically, a user has to check system.replication_stats.under_replicated_ranges to confirm there are no under-replicated ranges, and check that all nodes show up as is_live in crdb_internal.gossip_nodes.
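The two manual checks described above could be scripted roughly as follows. This is a minimal sketch: the --host value is an example, and you may need --certs-dir or --insecure depending on how the cluster is deployed.

```shell
# 1. Check for under-replicated ranges; any rows returned mean it is
#    not yet safe to take a node down.
cockroach sql --host=localhost:26257 -e \
  "SELECT zone_id, under_replicated_ranges
     FROM system.replication_stats
    WHERE under_replicated_ranges > 0;"

# 2. Check that every node is live according to gossip; any rows
#    returned indicate a node that is already down or unreachable.
cockroach sql --host=localhost:26257 -e \
  "SELECT node_id, is_live
     FROM crdb_internal.gossip_nodes
    WHERE NOT is_live;"
```

Only if both queries return zero rows is it plausible that stopping one node leaves all ranges available, which is exactly the kind of check this issue proposes folding into the command itself.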

Describe the solution you'd like
Instead, the cockroach quit command should not return successfully if taking a node down would make data unavailable.

@piyush-singh piyush-singh self-assigned this Dec 20, 2019

ajwerner commented Jan 8, 2020

cc @johnrk @andy-kimball for triage

@andy-kimball

This issue will need discussion about how to handle cases where all nodes are brought down at once. Does cockroach quit just hang when that is attempted? Does it fail with an error after some timeout? Does it go ahead with a forced shutdown after some timeout? This should rarely, if ever, happen in production, but we still need well-defined behavior.

@piyush-singh

Andy's question mirrors a common problem we see with decommissioning nodes: users are often confused about why a node they are decommissioning does not appear to be draining. After some investigation, we realize they are trying to decommission a node in a cluster with a 3x replication factor and only 3 nodes. The decommission starts and then hangs, since there is nowhere for the replicas to go. If we change the behavior of cockroach quit, we should think about how to warn users about this case as well.
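A pre-flight check for the stuck-decommission scenario above could compare the live node count against the configured replication factor before starting. A rough sketch, assuming a local insecure cluster (the --host flag is an example):

```shell
# Count live nodes; if this equals the zone's num_replicas, a
# decommission will hang because replicas have nowhere to move.
cockroach sql --host=localhost:26257 -e \
  "SELECT count(*) AS live_nodes
     FROM crdb_internal.gossip_nodes
    WHERE is_live;"

# Inspect the default replication factor (num_replicas) to compare.
cockroach sql --host=localhost:26257 -e \
  "SHOW ZONE CONFIGURATION FOR RANGE default;"
```

If live_nodes is not strictly greater than num_replicas, decommissioning a node cannot complete, which is the warning condition this comment suggests surfacing to users.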


knz commented May 8, 2020

Update on the original issue at top:

  • we have a case today where a node leaving the cluster can leave behind replicas that are lagging in Raft and cannot serve requests right away. This needs to be fixed (priority project for 20.2), presumably by delaying the drain until they catch up.
  • there used to be a case where quit would let the node shut down while it still held some range leases, but that is no longer the case (fixed in 19.1.9/19.2.7/20.1.1)

Update on the specific questions about cockroach quit:

  • cockroach quit now refuses a graceful shutdown if there are not enough nodes left to accept the leases
  • there is a timeout (configurable with --drain-wait)
  • to avoid surprises, users are invited to use the new command cockroach node drain. If that command does not terminate successfully, then terminating the server process is not safe.
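The recommended sequence from the bullets above can be sketched as a shell snippet. This is illustrative only: the --host and timeout values are examples, and the final stop command depends entirely on how the process is supervised in your deployment.

```shell
# Drain the node first; cockroach node drain exits non-zero if the
# drain does not complete within the wait period.
if cockroach node drain --host=localhost:26257 --drain-wait=10m; then
  # Drain succeeded: it is now safe to terminate the server process.
  # (Replace with the stop mechanism your deployment uses.)
  systemctl stop cockroach
else
  echo "drain did not complete; NOT terminating the server" >&2
  exit 1
fi
```

Gating the process termination on the drain's exit status encodes the guidance in the last bullet: if the drain does not terminate successfully, killing the server is not safe.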


knz commented May 8, 2020

The remaining work to be done (identified by my first bullet point above) is exactly the work described in the top description for #44206, so I'll close this issue in favor of that one.

@knz knz closed this as completed May 8, 2020