cli: cockroach quit should only succeed as long as ranges will not become unavailable #43412

Closed
rkruze opened this issue Dec 20, 2019 · 5 comments

rkruze commented Dec 20, 2019

Is your feature request related to a problem? Please describe.
Currently, users have to check several different places to make sure that taking a node down in a CockroachDB cluster will not create unavailable ranges. Specifically, a user has to check system.replication_stats.under_replicated_ranges to confirm there are no under-replicated ranges, and check that all nodes show up as is_live in crdb_internal.gossip_nodes.
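The two manual checks described above could be scripted roughly as follows. This is a minimal sketch: the --host value is an example, and you may need --certs-dir or --insecure depending on how the cluster is deployed.

```shell
# 1. Check for under-replicated ranges; any rows returned mean it is
#    not yet safe to take a node down.
cockroach sql --host=localhost:26257 -e \
  "SELECT zone_id, under_replicated_ranges
     FROM system.replication_stats
    WHERE under_replicated_ranges > 0;"

# 2. Check that every node is live according to gossip; any rows
#    returned indicate a node that is already down or unreachable.
cockroach sql --host=localhost:26257 -e \
  "SELECT node_id, is_live
     FROM crdb_internal.gossip_nodes
    WHERE NOT is_live;"
```

Only if both queries return zero rows is it plausible that stopping one node leaves all ranges available, which is exactly the kind of check this issue proposes folding into the command itself.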

Describe the solution you'd like
Instead, the cockroach quit command should not return successfully if taking a node down would make data unavailable.

@piyush-singh piyush-singh self-assigned this Dec 20, 2019

ajwerner commented Jan 8, 2020

cc @johnrk @andy-kimball for triage

@andy-kimball

This issue will need discussion about how to handle cases where all nodes are brought down at once. Does cockroach quit just hang when that is attempted? Does it fail with an error after some timeout? Does it go ahead with a forced shutdown after some timeout? This should rarely, if ever, happen in production, but we still need well-defined behavior.

@piyush-singh

Andy's question mirrors a common problem we see with decommissioning nodes: users are often confused about why a node they are decommissioning does not appear to be draining. After some investigation, we realize they are trying to decommission a node in a cluster with a 3x replication factor and only 3 nodes. The decommission starts and then hangs, since there is nowhere for the replicas to go. If we change the behavior of cockroach quit, we should think about how to warn users about this case as well.
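A pre-flight check for the stuck-decommission scenario above could compare the live node count against the configured replication factor before starting. A rough sketch, assuming a local insecure cluster (the --host flag is an example):

```shell
# Count live nodes; if this equals the zone's num_replicas, a
# decommission will hang because replicas have nowhere to move.
cockroach sql --host=localhost:26257 -e \
  "SELECT count(*) AS live_nodes
     FROM crdb_internal.gossip_nodes
    WHERE is_live;"

# Inspect the default replication factor (num_replicas) to compare.
cockroach sql --host=localhost:26257 -e \
  "SHOW ZONE CONFIGURATION FOR RANGE default;"
```

If live_nodes is not strictly greater than num_replicas, decommissioning a node cannot complete, which is the warning condition this comment suggests surfacing to users.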


knz commented May 8, 2020

Update on the original issue at top:

  • we have a case today where a node leaving the cluster can leave behind replicas that are lagging in Raft and cannot serve requests right away. This needs to be fixed (priority project for 20.2), presumably by delaying the drain until they catch up.
  • there used to be a case where quit would let the node shut down while it still held some range leases, but that is no longer the case (fixed in 19.1.9/19.2.7/20.1.1)

Update on the specific questions about cockroach quit:

  • cockroach quit now refuses a graceful shutdown if there are not enough nodes left to accept the leases
  • there is a timeout (configurable with --drain-wait)
  • to avoid surprises, users are invited to use the new command cockroach node drain. If that command does not terminate successfully, then terminating the server process is not safe.
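The recommended sequence from the bullets above can be sketched as a shell snippet. This is illustrative only: the --host and timeout values are examples, and the final stop command depends entirely on how the process is supervised in your deployment.

```shell
# Drain the node first; cockroach node drain exits non-zero if the
# drain does not complete within the wait period.
if cockroach node drain --host=localhost:26257 --drain-wait=10m; then
  # Drain succeeded: it is now safe to terminate the server process.
  # (Replace with the stop mechanism your deployment uses.)
  systemctl stop cockroach
else
  echo "drain did not complete; NOT terminating the server" >&2
  exit 1
fi
```

Gating the process termination on the drain's exit status encodes the guidance in the last bullet: if the drain does not terminate successfully, killing the server is not safe.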


knz commented May 8, 2020

The remaining work to be done (identified by my first bullet point above) is exactly the work described in the top description for #44206, so I'll close this issue in favor of that one.

@knz knz closed this as completed May 8, 2020