server: restart a large Cockroach cluster with no impact to foreground traffic #66848

Open
lunevalex opened this issue Jun 24, 2021 · 3 comments
Labels
A-cc-enablement Pertains to current CC production issues or short-term projects A-server-start-drain Pertains to server startup and shutdown sequences C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@lunevalex
Collaborator

lunevalex commented Jun 24, 2021

Is your feature request related to a problem? Please describe.

We recently observed a customer operating a 100-node cluster who needed to do a rolling restart. To do that, they generated a shell script that went through the nodes one at a time, issuing a restart command followed by an (aggressive) 30-second sleep. This struck me as pretty risky and not very user-friendly. There are several issues here:

  • Running such commands on a large cluster takes a very long time (100 nodes with a 1-minute sleep each: at least 100 minutes).
    It is also very error-prone: you have to stay connected the whole time, babysit the process, etc.
  • Restarting multiple nodes at the same time is not possible. Some form of Cockroach-supported command (either directly from the binary, as in ./cockroach restart, or through k8s automation) would be able to do a much better job (for example, Cockroach could select multiple nodes that host disjoint sets of ranges).
  • Restarting always results in a blip, particularly when restarting a node that hosts system ranges. This problem is captured in server,sql: complement query_wait with conn_wait to wait until clients/pool closes connections #66319.

Describe the solution you'd like

A single command, ./cockroach cluster restart, is issued and cycles the whole cluster quickly and efficiently without any impact to foreground user traffic.

Ideally, Cockroach would be smart enough to figure out how to drain disjoint sets of nodes quickly and cycle through multiple nodes at a time, rather than one at a time.
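To make the "disjoint sets" idea concrete, here is a minimal sketch of how restart batches could be planned. It is not an existing CockroachDB feature or API. The function name plan_restart_batches and the replicas_by_node input (node ID mapped to the set of range IDs with a replica on that node) are hypothetical, and the sketch assumes the placement data has already been collected from the cluster somehow. It uses a simple greedy first-fit grouping, so it does not necessarily find the minimum number of batches, and it ignores leases, system ranges, and draining entirely.

```python
def plan_restart_batches(replicas_by_node):
    """Greedily group nodes so that no two nodes in a batch share a range.

    replicas_by_node: dict of node ID -> set of range IDs stored on that node
    (hypothetical input; gathering it from a real cluster is out of scope).
    Returns a list of batches, each a list of node IDs that could be
    restarted together while taking down at most one replica of any range.
    """
    batches = []  # each entry: (node IDs in batch, union of their range IDs)

    # Place nodes holding the most ranges first; they are the hardest to fit.
    order = sorted(replicas_by_node,
                   key=lambda n: (-len(replicas_by_node[n]), n))

    for node in order:
        ranges = replicas_by_node[node]
        for nodes, covered in batches:
            if covered.isdisjoint(ranges):
                # No range overlap with this batch: safe to add the node.
                nodes.append(node)
                covered.update(ranges)
                break
        else:
            # Overlaps with every existing batch: start a new one.
            batches.append(([node], set(ranges)))

    return [nodes for nodes, _ in batches]


# Toy placement, made up for illustration only.
placement = {
    1: {"r1", "r2"},
    2: {"r1", "r3"},
    3: {"r2", "r3"},
    4: {"r4"},
    5: {"r4", "r5"},
    6: {"r5"},
}
batches = plan_restart_batches(placement)
print(batches)  # [[1, 5], [2, 4, 6], [3]]
```

In this toy example six nodes are cycled in three rounds instead of six. A real implementation would still have to drain each node and wait for the batch to rejoin before starting the next round.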

Describe alternatives you've considered

There is an argument to be made that this should live in an operator (i.e. a k8s operator) outside of CockroachDB.

Jira issue: CRDB-8257

@lunevalex lunevalex added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jun 24, 2021
@knz
Contributor

knz commented Jun 24, 2021

This needs an orchestration-level solution. CockroachDB nodes cannot "restart themselves".

@knz knz added A-server-start-drain Pertains to server startup and shutdown sequences A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz
Contributor

knz commented Aug 28, 2023

xref crdb launcher proposal

@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024