server: restart a large Cockroach cluster with no impact to foreground traffic #66848

lunevalex · 2021-06-24T16:23:43Z

Is your feature request related to a problem? Please describe.

We recently observed a customer operating a 100-node cluster who needed to do a rolling restart. To do that, they generated a shell script that went through the nodes one at a time, issuing a restart command followed by an (aggressive) 30-second sleep. This struck me as pretty risky and not very user friendly. There are several issues here:

Running such commands on a large cluster takes a very long time (100 nodes with a 1-minute sleep: 100 minutes at least).
This is very error prone: you have to stay connected the whole time, babysit the process, etc.
Restarting multiple nodes at the same time is not possible. Some form of Cockroach-supported command (either directly from the binary, as in ./cockroach restart, or through k8s automation) could do a much better job (for example, Cockroach could select multiple nodes that host disjoint sets of ranges).
Restarting always results in a blip, particularly when restarting a node that hosts system ranges. This problem is captured in #66319 (server,sql: complement query_wait with conn_wait to wait until clients/pool closes connections).
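
To put rough numbers on the first point, here is a back-of-the-envelope sketch in Python. It assumes every restart round takes a fixed wall-clock time regardless of how many nodes are in it. The function name and the `batch_size` parameter are made up for illustration and are not part of any existing tool.

```python
import math


def estimated_restart_minutes(node_count: int,
                              minutes_per_round: float,
                              batch_size: int = 1) -> float:
    """Estimate total wall-clock minutes for a rolling restart.

    Assumption: each round restarts up to `batch_size` nodes in parallel
    and always takes `minutes_per_round`, regardless of batch size.
    """
    if node_count < 0 or batch_size < 1:
        raise ValueError("node_count must be >= 0 and batch_size >= 1")
    rounds = math.ceil(node_count / batch_size)
    return rounds * minutes_per_round


# One node at a time, as in the customer's script.
serial = estimated_restart_minutes(100, 1.0)
# Hypothetical: five nodes per round.
batched = estimated_restart_minutes(100, 1.0, batch_size=5)
print(serial, batched)
```

Under these assumptions the serial case matches the 100-minute figure above, and the total shrinks in proportion to the number of nodes that can safely be cycled together.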

Describe the solution you'd like

A single command, ./cockroach cluster restart, cycles the whole cluster quickly and efficiently without any impact to foreground user traffic.

Ideally Cockroach would be smart enough to figure out how to drain disjoint sets of nodes quickly and cycle through multiple nodes at a time, rather than one at a time.
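
As a rough illustration of the "disjoint sets" idea, the sketch below greedily groups nodes into restart batches so that no two nodes in the same batch host a replica of the same range. This is not how CockroachDB does or would do it. The input mapping, the function name, and the `max_batch_size` cap are all hypothetical, and a real implementation would also have to account for replicas moving while the restart is in progress.

```python
from typing import Dict, List, Set


def plan_restart_batches(ranges_by_node: Dict[str, Set[int]],
                         max_batch_size: int) -> List[List[str]]:
    """Greedily group nodes into batches with pairwise-disjoint range sets.

    ranges_by_node maps a node name to the IDs of the ranges it hosts
    (hypothetical input; gathering it is out of scope here).
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be >= 1")

    remaining = sorted(ranges_by_node)  # deterministic order
    batches: List[List[str]] = []
    while remaining:
        batch: List[str] = []
        batch_ranges: Set[int] = set()
        deferred: List[str] = []
        for node in remaining:
            node_ranges = ranges_by_node[node]
            # A node joins the batch only if it shares no range with
            # the nodes already picked for this batch.
            if len(batch) < max_batch_size and batch_ranges.isdisjoint(node_ranges):
                batch.append(node)
                batch_ranges |= node_ranges
            else:
                deferred.append(node)
        batches.append(batch)
        remaining = deferred
    return batches


example = {
    "n1": {1, 2},
    "n2": {2, 3},
    "n3": {3, 4},
    "n4": {5},
}
print(plan_restart_batches(example, max_batch_size=3))
```

The first node considered in each pass always fits, so every pass makes progress and the loop terminates. Greedy grouping does not guarantee the fewest possible batches; it is only meant to show that cycling several nodes per round is feasible when their range sets do not overlap.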

Describe alternatives you've considered

There is an argument to be made that this should live in an operator (i.e. a k8s operator) outside of CockroachDB.

Jira issue: CRDB-8257

knz · 2021-06-24T17:24:35Z

This needs an orchestration-level solution. CockroachDB nodes cannot "restart themselves".

github-actions · 2023-08-28T11:10:15Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

knz · 2023-08-28T11:53:10Z

xref crdb launcher proposal

lunevalex added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jun 24, 2021

knz added A-server-start-drain Pertains to server startup and shutdown sequences A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021

github-actions bot added the no-issue-activity label Aug 28, 2023

knz removed the no-issue-activity label Aug 28, 2023

knz added the T-kv KV Team label Aug 28, 2023

github-project-automation bot added this to KV Aug 28, 2024

github-project-automation bot moved this to Incoming in KV Aug 28, 2024
