server: restart a large Cockroach cluster with no impact to foreground traffic #66848
Labels
A-cc-enablement
Pertains to current CC production issues or short-term projects
A-server-start-drain
Pertains to server startup and shutdown sequences
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Is your feature request related to a problem? Please describe.
We recently observed a customer operating a 100-node cluster who needed to do a rolling restart. To do that, they generated a shell script that went through all nodes one at a time, issuing a restart command followed by an (aggressive) 30-second sleep. This struck me as pretty risky and not very user friendly. There are many issues here:
This is very error prone: you have to stay connected the whole time, you have to babysit the process, etc.
Describe the solution you'd like
A single command, ./cockroach cluster restart, that cycles the whole cluster quickly and efficiently without any impact on foreground user traffic.
Ideally, Cockroach would be smart enough to figure out how to drain disjoint sets of nodes quickly and cycle through multiple nodes at a time, rather than one at a time.
Describe alternatives you've considered
There is an argument to be made that this should live in an operator (i.e. a k8s operator) outside of CockroachDB.
Jira issue: CRDB-8257