storage: enable seamless rolling restarts #44206
Labels:
- A-kv-replication: Relating to Raft, consensus, and coordination.
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are the exception).
- O-sre: For issues SRE opened or otherwise cares about tracking.
Rolling restarts are our recommended method for cluster upgrades. From a high level, the operator will perform the following steps sequentially on all nodes in the cluster:

1. take the node offline gracefully
2. perform the maintenance (for an upgrade, swap in the new binary)
3. restart the node and wait for it to be healthy before moving on to the next node
The goal of this strategy is not only to keep the cluster available throughout the process, but also to avoid spikes in latency.
Internally, taking a node offline gracefully entails a sequence of steps (note the handling of `req.Ready` and the drain code referenced below):

- `cockroach/pkg/server/status.go`, lines 623 to 668 at 0e7f2f3
- `cockroach/pkg/server/server.go`, lines 2016 to 2018 at c76ad97
- `cockroach/pkg/server/server.go`, lines 1986 to 1999 at c76ad97
So what does this look like from an operator's POV? "take a node offline" is reasonably easy - remove from load balancers; initiate a graceful shutdown; hard-kill after a generous timeout. But "restart the node" is trickier - what to wait for before taking down the next node? Operators would certainly wait for the readiness/health endpoints, but those essentially greenlight the node once it is live again (as measured by node liveness).
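As an illustration of what "wait for readiness" looks like today, here is a minimal Go sketch that polls a node's HTTP readiness endpoint (assuming the standard `/health?ready=1` endpoint on the node's HTTP port) until it returns 200. The point of this issue is that a 200 here is not a sufficient signal:

```go
// waitUntilReady polls a node's readiness endpoint until it reports ready or
// the timeout elapses. This is roughly what operators (or k8s readiness
// probes) do between restarts today; note that it only tells us the node is
// live again, not that its replicas have caught up on their raft logs.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func waitUntilReady(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	url := fmt.Sprintf("http://%s/health?ready=1", addr)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // node considers itself live and ready
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("node %s not ready after %s", addr, timeout)
}

func main() {
	if err := waitUntilReady("localhost:8080", 5*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```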
This is not enough. Consider a range in a cluster that lives on three nodes n1, n2, and n3. n1 is taken offline first; n2 and n3 continue to accept write traffic. Let's say they're going at full speed, meaning that when n1 rejoins them it will take seconds (potentially a dozen or so) for it to catch up on the raft log. Now n1 is restarted and marks itself as live. It begins catching up on the log, but won't be up to date for another 10s or so. Then n2 gets taken down, and the range becomes unavailable: n3 (which has the latest entries and is now leader) cannot commit anything until n1 has caught up, which will take at least another couple of seconds. From the operator's point of view, large write latencies are observed on this range.
A similar phenomenon occurs when the replica requires a snapshot after coming back up (just replace "catching up on the log" by "receiving a snapshot"), see #37906.
This "catching up" is not cleanly exposed via the readiness probes. Our current attempt at a workaround (used in MSO) is roughly to monitor the
underreplicated
metric. However, that metric only counts followers that are not considered live. As we've established above, followers will be live when they're ready anyway, rendering this probe moot.We do have a
behind
metric, seecockroach/pkg/storage/replica_metrics.go
Lines 122 to 126 in 26a612a
however this metric is not on/off - it's a number that cannot safely be thresholded (in a perfectly healthy deployment, this number could be in the thousands, especially when some followers are always lagging). This makes it unusable as a readiness probe.
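For concreteness, the kind of workaround probe we'd be tempted to write against the `behind` metric looks roughly like the sketch below. It assumes the metric is exported as `raftlog_behind` on the node's Prometheus endpoint `/_status/vars` (names and labels are illustrative). The fundamental problem is the arbitrary threshold, not the plumbing:

```go
// Sketch of a (flawed) readiness probe that thresholds the node's aggregate
// "behind" metric scraped from the Prometheus endpoint.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

func behindEntries(addr string) (float64, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/_status/vars", addr))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	scanner := bufio.NewScanner(resp.Body)
	var total float64
	for scanner.Scan() {
		line := scanner.Text()
		// One line per store, e.g. `raftlog_behind{store="1"} 1234` (assumed format).
		if strings.HasPrefix(line, "raftlog_behind{") {
			fields := strings.Fields(line)
			v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
			if err != nil {
				return 0, err
			}
			total += v
		}
	}
	return total, scanner.Err()
}

func main() {
	behind, err := behindEntries("localhost:8080")
	if err != nil {
		panic(err)
	}
	// Any fixed threshold is a guess: a healthy cluster can legitimately sit
	// in the thousands, while a restarted node's lag is what we actually care about.
	const threshold = 1000
	fmt.Printf("behind=%v, below threshold=%v (not a reliable readiness signal)\n",
		behind, behind < threshold)
}
```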
Essentially, what we would want is for the recently restarted node to discover, for each of its ranges, whether it is still catching up on parts of the raft log created before the restart. This is not an easy problem to solve, though perhaps we can introduce a variant of the `behind` metric that can do the job. Roughly speaking, we want to track, per range, the timestamp at which a proposal was last fully replicated (if no proposal is outstanding and all followers have caught up, it's treated as `now()`). By taking the min of that over all ranges and waiting for it to surpass the time at which the node was restarted, we know that every range has either caught up past the downtime or is fully caught up.

Armed with this, it remains to figure out the UX. Obtaining the timestamp on restart is easy, but comparing the metrics is finicky - they're stored per node (plus in the tsKV store), so we'll have to write some code to perform the validation. I am of the opinion, however, that we must not leave this work to the operator - we should not ask users to run internal SQL queries against each node and wait for a particular result. It should be as easy as restarting the node and waiting for a generic readiness signal.
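To make the proposal concrete, here is a rough Go sketch of the per-range bookkeeping and the node-level signal derived from it. All names (`rangeCatchup`, `onFollowerProgress`, `caughtUpSince`) are illustrative, not actual code, and the question of where exactly this hooks into raft progress handling is left open:

```go
// Sketch of the proposed signal: each range records the time at which a
// proposal was last fully replicated (or now() if nothing is outstanding and
// all followers are caught up). The node-level readiness signal is satisfied
// once the minimum of that timestamp across all of its ranges exceeds the
// time the node came back up.
package main

import (
	"sync"
	"time"
)

// rangeCatchup is hypothetical per-range bookkeeping.
type rangeCatchup struct {
	mu                  sync.Mutex
	lastFullyReplicated time.Time
}

// onFollowerProgress would be driven by raft progress updates: if no proposal
// is outstanding and all followers have acked the latest entry, the timestamp
// advances to now(); otherwise it stays at the time of the last proposal that
// was fully replicated.
func (rc *rangeCatchup) onFollowerProgress(allCaughtUp bool, lastReplicatedAt time.Time) {
	rc.mu.Lock()
	defer rc.mu.Unlock()
	if allCaughtUp {
		rc.lastFullyReplicated = time.Now()
	} else {
		rc.lastFullyReplicated = lastReplicatedAt
	}
}

// caughtUpSince reports whether every tracked range has fully replicated
// something after the given restart time, i.e. whether the node has caught up
// past its downtime on all of its ranges (equivalently: min over ranges of
// lastFullyReplicated > restartedAt).
func caughtUpSince(ranges []*rangeCatchup, restartedAt time.Time) bool {
	for _, rc := range ranges {
		rc.mu.Lock()
		ts := rc.lastFullyReplicated
		rc.mu.Unlock()
		if !ts.After(restartedAt) {
			return false
		}
	}
	return true
}
```

A node could then feed `caughtUpSince` into its readiness endpoint, which keeps the operator UX as simple as "restart and wait for ready" rather than requiring per-node metric comparisons.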
cc @johnrk
cc @lnhsingh
cc @joshimhoff - do you see anything I missed there or is what CC is doing right now different from what I describe above?