storage: enable seamless rolling restarts #44206
Labels:
- A-kv-replication: Relating to Raft, consensus, and coordination.
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are the exception).
- O-sre: For issues SRE opened or otherwise cares about tracking.
Rolling restarts are our recommended method for cluster upgrades. From a high level, the operator will perform the following steps sequentially on all nodes in the cluster:

1. take the node offline gracefully
2. perform the maintenance (for an upgrade, swap in the new binary)
3. restart the node and wait for it to be healthy before moving on to the next node
The goal of this strategy is not only to keep the cluster available throughout the process, but also to avoid spikes in latency.
Internally, taking a node offline gracefully entails a sequence of steps (note the handling of `req.Ready` and the drain code referenced below):

- `cockroach/pkg/server/status.go`, lines 623 to 668 at 0e7f2f3
- `cockroach/pkg/server/server.go`, lines 2016 to 2018 at c76ad97
- `cockroach/pkg/server/server.go`, lines 1986 to 1999 at c76ad97
So what does this look like from an operator's POV? "take a node offline" is reasonably easy - remove from load balancers; initiate a graceful shutdown; hard-kill after a generous timeout. But "restart the node" is trickier - what to wait for before taking down the next node? Operators would certainly wait for the readiness/health endpoints, but those essentially greenlight the node once it is live again (as measured by node liveness).
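As an illustration of what "wait for readiness" looks like today, here is a minimal Go sketch that polls a node's HTTP readiness endpoint (assuming the standard `/health?ready=1` endpoint on the node's HTTP port) until it returns 200. The point of this issue is that a 200 here is not a sufficient signal:

```go
// waitUntilReady polls a node's readiness endpoint until it reports ready or
// the timeout elapses. This is roughly what operators (or k8s readiness
// probes) do between restarts today; note that it only tells us the node is
// live again, not that its replicas have caught up on their raft logs.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func waitUntilReady(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	url := fmt.Sprintf("http://%s/health?ready=1", addr)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // node considers itself live and ready
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("node %s not ready after %s", addr, timeout)
}

func main() {
	if err := waitUntilReady("localhost:8080", 5*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```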
This is not enough. Consider a range in a cluster that lives on three nodes n1, n2, and n3. n1 is taken offline first; n2 and n3 continue to accept write traffic. Let's say they're going at full speed, meaning that when n1 rejoins them it will take seconds (potentially a dozen or so) for it to catch up on the raft log. Now n1 is restarted and marks itself as live. It begins catching up on the log, but won't be up to date for another 10s or so. Then n2 gets taken down, and the range becomes unavailable: n3 (which has the latest entries and is now leader) cannot commit anything until n1 has caught up, which will take at least another couple of seconds. From the operator's point of view, large write latencies are observed on this range.
A similar phenomenon occurs when the replica requires a snapshot after coming back up (just replace "catching up on the log" by "receiving a snapshot"), see #37906.
This "catching up" is not cleanly exposed via the readiness probes. Our current attempt at a workaround (used in MSO) is roughly to monitor the
underreplicated
metric. However, that metric only counts followers that are not considered live. As we've established above, followers will be live when they're ready anyway, rendering this probe moot.We do have a
behind
metric, seecockroach/pkg/storage/replica_metrics.go
Lines 122 to 126 in 26a612a
however this metric is not on/off - it's a number that cannot safely be thresholded (in a perfectly healthy deployment, this number could be in the thousands, especially when some followers are always lagging). This makes it unusable as a readiness probe.
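For concreteness, the kind of workaround probe we'd be tempted to write against the `behind` metric looks roughly like the sketch below. It assumes the metric is exported as `raftlog_behind` on the node's Prometheus endpoint `/_status/vars` (names and labels are illustrative). The fundamental problem is the arbitrary threshold, not the plumbing:

```go
// Sketch of a (flawed) readiness probe that thresholds the node's aggregate
// "behind" metric scraped from the Prometheus endpoint.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

func behindEntries(addr string) (float64, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/_status/vars", addr))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	scanner := bufio.NewScanner(resp.Body)
	var total float64
	for scanner.Scan() {
		line := scanner.Text()
		// One line per store, e.g. `raftlog_behind{store="1"} 1234` (assumed format).
		if strings.HasPrefix(line, "raftlog_behind{") {
			fields := strings.Fields(line)
			v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
			if err != nil {
				return 0, err
			}
			total += v
		}
	}
	return total, scanner.Err()
}

func main() {
	behind, err := behindEntries("localhost:8080")
	if err != nil {
		panic(err)
	}
	// Any fixed threshold is a guess: a healthy cluster can legitimately sit
	// in the thousands, while a restarted node's lag is what we actually care about.
	const threshold = 1000
	fmt.Printf("behind=%v, below threshold=%v (not a reliable readiness signal)\n",
		behind, behind < threshold)
}
```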
Essentially, what we would want is for the recently restarted node to discover, for each of its ranges, whether it is still catching up on parts of the raft log created before the restart. This is not an easy problem to solve, though perhaps we can introduce a variant of the `behind` metric that can do the job. Roughly speaking, we want to track, per range, the timestamp at which a proposal was last fully replicated (if no proposal is outstanding and all followers have caught up, it's treated as `now()`). By taking the min of that over all ranges and waiting for it to surpass the time at which the node was restarted, we know that every range has either caught up past the downtime or is fully caught up.

Armed with this, it remains to figure out the UX. Obtaining the timestamp on restart is easy, but comparing the metrics is finicky - they're stored per node (plus in the tsKV store), so we'll have to write some code to perform the validation. I am of the opinion, however, that we must not leave this work to the operator - we should not ask users to run internal SQL queries against each node and wait for a particular result. It should be as easy as restarting the node and waiting for a generic readiness signal.
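To make the proposal concrete, here is a rough Go sketch of the per-range bookkeeping and the node-level signal derived from it. All names (`rangeCatchup`, `onFollowerProgress`, `caughtUpSince`) are illustrative, not actual code, and the question of where exactly this hooks into raft progress handling is left open:

```go
// Sketch of the proposed signal: each range records the time at which a
// proposal was last fully replicated (or now() if nothing is outstanding and
// all followers are caught up). The node-level readiness signal is satisfied
// once the minimum of that timestamp across all of its ranges exceeds the
// time the node came back up.
package main

import (
	"sync"
	"time"
)

// rangeCatchup is hypothetical per-range bookkeeping.
type rangeCatchup struct {
	mu                  sync.Mutex
	lastFullyReplicated time.Time
}

// onFollowerProgress would be driven by raft progress updates: if no proposal
// is outstanding and all followers have acked the latest entry, the timestamp
// advances to now(); otherwise it stays at the time of the last proposal that
// was fully replicated.
func (rc *rangeCatchup) onFollowerProgress(allCaughtUp bool, lastReplicatedAt time.Time) {
	rc.mu.Lock()
	defer rc.mu.Unlock()
	if allCaughtUp {
		rc.lastFullyReplicated = time.Now()
	} else {
		rc.lastFullyReplicated = lastReplicatedAt
	}
}

// caughtUpSince reports whether every tracked range has fully replicated
// something after the given restart time, i.e. whether the node has caught up
// past its downtime on all of its ranges (equivalently: min over ranges of
// lastFullyReplicated > restartedAt).
func caughtUpSince(ranges []*rangeCatchup, restartedAt time.Time) bool {
	for _, rc := range ranges {
		rc.mu.Lock()
		ts := rc.lastFullyReplicated
		rc.mu.Unlock()
		if !ts.After(restartedAt) {
			return false
		}
	}
	return true
}
```

A node could then feed `caughtUpSince` into its readiness endpoint, which keeps the operator UX as simple as "restart and wait for ready" rather than requiring per-node metric comparisons.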
cc @johnrk
cc @lnhsingh
cc @joshimhoff - do you see anything I missed there or is what CC is doing right now different from what I describe above?