
storage: enable seamless rolling restarts #44206

Closed
tbg opened this issue Jan 22, 2020 · 5 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking.

Comments

@tbg
Member

tbg commented Jan 22, 2020

Rolling restarts are our recommended method for cluster upgrades. From a high level, the operator will perform the following steps sequentially on all nodes in the cluster:

  • take a node offline
  • perform required maintenance
  • restart the node

The goal of this strategy is not only to keep the cluster available throughout the process, but also to avoid spikes in latency.

Internally, taking a node offline gracefully entails a sequence of steps:

  • first, the grpc server is set to draining, which lets this method return an error (if req.Ready is set):
    // Details returns node details.
    func (s *statusServer) Details(
        ctx context.Context, req *serverpb.DetailsRequest,
    ) (*serverpb.DetailsResponse, error) {
        if _, err := s.admin.requireAdminUser(ctx); err != nil {
            return nil, err
        }
        ctx = propagateGatewayMetadata(ctx)
        ctx = s.AnnotateCtx(ctx)
        nodeID, local, err := s.parseNodeID(req.NodeId)
        if err != nil {
            return nil, grpcstatus.Errorf(codes.InvalidArgument, err.Error())
        }
        telemetry.Inc(telemetryHealthCheck)
        if !local {
            status, err := s.dialNode(ctx, nodeID)
            if err != nil {
                return nil, err
            }
            return status.Details(ctx, req)
        }
        remoteNodeID := s.gossip.NodeID.Get()
        resp := &serverpb.DetailsResponse{
            NodeID:     remoteNodeID,
            BuildInfo:  build.GetInfo(),
            SystemInfo: s.si.systemInfo(ctx),
        }
        if addr, err := s.gossip.GetNodeIDAddress(remoteNodeID); err == nil {
            resp.Address = *addr
        }
        if addr, err := s.gossip.GetNodeIDSQLAddress(remoteNodeID); err == nil {
            resp.SQLAddress = *addr
        }
        // If Ready is not set, the client doesn't want to know whether this node is
        // ready to receive client traffic.
        if !req.Ready {
            return resp, nil
        }
        serveMode := s.admin.server.grpc.mode.get()
        if serveMode != modeOperational {
            return nil, grpcstatus.Error(codes.Unavailable, "node is not ready")
        }
    This means that health checks will consider this node unhealthy (i.e. load balancers should stop routing new traffic to it).
  • sleep for drainWait (default 0s; see https://github.com/cockroachdb/cockroach/blob/master/pkg/server/server.go#L2009) to give load balancers time to notice the failing health checks and move client traffic off this node
  • drain the pgServer, i.e. prevent new SQL connections and kill existing ones after queryWait:

    cockroach/pkg/server/server.go, lines 2016 to 2018 in c76ad97:

    if err := s.pgServer.Drain(drainMaxWait); err != nil {
        return err
    }
  • drain DistSQL, which entails gossiping a "draining" signal and tearing down flows after a timeout
  • drain the lease manager (i.e. drop table leases)
  • call into the KV layer to drain, i.e. attempt to transfer all range leases and raft leaderships away to other nodes (while at the same time refusing new ones), see

    cockroach/pkg/server/server.go, lines 1986 to 1999 in c76ad97:

    func (s *Server) doDrain(
        ctx context.Context, modes []serverpb.DrainMode, setTo bool,
    ) ([]serverpb.DrainMode, error) {
        for _, mode := range modes {
            switch mode {
            case serverpb.DrainMode_CLIENT:
                if setTo {
                    s.grpc.setMode(modeDraining)
                    // Wait for drainUnreadyWait. This will fail load balancer checks and
                    // delay draining so that client traffic can move off this node.
                    time.Sleep(drainWait.Get(&s.st.SV))
                }
                if err := func() error {
                    if !setTo {
    as well as storage: transfer leases and leadership more thoroughly on graceful shutdown (#44204). A condensed sketch of this whole sequence follows below.
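
To make the ordering above easier to follow, here is a condensed sketch of the drain sequence. The server struct and method names (setGRPCDraining, drainSQL, drainDistSQL, drainLeases, drainKV) are hypothetical stand-ins, not the real pkg/server API; only the sequencing mirrors the steps listed above.

    package main

    import (
        "context"
        "time"
    )

    type server struct {
        drainWait time.Duration // time to let load balancers notice the unready node
        queryWait time.Duration // grace period before killing remaining SQL sessions
    }

    func (s *server) gracefulDrain(ctx context.Context) error {
        // 1. Flip the grpc server to draining so that Details(..., Ready: true)
        //    returns Unavailable and health checks start failing.
        s.setGRPCDraining()
        // 2. Sleep for drainWait so load balancers stop routing new traffic here.
        time.Sleep(s.drainWait)
        // 3. Refuse new SQL connections; cancel remaining sessions after queryWait.
        if err := s.drainSQL(ctx, s.queryWait); err != nil {
            return err
        }
        // 4. Gossip a DistSQL "draining" signal and tear down flows after a timeout.
        if err := s.drainDistSQL(ctx); err != nil {
            return err
        }
        // 5. Drop table leases held by the lease manager.
        if err := s.drainLeases(ctx); err != nil {
            return err
        }
        // 6. Transfer range leases and raft leaderships away, refusing new ones.
        return s.drainKV(ctx)
    }

    // Stubs so the sketch is self-contained; the real logic lives in pkg/server.
    func (s *server) setGRPCDraining()                                        {}
    func (s *server) drainSQL(ctx context.Context, wait time.Duration) error  { return nil }
    func (s *server) drainDistSQL(ctx context.Context) error                  { return nil }
    func (s *server) drainLeases(ctx context.Context) error                   { return nil }
    func (s *server) drainKV(ctx context.Context) error                       { return nil }

    func main() {
        s := &server{drainWait: 0, queryWait: 10 * time.Second}
        _ = s.gracefulDrain(context.Background())
    }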

So what does this look like from an operator's POV? "take a node offline" is reasonably easy - remove from load balancers; initiate a graceful shutdown; hard-kill after a generous timeout. But "restart the node" is trickier - what to wait for before taking down the next node? Operators would certainly wait for the readiness/health endpoints, but those essentially greenlight the node once it is live again (as measured by node liveness).
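
Concretely, "wait for the readiness endpoint" typically amounts to polling the node's HTTP health endpoint until it reports ready again, e.g. /health?ready=1, which returns a non-200 status while the node is draining or not yet accepting clients. The address, insecure HTTP scheme, and timeouts in this sketch are assumptions for illustration:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // waitUntilReady polls the node's readiness endpoint until it returns 200 OK
    // or the deadline expires.
    func waitUntilReady(baseURL string, deadline time.Duration) error {
        client := &http.Client{Timeout: 2 * time.Second}
        stop := time.Now().Add(deadline)
        for time.Now().Before(stop) {
            resp, err := client.Get(baseURL + "/health?ready=1")
            if err == nil {
                resp.Body.Close()
                if resp.StatusCode == http.StatusOK {
                    return nil // node reports itself ready
                }
            }
            time.Sleep(time.Second)
        }
        return fmt.Errorf("node at %s did not become ready within %s", baseURL, deadline)
    }

    func main() {
        // Assumed local node address; in practice this would be each node in turn.
        if err := waitUntilReady("http://localhost:8080", 5*time.Minute); err != nil {
            fmt.Println(err)
        }
    }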

This is not enough. Consider a range in a cluster that lives on three nodes n1, n2, and n3. n1 is taken offline first; n2 and n3 continue to accept write traffic. Let's say they're going at full speed, meaning that when n1 rejoins it will take several seconds (potentially a dozen or so) until it has caught up on the raft log. Now n1 is restarted and marks itself as live. It begins catching up on the log, but won't be up to date for another 10s or so. n2 gets taken down; now the range is unavailable: we have n3 (which has the latest entries and is now leader), but it can't commit anything until n1 has caught up, which will take at least another couple of seconds. From the operator's point of view, large write latencies are observed on this range.
A similar phenomenon occurs when the replica requires a snapshot after coming back up (just replace "catching up on the log" with "receiving a snapshot"); see #37906.

This "catching up" is not cleanly exposed via the readiness probes. Our current attempt at a workaround (used in MSO) is roughly to monitor the underreplicated metric. However, that metric only counts followers that are not considered live, and as established above a restarted follower registers as live well before it has caught up, rendering this probe moot.
We do have a behind metric, see

// The raft leader computes the number of raft entries that replicas are
// behind.
if m.Leader {
    m.BehindCount = calcBehindCount(raftStatus, desc, livenessMap)
}

However, this metric is not on/off: it's a number that cannot safely be thresholded (in a perfectly healthy deployment it can be in the thousands, especially when some followers are always lagging). This makes it unusable as a readiness probe.

Essentially, what we would want is for the recently restarted node to discover, for each of its ranges, whether it is catching up on parts of the raft log created before the restart. This is not an easy problem to solve, though perhaps we can introduce a variant of the behind metric that can do the job. Roughly speaking, we want to track, per-range, the timestamp at which a proposal was last fully replicated (if no proposal is outstanding and all followers have caught up, it's treated as now()). Taking the min of that over all ranges and waiting for this to surpass the time at which the node was restarted, we know that all ranges have either caught up past the downtime or are fully caught up.
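
As a minimal sketch of that proposed gate (hypothetical names throughout, not existing CockroachDB code), assume the replication layer can maintain a per-range "last fully replicated" timestamp as described; node-level readiness then reduces to a comparison against the restart time:

    package main

    import (
        "fmt"
        "time"
    )

    // rangeState stands in for the per-replica bookkeeping: lastFullyReplicated
    // holds now() while the range is idle and fully replicated, otherwise the
    // time at which the most recent fully replicated proposal was acknowledged.
    type rangeState struct {
        lastFullyReplicated time.Time
    }

    // caughtUpSince reports whether every range has replicated past the given
    // restart time, i.e. whether this node has caught up on everything it
    // missed while it was down.
    func caughtUpSince(restartedAt time.Time, ranges []rangeState) bool {
        min := time.Now()
        for _, r := range ranges {
            if r.lastFullyReplicated.Before(min) {
                min = r.lastFullyReplicated
            }
        }
        return min.After(restartedAt)
    }

    func main() {
        restartedAt := time.Now().Add(-30 * time.Second) // assume the node restarted 30s ago
        ranges := []rangeState{
            {lastFullyReplicated: time.Now()},                        // idle, fully replicated
            {lastFullyReplicated: time.Now().Add(-45 * time.Second)}, // still catching up on pre-restart entries
        }
        fmt.Println("caught up:", caughtUpSince(restartedAt, ranges)) // false until the lagging range catches up
    }

If such a signal existed, it could feed the same generic readiness endpoint that load balancers and rollout automation already poll.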

Armed with this, it remains to figure out the UX. Obtaining the timestamp on restart is easy, but comparing the metrics is finicky - they're stored per-node (plus in the tsKV store) so we'll have to write some code to perform the validation. I am of the opinion, however, that we must not leave this work to the operator - we should not need to ask users to run certain internal SQL queries against each node and wait for a particular result. It should be as easy as restarting the node and waiting for a generic readiness signal.

cc @johnrk
cc @lnhsingh
cc @joshimhoff - do you see anything I missed there or is what CC is doing right now different from what I describe above?

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels Jan 22, 2020
@tbg tbg assigned knz and tbg Jan 22, 2020
@joshimhoff
Collaborator

cc @DuskEagle (author of our rollout automation) for the Q about SRE perspective on this issue.

Thanks for opening it, Tobias!

@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Jan 22, 2020
@ajwerner
Contributor

@knz to what extent has your recent work been in support of this?

@knz
Contributor

knz commented Apr 13, 2020

It supports 60% of it. Andrei and I are currently working on the remaining 40%.

@knz
Contributor

knz commented Aug 31, 2020

Subsuming this under #50199, #37906 and the remainder of the work described in #52593.
