
server,kv: IsLive() does not check the Draining flag #45123

Open
knz opened this issue Feb 14, 2020 · 6 comments
Labels
A-kv-decom-rolling-restart Decommission and Rolling Restarts A-kv-gossip A-kv-server Relating to the KV-level RPC server C-investigation Further steps needed to qualify. C-label will change. T-kv KV Team

Comments

@knz
Contributor

knz commented Feb 14, 2020

The method (*NodeLiveness).IsLive(nodeID) does not check the Draining flag.

So it's possible for this to return "true" while a node is in the process of shutting down.

(Found this when looking at the /health RPC, which says isHealthy := IsLive() && !Draining)

The IsLive() method is used in only the following two places (besides /health):

  • (q *consistencyQueue) shouldQueue()
  • (p *pendingLeaseRequest) requestLeaseAsync()

If my reading of the code is correct, this means that a node currently in the process of quitting (and thus getting rid of its leases) can still start scanning ranges for consistency, and can request leases for some leaderless range.

The latter in particular is probably undesirable: if the quitting node does not currently hold a lease, acquires it, and then fails to release it in time as it finishes shutting down, some clients may experience a blip in latency.

It sounds to me like the IsLive() method should take the Draining flag into account in any case.
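To make the suggestion concrete, here is a minimal sketch (using a hypothetical, simplified Liveness type rather than the real NodeLiveness code) of what folding the Draining flag into the check could look like:

package main

import (
	"fmt"
	"time"
)

// Liveness is a simplified, hypothetical stand-in for the real liveness
// record; only the fields relevant to this discussion are modeled.
type Liveness struct {
	Expiration time.Time
	Draining   bool
}

// IsLive reports whether the record has not expired yet (roughly what the
// current check answers).
func (l Liveness) IsLive(now time.Time) bool {
	return now.Before(l.Expiration)
}

// IsLiveNotDraining is the combined check suggested above: a node that is
// in the process of shutting down is not offered as a candidate for new work.
func (l Liveness) IsLiveNotDraining(now time.Time) bool {
	return l.IsLive(now) && !l.Draining
}

func main() {
	now := time.Now()
	l := Liveness{Expiration: now.Add(9 * time.Second), Draining: true}
	fmt.Println(l.IsLive(now))            // true: the record has not expired
	fmt.Println(l.IsLiveNotDraining(now)) // false: the node is shutting down
}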

cc @tbg can you check this? If yes I can issue the PR.

Jira issue: CRDB-5177

@knz knz added C-investigation Further steps needed to qualify. C-label will change. A-kv-gossip A-kv-server Relating to the KV-level RPC server labels Feb 14, 2020
@tbg
Member

tbg commented Feb 19, 2020

Confusing naming aside, a liveness record should be considered live even if it is draining. Take a look at this code:

if live, liveErr := p.repl.store.cfg.NodeLiveness.IsLive(nextLeaseHolder.NodeID); !live || liveErr != nil {
	err = errors.Errorf("not incrementing epoch on n%d because next leaseholder (n%d) not live (err = %v)",
		status.Liveness.NodeID, nextLeaseHolder.NodeID, liveErr)
	if log.V(1) {
		log.Info(ctx, err)
	}
} else if err = p.repl.store.cfg.NodeLiveness.IncrementEpoch(ctx, status.Liveness); err != nil {

We are allowed to increment the liveness record's epoch (i.e. kick it out) when it is non-live. This is essentially a hostile take-over reserved for cases in which the old leaseholder becomes unavailable. Doing this before the liveness is actually expired would do little more than disrupt the cluster by making sure nobody can serve requests until the lease is expired.

Taking a step back, the problem is that many consumers of node liveness expect it to provide "liveness" in a more holistic sense, which it currently does not do. I think what we need to do here is to make everyone consume an API that deals in this enum:

const (
	NodeLivenessStatus_UNKNOWN NodeLivenessStatus = 0
	// DEAD indicates the node is considered dead.
	NodeLivenessStatus_DEAD NodeLivenessStatus = 1
	// UNAVAILABLE indicates that the node is unavailable - it has not updated its
	// liveness record recently enough to be considered live, but has not been
	// unavailable long enough to be considered dead.
	NodeLivenessStatus_UNAVAILABLE NodeLivenessStatus = 2
	// LIVE indicates a live node.
	NodeLivenessStatus_LIVE NodeLivenessStatus = 3
	// DECOMMISSIONING indicates a node that is in the decommissioning process.
	NodeLivenessStatus_DECOMMISSIONING NodeLivenessStatus = 4
	// DECOMMISSIONED indicates a node that has finished the decommissioning
	// process.
	NodeLivenessStatus_DECOMMISSIONED NodeLivenessStatus = 5
)

currently used in this randomly placed method

func LivenessStatus(
	l storagepb.Liveness, now time.Time, deadThreshold time.Duration,
) storagepb.NodeLivenessStatus {
	if l.IsDead(now, deadThreshold) {
		if l.Decommissioning {
			return storagepb.NodeLivenessStatus_DECOMMISSIONED
		}
		return storagepb.NodeLivenessStatus_DEAD
	}
	if l.Decommissioning {
		return storagepb.NodeLivenessStatus_DECOMMISSIONING
	}
	if l.Draining {
		return storagepb.NodeLivenessStatus_UNAVAILABLE
	}
	if l.IsLive(now) {
		return storagepb.NodeLivenessStatus_LIVE
	}
	return storagepb.NodeLivenessStatus_UNAVAILABLE
}

and also picked up in some http endpoints via

cockroach/pkg/server/admin.go (lines 1263 to 1273 in f1362c4):

// getLivenessStatusMap generates a map from NodeID to LivenessStatus for all
// nodes known to gossip. Nodes that haven't pinged their liveness record for
// more than server.time_until_store_dead are considered dead.
//
// To include all nodes (including ones not in the gossip network), callers
// should consider calling (statusServer).NodesWithLiveness() instead where
// possible.
//
// getLivenessStatusMap() includes removed nodes (dead + decommissioned).
func getLivenessStatusMap(
	nl *storage.NodeLiveness, now time.Time, st *cluster.Settings,

We should stop passing NodeLiveness and Liveness around as much as we do now. We can wrap it in something that exposes a much narrower API that is actually useful.
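As a rough illustration of what such a narrower API could look like (the wrapper type and method names below are invented for this sketch, using the storagepb types from the snippets above; this is not existing code):

// nodeVitality is a hypothetical wrapper around a liveness record that
// answers the specific questions call sites actually have, instead of
// handing them the raw NodeLiveness/Liveness types.
type nodeVitality struct {
	liveness storagepb.Liveness
	now      time.Time
}

// IsAvailableForWork reports whether the node should be handed new work:
// live, not draining, not decommissioning.
func (v nodeVitality) IsAvailableForWork() bool {
	return v.liveness.IsLive(v.now) && !v.liveness.Draining && !v.liveness.Decommissioning
}

// CanIncrementEpoch reports whether the record may be "kicked out" by an
// epoch increment; as discussed above, this only cares about expiration,
// not about the Draining flag.
func (v nodeVitality) CanIncrementEpoch() bool {
	return !v.liveness.IsLive(v.now)
}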

There could be a DRAINING state in the enum, returned instead of LIVE when the Draining flag is additionally set. This hasn't been relevant so far because our draining period is pretty short and the readiness endpoint already gets the draining signal from (*Server).grpc.mode, but it might be good to expose it now that our draining periods may become longer.
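A minimal sketch of how that could slot into the classification above, assuming a hypothetical NodeLivenessStatus_DRAINING value were added to the enum (this is a sketch, not the actual code):

func livenessStatusWithDraining(
	l storagepb.Liveness, now time.Time, deadThreshold time.Duration,
) storagepb.NodeLivenessStatus {
	if l.IsDead(now, deadThreshold) {
		if l.Decommissioning {
			return storagepb.NodeLivenessStatus_DECOMMISSIONED
		}
		return storagepb.NodeLivenessStatus_DEAD
	}
	if l.Decommissioning {
		return storagepb.NodeLivenessStatus_DECOMMISSIONING
	}
	if l.IsLive(now) {
		if l.Draining {
			// Hypothetical new state: live, but in the process of shutting down.
			return storagepb.NodeLivenessStatus_DRAINING
		}
		return storagepb.NodeLivenessStatus_LIVE
	}
	return storagepb.NodeLivenessStatus_UNAVAILABLE
}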

Thanks for reminding me of the work to do here. The interfaces here are quite lacking; I hope my explanations make some sense.

@andreimatei
Contributor

andreimatei commented Feb 19, 2020

Beware of that enum. Call site after call site was using it wrong because there are too many states and different people care about different subsets, plus different timeout thresholds. I've corrected a few and in fact worked to reduce reliance on it in favor of more raw usage of the Liveness struct.

@tbg
Member

tbg commented Feb 19, 2020

@andreimatei can you give me some examples (not urgent)? You seem to have come to the opposite of my conclusion so we must have something to learn from each other.

@andreimatei
Contributor

Well, just look at the diagram I put on LivenessStatus and how the transitions depend on the draining flag to see that the statuses are complicated.

It's been my experience that different people care about different things. For some, draining makes a difference; for others, it is irrelevant. Similarly, decommissioning might or might not be relevant. Then there's the deadThreshold business, which is an extra dimension on top of the already confusingly many states; when one sees so many excruciatingly exhaustive states, one really doesn't expect to be able to customize them with some interval. Then DECOMMISSIONED/DEAD are hard to understand, and depending on how one got their hands on a liveness record, these states might or might not even be possible.
I think we should force everybody to be as explicit as possible about what they care about and what they don't.
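For instance, being explicit could mean each call site spelling out only the dimensions it cares about directly against the Liveness record, along these lines (a hedged sketch with made-up function names, reusing the storagepb types from the snippets above):

// A consistency-scan-style caller might care about draining and decommissioning:
func okToScanForConsistency(l storagepb.Liveness, now time.Time) bool {
	return l.IsLive(now) && !l.Draining && !l.Decommissioning
}

// An epoch-increment-style caller only cares about expiration:
func okToIncrementEpoch(l storagepb.Liveness, now time.Time) bool {
	return !l.IsLive(now)
}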

For examples of someone getting it wrong, see the commit msg on #43825 and #43873. And I feel like I've massaged others too.

@andreimatei
Contributor

Also see #43889 for a fix I made specifically to the call site that prompted this conversation, which was getting decommissioning wrong.

@lunevalex lunevalex added the A-kv-decom-rolling-restart Decommission and Rolling Restarts label Jul 28, 2020
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
craig bot pushed a commit that referenced this issue Jun 25, 2021
66632: *: check node decommissioned/draining state for DistSQL/consistency  r=tbg,knz a=erikgrinaker

The DistSQL planner and consistency queue did not take the nodes'
decommissioned or draining states into account, which in particular
could cause spurious errors when interacting with decommissioned nodes.

This patch adds convenience methods for checking node availability and
draining states, and avoids scheduling DistSQL flows on
unavailable nodes and consistency checks on unavailable/draining nodes.

Touches #66586, touches #45123.

Release note (bug fix): Avoid interacting with decommissioned nodes
during DistSQL planning and consistency checking.

/cc @cockroachdb/kv 

Co-authored-by: Erik Grinaker <[email protected]>
@AlexTalks
Contributor

I think this issue may be moot now that #105572 is merged.
