
server,kv: IsLive() does not check the Draining flag #45123

Open
knz opened this issue Feb 14, 2020 · 6 comments
Labels
A-kv-decom-rolling-restart Decommission and Rolling Restarts A-kv-gossip A-kv-server Relating to the KV-level RPC server C-investigation Further steps needed to qualify. C-label will change. T-kv KV Team

Comments

@knz
Contributor

knz commented Feb 14, 2020

The method (*NodeLiveness).IsLive(nodeID) does not check the Draining flag.

So it's possible for this to return "true" while a node is in the process of shutting down.

(Found this when looking at the /health RPC, which says isHealthy := IsLive() && !Draining)

The IsLive() method is used in only the following two places (besides /health):

  • (q *consistencyQueue) shouldQueue()
  • (p *pendingLeaseRequest) requestLeaseAsync()

If my reading of the code is correct, this means that a node currently in the process of quitting (and thus getting rid of its leases) can still start scanning ranges for consistency, and can request leases for some leaderless range.

The latter in particular is probably undesirable: if the quitting node does not currently hold a lease, acquires it, and then fails to release it in time as it finishes shutting down, some clients may experience a blip in latency.

It sounds to me like the IsLive() method should take the Draining flag into account in any case.
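To make the suggestion concrete, here is a minimal sketch (using a hypothetical, simplified Liveness type rather than the real NodeLiveness code) of what folding the Draining flag into the check could look like:

package main

import (
	"fmt"
	"time"
)

// Liveness is a simplified, hypothetical stand-in for the real liveness
// record; only the fields relevant to this discussion are modeled.
type Liveness struct {
	Expiration time.Time
	Draining   bool
}

// IsLive reports whether the record has not expired yet (roughly what the
// current check answers).
func (l Liveness) IsLive(now time.Time) bool {
	return now.Before(l.Expiration)
}

// IsLiveNotDraining is the combined check suggested above: a node that is
// in the process of shutting down is not offered as a candidate for new work.
func (l Liveness) IsLiveNotDraining(now time.Time) bool {
	return l.IsLive(now) && !l.Draining
}

func main() {
	now := time.Now()
	l := Liveness{Expiration: now.Add(9 * time.Second), Draining: true}
	fmt.Println(l.IsLive(now))            // true: the record has not expired
	fmt.Println(l.IsLiveNotDraining(now)) // false: the node is shutting down
}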

cc @tbg can you check this? If yes I can issue the PR.

Jira issue: CRDB-5177

@knz knz added C-investigation Further steps needed to qualify. C-label will change. A-kv-gossip A-kv-server Relating to the KV-level RPC server labels Feb 14, 2020
@tbg
Member

tbg commented Feb 19, 2020

Confusing naming aside, a liveness record should be considered live even if it is draining. Take a look at this code:

if live, liveErr := p.repl.store.cfg.NodeLiveness.IsLive(nextLeaseHolder.NodeID); !live || liveErr != nil {
	err = errors.Errorf("not incrementing epoch on n%d because next leaseholder (n%d) not live (err = %v)",
		status.Liveness.NodeID, nextLeaseHolder.NodeID, liveErr)
	if log.V(1) {
		log.Info(ctx, err)
	}
} else if err = p.repl.store.cfg.NodeLiveness.IncrementEpoch(ctx, status.Liveness); err != nil {

We are allowed to increment the liveness record's epoch (i.e. kick it out) when it is non-live. This is essentially a hostile take-over reserved for cases in which the old leaseholder becomes unavailable. Doing this before the liveness is actually expired would do little more than disrupt the cluster by making sure nobody can serve requests until the lease is expired.

Taking a step back, the problem is that many consumers of node liveness expect it to provide "liveness" in a more holistic sense, which it currently does not do. I think what we need to do here is to make everyone consume an API that deals in this enum:

const (
	NodeLivenessStatus_UNKNOWN NodeLivenessStatus = 0
	// DEAD indicates the node is considered dead.
	NodeLivenessStatus_DEAD NodeLivenessStatus = 1
	// UNAVAILABLE indicates that the node is unavailable - it has not updated its
	// liveness record recently enough to be considered live, but has not been
	// unavailable long enough to be considered dead.
	NodeLivenessStatus_UNAVAILABLE NodeLivenessStatus = 2
	// LIVE indicates a live node.
	NodeLivenessStatus_LIVE NodeLivenessStatus = 3
	// DECOMMISSIONING indicates a node that is in the decommissioning process.
	NodeLivenessStatus_DECOMMISSIONING NodeLivenessStatus = 4
	// DECOMMISSIONED indicates a node that has finished the decommissioning
	// process.
	NodeLivenessStatus_DECOMMISSIONED NodeLivenessStatus = 5
)

currently used in this randomly placed method

func LivenessStatus(
	l storagepb.Liveness, now time.Time, deadThreshold time.Duration,
) storagepb.NodeLivenessStatus {
	if l.IsDead(now, deadThreshold) {
		if l.Decommissioning {
			return storagepb.NodeLivenessStatus_DECOMMISSIONED
		}
		return storagepb.NodeLivenessStatus_DEAD
	}
	if l.Decommissioning {
		return storagepb.NodeLivenessStatus_DECOMMISSIONING
	}
	if l.Draining {
		return storagepb.NodeLivenessStatus_UNAVAILABLE
	}
	if l.IsLive(now) {
		return storagepb.NodeLivenessStatus_LIVE
	}
	return storagepb.NodeLivenessStatus_UNAVAILABLE
}

and also picked up in some http endpoints via

cockroach/pkg/server/admin.go (lines 1263 to 1273 in f1362c4):

// getLivenessStatusMap generates a map from NodeID to LivenessStatus for all
// nodes known to gossip. Nodes that haven't pinged their liveness record for
// more than server.time_until_store_dead are considered dead.
//
// To include all nodes (including ones not in the gossip network), callers
// should consider calling (statusServer).NodesWithLiveness() instead where
// possible.
//
// getLivenessStatusMap() includes removed nodes (dead + decommissioned).
func getLivenessStatusMap(
	nl *storage.NodeLiveness, now time.Time, st *cluster.Settings,

We should stop passing NodeLiveness and Liveness around as much as we do now. We can wrap it in something that exposes a much narrower API that is actually useful.
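As a rough illustration of what such a narrower API could look like (the wrapper type and method names below are invented for this sketch, using the storagepb types from the snippets above; this is not existing code):

// nodeVitality is a hypothetical wrapper around a liveness record that
// answers the specific questions call sites actually have, instead of
// handing them the raw NodeLiveness/Liveness types.
type nodeVitality struct {
	liveness storagepb.Liveness
	now      time.Time
}

// IsAvailableForWork reports whether the node should be handed new work:
// live, not draining, not decommissioning.
func (v nodeVitality) IsAvailableForWork() bool {
	return v.liveness.IsLive(v.now) && !v.liveness.Draining && !v.liveness.Decommissioning
}

// CanIncrementEpoch reports whether the record may be "kicked out" by an
// epoch increment; as discussed above, this only cares about expiration,
// not about the Draining flag.
func (v nodeVitality) CanIncrementEpoch() bool {
	return !v.liveness.IsLive(v.now)
}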

There could be a DRAINING state in the enum, returned instead of LIVE when the Draining flag is additionally set. This hasn't been relevant so far because our draining period is pretty short and the readiness endpoint already gets the draining signal from (*Server).grpc.mode, but it might be good to expose it now that our draining periods may become longer.
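A minimal sketch of how that could slot into the classification above, assuming a hypothetical NodeLivenessStatus_DRAINING value were added to the enum (this is a sketch, not the actual code):

func livenessStatusWithDraining(
	l storagepb.Liveness, now time.Time, deadThreshold time.Duration,
) storagepb.NodeLivenessStatus {
	if l.IsDead(now, deadThreshold) {
		if l.Decommissioning {
			return storagepb.NodeLivenessStatus_DECOMMISSIONED
		}
		return storagepb.NodeLivenessStatus_DEAD
	}
	if l.Decommissioning {
		return storagepb.NodeLivenessStatus_DECOMMISSIONING
	}
	if l.IsLive(now) {
		if l.Draining {
			// Hypothetical new state: live, but in the process of shutting down.
			return storagepb.NodeLivenessStatus_DRAINING
		}
		return storagepb.NodeLivenessStatus_LIVE
	}
	return storagepb.NodeLivenessStatus_UNAVAILABLE
}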

Thanks for reminding me of the work to do here. The interfaces here are quite lacking; I hope my explanations make some sense.

@andreimatei
Contributor

andreimatei commented Feb 19, 2020

Beware of that enum. Call site after call site was using it wrong because there are too many states and different people care about different subsets, plus different timeout thresholds. I've corrected a few and in fact worked to reduce reliance on it in favor of more raw usage of the Liveness struct.

@tbg
Member

tbg commented Feb 19, 2020

@andreimatei can you give me some examples (not urgent)? You seem to have come to the opposite of my conclusion so we must have something to learn from each other.

@andreimatei
Contributor

Well, just look at the diagram I put on LivenessStatus and how the transitions depend on the draining flag to see that the statuses are complicated.

It's been my experience that different people care about different things. For some, draining makes a difference; for others, it is irrelevant. Similarly, decommissioning might or might not be relevant. Then there's the deadThreshold business, which is an extra dimension on top of the already confusingly many states; when one sees so many excruciatingly exhaustive states, one really doesn't expect to be able to customize them with some interval. Then DECOMMISSIONED/DEAD are hard to understand, and depending on how one got their hands on a liveness record, these states might or might not even be possible.
I think we should force everybody to be as explicit as possible about what they care about and what they don't.
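For instance, being explicit could mean each call site spelling out only the dimensions it cares about directly against the Liveness record, along these lines (a hedged sketch with made-up function names, reusing the storagepb types from the snippets above):

// A consistency-scan-style caller might care about draining and decommissioning:
func okToScanForConsistency(l storagepb.Liveness, now time.Time) bool {
	return l.IsLive(now) && !l.Draining && !l.Decommissioning
}

// An epoch-increment-style caller only cares about expiration:
func okToIncrementEpoch(l storagepb.Liveness, now time.Time) bool {
	return !l.IsLive(now)
}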

For examples of someone getting it wrong, see the commit msg on #43825 and #43873. And I feel like I've massaged others too.

@andreimatei
Contributor

Also see #43889 for a fix I made specifically to the call site that prompted this conversation, which was getting decommissioning wrong.

@lunevalex lunevalex added the A-kv-decom-rolling-restart Decommission and Rolling Restarts label Jul 28, 2020
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
craig bot pushed a commit that referenced this issue Jun 25, 2021
66632: *: check node decommissioned/draining state for DistSQL/consistency  r=tbg,knz a=erikgrinaker

The DistSQL planner and consistency queue did not take the nodes'
decommissioned or draining states into account, which in particular
could cause spurious errors when interacting with decommissioned nodes.

This patch adds convenience methods for checking node availability and
draining states, and avoids scheduling DistSQL flows on
unavailable nodes and consistency checks on unavailable/draining nodes.

Touches #66586, touches #45123.

Release note (bug fix): Avoid interacting with decommissioned nodes
during DistSQL planning and consistency checking.

/cc @cockroachdb/kv 

Co-authored-by: Erik Grinaker <[email protected]>
@AlexTalks
Contributor

I think this issue may be moot now that #105572 is merged.
