kvserver: add support for allocator range check via store #92367
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @kvoli)
pkg/kv/kvserver/store.go
line 3687 at r3 (raw file):
conf, err := confReader.GetSpanConfigForKey(ctx, desc.StartKey)
@irfansharif added you to the review to double check this span config lookup seems ok.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/store.go
line 3687 at r3 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
@irfansharif added you to the review to double check this span config lookup seems ok.
I've not kept up with how this API is going to get used. Are you invoking it on stores that actually hold replicas of the descriptor in question? If so, and you just want the associated span config, it already hangs off the Replica type, accessible using r.SpanConfig(): https://github.com/cockroachdb/cockroach/blob/1d4f79b52b56e68969cc5b310658d3f16905d7cf/pkg/kv/kvserver/replica.go#L439-L440.
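For concreteness, a minimal sketch of the two lookup paths being contrasted here; the types below are simplified stand-ins for illustration, not the actual kvserver signatures:

```go
package sketch

import "context"

// Simplified stand-ins for the kvserver types under discussion; the real
// definitions live in pkg/kv/kvserver and pkg/roachpb.
type SpanConfig struct{ NumReplicas int32 }
type RKey []byte

// A Replica caches the span config that applies to it, so a store that
// holds the replica can read it directly.
type Replica struct{ conf SpanConfig }

func (r *Replica) SpanConfig() SpanConfig { return r.conf }

// The conf reader path instead does a key-indexed lookup, which works even
// when no local replica exists.
type ConfReader interface {
	GetSpanConfigForKey(ctx context.Context, key RKey) (SpanConfig, error)
}

// configFor prefers the replica's cached config when the store holds one,
// falling back to the reader otherwise.
func configFor(ctx context.Context, r *Replica, reader ConfReader, start RKey) (SpanConfig, error) {
	if r != nil {
		return r.SpanConfig(), nil
	}
	return reader.GetSpanConfigForKey(ctx, start)
}
```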
Aside: Today the conf reader holds span configs for the entire key space, but that's not a necessity. If we want to push toward much higher range counts and reduce the O(# of ranges) data structures we hold in memory, we'll want the conf reader to only hold in-memory data for the ranges it cares about (i.e. ones held on the local store). Do with that as you will.
Aside: I didn't quite follow why we needed CheckRangeAction. We say it's "similar to PlanOneChange, but range-scoped rather than replica-scoped", but PlanOneChange is quite intricate and, importantly, only does planning, which is what we effectively want here. Can we refactor things to reduce the amount of code duplication? If given just a RangeDescriptor, we can always look up the locally held replica (see the sketch below). I'll defer to you and @kvoli since I'm just a passer-by here.
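A rough sketch of that refactor, with hypothetical names (getReplicaIfLocal and the rest are stand-ins, not the real kvserver API): resolve a locally held replica from the descriptor and reuse the replica-scoped planning path whenever one exists:

```go
package sketch

import "fmt"

type RangeID int64
type RangeDescriptor struct{ RangeID RangeID }
type Replica struct{}
type AllocatorAction int

type Store struct{ replicas map[RangeID]*Replica }

// getReplicaIfLocal resolves a locally held replica for the descriptor.
func (s *Store) getReplicaIfLocal(desc *RangeDescriptor) (*Replica, bool) {
	r, ok := s.replicas[desc.RangeID]
	return r, ok
}

// checkRangeAction shares the replica-scoped planning logic whenever the
// store holds a replica, keeping the range-scoped path as the exception.
func (s *Store) checkRangeAction(desc *RangeDescriptor) (AllocatorAction, error) {
	if r, ok := s.getReplicaIfLocal(desc); ok {
		return s.planOneChange(r)
	}
	return 0, fmt.Errorf("r%d: no local replica; range-scoped planning required", desc.RangeID)
}

// planOneChange stands in for the existing replica-scoped planning code.
func (s *Store) planOneChange(r *Replica) (AllocatorAction, error) { return 0, nil }
```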
Aside: I see that in #91965 we're introducing a new use of NodeLivenessStatus. We shouldn't: it's bad (it has ambiguous semantics, combining a node's cluster membership with its process state), and it's something we've moved away from everywhere but the allocator. It'd be unfortunate to regress here again. See this comment and the linked issue:
cockroach/pkg/kv/kvserver/liveness/livenesspb/liveness.proto
Lines 107 to 120 in d17382d
// NodeLivenessStatus describes the status of a node from the perspective of the
// liveness system. See comment on LivenessStatus() for a description of the
// states.
//
// TODO(irfansharif): We should reconsider usage of NodeLivenessStatus.
// It's unclear if the enum is well considered. It enumerates across two
// distinct set of things: the "membership" status (live/active,
// decommissioning, decommissioned), and the node "process" status (live,
// unavailable, available). It's possible for two of these "states" to be true,
// simultaneously (consider a decommissioned, dead node). It makes for confusing
// semantics, and the code attempting to disambiguate across these states
// (kvserver.LivenessStatus() for e.g.) seem wholly arbitrary.
//
// See #50707 for more details.
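To make the ambiguity concrete, here's a sketch (simplified stand-in types, not the actual livenesspb definitions) of the two orthogonal dimensions the enum conflates; a decommissioned, dead node is one value along each axis, not a single combined state:

```go
package sketch

// Membership tracks a node's cluster membership, independent of its process.
type Membership int

const (
	Active Membership = iota
	Decommissioning
	Decommissioned
)

// ProcessState tracks whether the node's process is currently live.
type ProcessState int

const (
	ProcessLive ProcessState = iota
	ProcessUnavailable
	ProcessDead
)

// NodeStatus keeps the two dimensions separate. A decommissioned, dead node
// is simply {Decommissioned, ProcessDead}; collapsing the pair into a single
// NodeLivenessStatus value forces arbitrary tie-breaking.
type NodeStatus struct {
	Membership Membership
	Process    ProcessState
}
```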
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/store.go
line 3687 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I've not kept up with how this API is going to get used. Are you invoking it on stores that actually hold replicas of the descriptor in question? […]
@irfansharif The intent is for the store NOT to need to hold a replica in order to be able to process it. This allows us to evaluate readiness to decommission without cluster-wide RPCs: we can simply iterate through the range descriptors in meta1/meta2 and check whether each one that has a replica on the decommissioning node(s) can be moved.
This is why the distinct code path is needed. Given this, it also means that the conf reader can read span configs over the entire key space as well. I'm not sure how to deal with potential future changes to this interface, though perhaps some method of looking up a span config (regardless of whether a store has a replica to which it applies) will be necessary?
I'm not sure this is actually introducing a new usage of NodeLivenessStatus so much as allowing an override for the purposes of the allocator. This is literally the method the allocator calls to evaluate status for its purposes; we have simply added a way to evaluate possible states (rather than only the current state).
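A sketch of what such an override could look like; the names here are illustrative, not the actual allocator API. The idea: the allocator consults a per-node liveness function, and an override wraps it to pin chosen nodes to a hypothetical status:

```go
package sketch

type NodeID int32

// NodeLivenessStatus stands in for the existing enum under discussion.
type NodeLivenessStatus int

const (
	StatusLive NodeLivenessStatus = iota
	StatusDecommissioning
)

// LivenessFunc is the per-node hook the allocator consults.
type LivenessFunc func(NodeID) NodeLivenessStatus

// withOverride wraps a liveness function, forcing the given nodes to a
// hypothetical status so possible future states can be evaluated without
// changing actual cluster state.
func withOverride(base LivenessFunc, status NodeLivenessStatus, nodes ...NodeID) LivenessFunc {
	overridden := make(map[NodeID]bool, len(nodes))
	for _, n := range nodes {
		overridden[n] = true
	}
	return func(n NodeID) NodeLivenessStatus {
		if overridden[n] {
			return status
		}
		return base(n)
	}
}
```

Asking "what if n3 were decommissioning?" then amounts to passing withOverride(base, StatusDecommissioning, 3), with no change to actual cluster state.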
This change adds methods to evaluate allocator actions and targets utilizing a passed-in `StorePool` object, allowing the allocator to consider potential scenarios rather than only those based on current liveness. Depends on cockroachdb#91461, cockroachdb#91965. Part of cockroachdb#91570. Release note: None
Adds support for using the allocator to compute the action, if any, needed to "repair" the range and, if the action requires a replica addition (including replacements), the target of that addition. This can be checked via the new `CheckRangeAction(..)` function as part of a store's replicate queue, using either the default store pool or an override store pool, so that potential states can be evaluated prior to transitioning to them. As such, this feature adds support for validating allocator actions and targets prior to decommission, in order to support decommission pre-checks. While this is similar to the replicate queue's `PlanOneChange(..)`, the new check supports evaluation based on a range descriptor rather than an individual replica. Depends on cockroachdb#91941. Part of cockroachdb#91570. Release note: None
This change exposes support via a store for checking the allocator action and upreplication target (if applicable) for any range descriptor. The range does not need to have a replica on the given store, nor does the check need to evaluate against the current state of the cluster (i.e. the store's configured `StorePool`), as a node liveness override can be provided in order to evaluate possible future states. Depends on cockroachdb#92176. Part of cockroachdb#91570. Release note: None
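In rough terms, the usage shape these commit messages describe might look like the following sketch (simplified stand-in types; the actual CheckRangeAction(..) signature in the PR may differ):

```go
package sketch

import (
	"context"
	"fmt"
)

// Simplified stand-ins for the real kvserver/roachpb types.
type RangeDescriptor struct{}
type AllocatorAction int
type ReplicationTarget struct{ NodeID, StoreID int32 }
type StorePool struct{}

type Store struct{}

// CheckRangeAction is sketched as: compute the repair action for any range
// descriptor and, when the action adds a replica, the chosen target. The
// pool argument lets callers substitute an override store pool.
func (s *Store) CheckRangeAction(
	ctx context.Context, pool *StorePool, desc RangeDescriptor,
) (AllocatorAction, ReplicationTarget, error) {
	return 0, ReplicationTarget{}, nil // allocator computation elided
}

func example(ctx context.Context, s *Store, overridePool *StorePool, desc RangeDescriptor) {
	// Evaluate against a hypothetical future state (e.g. a node marked
	// decommissioning) rather than the store's current view.
	action, target, err := s.CheckRangeAction(ctx, overridePool, desc)
	if err != nil {
		fmt.Println("range cannot be repaired:", err)
		return
	}
	fmt.Println("action:", action, "target:", target)
}
```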
This adds support for evaluating the decommission readiness of a node (or set of nodes) by simulating their liveness to have the DECOMMISSIONING status and utilizing the allocator to ensure that we are able to perform any actions needed to repair each range. This supports a "strict" mode, in which we expect all ranges to need only replacement or removal due to the decommissioning status, and a more permissive "non-strict" mode, which allows for other needed actions as long as they do not encounter errors in finding a suitable allocation target. The non-strict mode permits situations where a range may need more than one action to repair it, such as a range that needs to reach its replication factor before the decommissioning replica can be replaced, or a range that needs to finalize an atomic replication change. Depends on cockroachdb#92367. Part of cockroachdb#91571. Release note: None
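A simplified sketch of the readiness flow this commit message describes, with stand-in types (the real iteration would read descriptors from meta1/meta2 and consult the allocator):

```go
package sketch

import "context"

// Simplified stand-ins; the real types live in kvserver and roachpb.
type NodeID int32
type RangeDescriptor struct{ Replicas []NodeID }

// checker abstracts "can the allocator repair this range under the
// simulated (DECOMMISSIONING) liveness?".
type checker interface {
	checkRange(ctx context.Context, desc RangeDescriptor) error
}

// checkDecommissionReadiness sketches the flow described above: iterate
// range descriptors, and for each range with a replica on a decommissioning
// node, verify a repair action and target can be found. Ranges that fail
// are reported as blocking the decommission.
func checkDecommissionReadiness(
	ctx context.Context,
	descs []RangeDescriptor,
	decommissioning map[NodeID]bool,
	c checker,
) (blocked []RangeDescriptor) {
	for _, desc := range descs {
		affected := false
		for _, n := range desc.Replicas {
			if decommissioning[n] {
				affected = true
				break
			}
		}
		if !affected {
			continue
		}
		if err := c.checkRange(ctx, desc); err != nil {
			blocked = append(blocked, desc)
		}
	}
	return blocked
}
```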
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/store.go
line 3687 at r3 (raw file):
Check out what we populate in here:
cockroach/pkg/kv/kvserver/allocation_op.go
Lines 68 to 73 in 7d1793b
// AllocationChangeReplicasOp represents an operation to execute a change
// replicas txn.
type AllocationChangeReplicasOp struct {
	usage   allocator.RangeUsageInfo
	lhStore roachpb.StoreID
	chgs    roachpb.ReplicationChanges
It contains the target replica when we're {adding,removing} a {voter,non-voter}, something that CheckRangeAction seems to also now do, but through a different code path. But also, as your comment outlines, not always -- not when the local store is not a leaseholder, since the replicate queue only works on leaseholder replicas. I see in #93758 we're not using the target replica at all, so do we need it? I assume we did in order to bubble up errors when attempting allocator.Allocate{Voter,NonVoter}WithStorePool. But is that something we're surfacing when evaluating this RPC for replicas on non-leaseholder stores? What's our plan for that? Not returning this unused target would make the code clearer.
Stepping back: What's the reasoning behind avoiding cluster-wide RPCs? Some envelope math suggests decommissioning happens infrequently, probably <1000 times/year, for a cluster that can handle 100000x+ that in statements/year: ((1000*100000)/365)/24/60 ≈ 190 statements/min. So if it's for performance reasons, I suspect we can get away with cluster-wide RPCs given this infrequency. Is there another reason?

For cluster-wide RPCs, we only need ~1 round of them, right? Each node can inform the caller what it thinks of its leaseholder replicas that happen to have a replica on the potentially decommissioning node. The caller is free to aggregate across the responses. If replica/lease membership has changed during this fanout, we have a smaller set of ranges to re-evaluate on the leaseholder nodes. We can reduce the fanout by only targeting nodes that hold the same ranges as the decommissioning node.
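For what it's worth, that one-round fanout could be sketched as below; the names are hypothetical, and real code would use CockroachDB's RPC plumbing rather than a bare function:

```go
package sketch

import (
	"context"
	"sync"
)

type NodeID int32
type RangeID int64

// report is what each node returns about its own leaseholder replicas that
// overlap the decommissioning node.
type report struct {
	blocked []RangeID // ranges this node thinks cannot be repaired
}

// rpcFn stands in for the per-node RPC.
type rpcFn func(ctx context.Context, target NodeID) (report, error)

// gatherDecommissionChecks performs a single round of fanout: each targeted
// node evaluates its leaseholder replicas locally, and the caller merges
// the responses. Ranges whose leases moved mid-flight form a smaller set
// to re-check in a follow-up round.
func gatherDecommissionChecks(ctx context.Context, targets []NodeID, call rpcFn) []RangeID {
	var (
		mu  sync.Mutex
		all []RangeID
		wg  sync.WaitGroup
	)
	for _, n := range targets {
		wg.Add(1)
		go func(n NodeID) {
			defer wg.Done()
			rep, err := call(ctx, n)
			if err != nil {
				return // a sketch: real code would retry or surface the error
			}
			mu.Lock()
			all = append(all, rep.blocked...)
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	return all
}
```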
though perhaps some method of looking up a span config (regardless of if a store has a replica to which it applies) will be necessary?
This will need an RPC BTW.
Replaced by #94024.
This change exposes support via a store for checking the allocator action and upreplication target (if applicable) for any range descriptor. The range does not need to have a replica on the given store, nor does the check need to evaluate against the current state of the cluster (i.e. the store's configured StorePool), as a node liveness override can be provided in order to evaluate possible future states.

Depends on #92176.
Part of #91570.
Release note: None