
kvserver: add support for allocator range check via store #92367

Closed
wants to merge 3 commits from the dpf_store_check_range branch

Conversation

@AlexTalks (Contributor) commented Nov 23, 2022

This change exposes support via a store for checking the allocator
action and upreplication target (if applicable) for any range descriptor.
The range does not need to have a replica on the given store, nor is it
required to evaluate given the current state of the cluster (i.e. the
store's configured StorePool), as a node liveness override can be
provided in order to evaluate possible future states.

Depends on #92176.

Part of #91570.

Release note: None
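As a rough illustration of the check described above, here is a small self-contained Go sketch; the types and the checkRangeAction function below are simplified, hypothetical stand-ins used only for illustration, not the code added in this PR.

package main

import "fmt"

// Illustrative stand-ins only; these are not the types or signatures added by
// this PR, just a compact model of the behavior described above.
type NodeID int32

type StoreID int32

type ReplicaDescriptor struct {
	NodeID  NodeID
	StoreID StoreID
}

type RangeDescriptor struct {
	RangeID  int64
	Replicas []ReplicaDescriptor
}

type LivenessStatus int

const (
	Live LivenessStatus = iota
	Decommissioning
)

type AllocatorAction int

const (
	ActionNoop AllocatorAction = iota
	ActionReplaceDecommissioningVoter
)

// checkRangeAction models the store-level check: it takes any range
// descriptor (the store need not hold a replica of it) plus an optional
// liveness override, and returns the allocator action needed to repair the
// range and, for replica additions/replacements, a candidate target store.
func checkRangeAction(
	desc RangeDescriptor,
	livenessOverride map[NodeID]LivenessStatus, // simulated future liveness state
	stores map[StoreID]NodeID, // stores available as targets
) (AllocatorAction, StoreID) {
	existing := map[StoreID]bool{}
	for _, r := range desc.Replicas {
		existing[r.StoreID] = true
	}
	for _, r := range desc.Replicas {
		if livenessOverride[r.NodeID] == Decommissioning {
			// The decommissioning replica needs a replacement: pick any store
			// that does not already hold a replica and whose node is not
			// decommissioning.
			for sid, nid := range stores {
				if !existing[sid] && livenessOverride[nid] != Decommissioning {
					return ActionReplaceDecommissioningVoter, sid
				}
			}
			return ActionReplaceDecommissioningVoter, 0 // no suitable target found
		}
	}
	return ActionNoop, 0
}

func main() {
	desc := RangeDescriptor{RangeID: 7, Replicas: []ReplicaDescriptor{{1, 1}, {2, 2}, {3, 3}}}
	stores := map[StoreID]NodeID{1: 1, 2: 2, 3: 3, 4: 4}
	action, target := checkRangeAction(desc, map[NodeID]LivenessStatus{3: Decommissioning}, stores)
	fmt.Printf("action=%d target=s%d\n", action, target)
}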

@cockroach-teamcity (Member) commented: This change is Reviewable

@AlexTalks force-pushed the dpf_store_check_range branch 2 times, most recently from 094f7b5 to f9a087e on November 23, 2022 05:07
@AlexTalks force-pushed the dpf_store_check_range branch from f9a087e to abce17f on December 15, 2022 19:54
@AlexTalks marked this pull request as ready for review December 15, 2022 19:54
@AlexTalks requested a review from a team as a code owner December 15, 2022 19:54
@AlexTalks force-pushed the dpf_store_check_range branch 2 times, most recently from ee0b569 to 2ebb088 on December 17, 2022 00:57
@AlexTalks (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @kvoli)


pkg/kv/kvserver/store.go line 3687 at r3 (raw file):

	}

	conf, err := confReader.GetSpanConfigForKey(ctx, desc.StartKey)

@irfansharif added you to the review to double-check that this span config lookup seems OK.

@AlexTalks force-pushed the dpf_store_check_range branch from 2ebb088 to 27b9f38 on December 17, 2022 01:07
@irfansharif (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/store.go line 3687 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

@irfansharif added you to the review to double-check that this span config lookup seems OK.

I've not kept up with how this API is going to get used. Are you invoking it on stores that actually hold replicas of the descriptor in question? If so, and you just want the associated span config, it already hangs off the Replica type here, accessible using r.SpanConfig(): https://github.com/cockroachdb/cockroach/blob/1d4f79b52b56e68969cc5b310658d3f16905d7cf/pkg/kv/kvserver/replica.go#L439-L440.

Aside: Today the conf reader holds span configs of the entire key space, but that's not a necessity. If we want to push on a much higher number WRT range count and want to reduce the O(# of ranges) data structures we have in memory, we'll want the conf reader to only hold in memory data for ranges it cares about (i.e. ones held on the local store). Do with that as you will.

Aside: I didn't quite follow why we needed CheckRangeAction. We say it's "similar to PlanOneChange, but range-scoped rather than replica-scoped" but PlanOneChange is quite intricate, and importantly, only does planning, which is what we effectively want here. Can we refactor things to reduce the amount of code duplication? If given just a RangeDescriptor, we can always look up the replica locally held. I'll defer to you and @kvoli since I'm just a passer-by here.

Aside: I see that in #91965 we're introducing a new use of NodeLivenessStatus. We shouldn't, it's bad (ambiguous semantics in combining a node's cluster membership with its process state) and something we've moved away from everywhere but the allocator. It'd be unfortunate to regress here again. See this comment and the linked issue:

// NodeLivenessStatus describes the status of a node from the perspective of the
// liveness system. See comment on LivenessStatus() for a description of the
// states.
//
// TODO(irfansharif): We should reconsider usage of NodeLivenessStatus.
// It's unclear if the enum is well considered. It enumerates across two
// distinct set of things: the "membership" status (live/active,
// decommissioning, decommissioned), and the node "process" status (live,
// unavailable, available). It's possible for two of these "states" to be true,
// simultaneously (consider a decommissioned, dead node). It makes for confusing
// semantics, and the code attempting to disambiguate across these states
// (kvserver.LivenessStatus() for e.g.) seem wholly arbitrary.
//
// See #50707 for more details.
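To illustrate the two lookup paths discussed in this comment (a replica-local r.SpanConfig() versus a store-level config-reader lookup keyed by start key), here is a small self-contained Go sketch; all of the types and methods below are simplified stand-ins, not the real kvserver or spanconfig APIs.

package main

import (
	"context"
	"fmt"
)

// Stub types standing in for the real kvserver/spanconfig types; the method
// names loosely mirror the ones discussed above (Replica.SpanConfig,
// GetSpanConfigForKey), but everything here is simplified for illustration.
type SpanConfig struct{ NumReplicas int }

type Replica struct{ conf SpanConfig }

// SpanConfig returns the config already cached on the replica.
func (r *Replica) SpanConfig() SpanConfig { return r.conf }

type ConfReader struct{ byStartKey map[string]SpanConfig }

// GetSpanConfigForKey looks up the span config covering the given start key.
func (c *ConfReader) GetSpanConfigForKey(_ context.Context, startKey string) (SpanConfig, error) {
	conf, ok := c.byStartKey[startKey]
	if !ok {
		return SpanConfig{}, fmt.Errorf("no span config found for key %q", startKey)
	}
	return conf, nil
}

type Store struct {
	replicasByStartKey map[string]*Replica
	confReader         *ConfReader
}

// spanConfigForRange prefers the replica-local config when this store holds a
// replica of the range, and falls back to the store's config reader otherwise.
// The fallback is the path the change under review needs, since the store may
// not hold a replica of the descriptor being checked.
func (s *Store) spanConfigForRange(ctx context.Context, startKey string) (SpanConfig, error) {
	if r, ok := s.replicasByStartKey[startKey]; ok {
		return r.SpanConfig(), nil
	}
	return s.confReader.GetSpanConfigForKey(ctx, startKey)
}

func main() {
	s := &Store{
		replicasByStartKey: map[string]*Replica{"/Table/100": {conf: SpanConfig{NumReplicas: 3}}},
		confReader:         &ConfReader{byStartKey: map[string]SpanConfig{"/Table/200": {NumReplicas: 5}}},
	}
	ctx := context.Background()
	fmt.Println(s.spanConfigForRange(ctx, "/Table/100")) // replica-local path
	fmt.Println(s.spanConfigForRange(ctx, "/Table/200")) // config-reader path
}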

@AlexTalks (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/store.go line 3687 at r3 (raw file):

Previously, irfansharif (irfan sharif) wrote…

@irfansharif The intent is for the store to NOT need to hold a replica in order to evaluate the range. This allows us to evaluate decommission readiness without cluster-wide RPCs: we can simply iterate through the range descriptors in meta1/meta2 and check whether each one that has a replica on the decommissioning node(s) can be moved.

This is why the distinct code path is needed. It also means the conf reader needs to be able to read span configs over the entire key space. I'm not sure how to deal with potential future changes to this interface, though perhaps some method of looking up a span config (regardless of whether a store has a replica to which it applies) will be necessary?

I'm not sure this is actually introducing a new usage of NodeLivenessStatus so much as it's allowing for an override for the purposes of the allocator. This is literally the method the allocator calls to evaluate the status for its purposes; we have simply added a way to evaluate possible states rather than only the current state.
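A rough, self-contained sketch of that scan; decommissionPreCheck, canMoveReplica, and the simplified RangeDescriptor below are illustrative names only, not the actual CockroachDB APIs.

package main

import "fmt"

// Simplified model of the scan described above: iterate range descriptors
// (read from meta1/meta2 in the real system), keep only those with a replica
// on a decommissioning node, and ask a per-range check whether that replica
// could be moved. All names here are illustrative, not the actual APIs.
type RangeDescriptor struct {
	RangeID int64
	NodeIDs []int32 // nodes holding replicas of this range
}

// canMoveReplica stands in for the allocator-backed check; the real version
// would consult the span config and candidate stores via the StorePool.
func canMoveReplica(desc RangeDescriptor, decommissioning map[int32]bool, allNodes []int32) bool {
	held := map[int32]bool{}
	for _, n := range desc.NodeIDs {
		held[n] = true
	}
	for _, n := range allNodes {
		if !held[n] && !decommissioning[n] {
			return true // at least one eligible replacement target exists
		}
	}
	return false
}

// decommissionPreCheck returns the IDs of ranges that could not be repaired
// if the given nodes were marked as decommissioning.
func decommissionPreCheck(descs []RangeDescriptor, decommissioning map[int32]bool, allNodes []int32) []int64 {
	var blocked []int64
	for _, desc := range descs {
		for _, n := range desc.NodeIDs {
			if decommissioning[n] {
				if !canMoveReplica(desc, decommissioning, allNodes) {
					blocked = append(blocked, desc.RangeID)
				}
				break // each range only needs to be evaluated once
			}
		}
	}
	return blocked
}

func main() {
	descs := []RangeDescriptor{
		{RangeID: 1, NodeIDs: []int32{1, 2, 3}},
		{RangeID: 2, NodeIDs: []int32{2, 3, 4}},
	}
	allNodes := []int32{1, 2, 3, 4}
	// With n1 and n4 both decommissioning, neither range has an eligible
	// replacement target in this four-node cluster, so both are reported.
	fmt.Println(decommissionPreCheck(descs, map[int32]bool{1: true, 4: true}, allNodes))
}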

This change adds methods to evaluate allocator actions and targets
utilizing a passed-in `StorePool` object, allowing the allocator to
consider potential scenarios rather than only those based on the
current liveness.

Depends on cockroachdb#91461, cockroachdb#91965.

Part of cockroachdb#91570.

Release note: None
Adds support for using the allocator to compute the action, if any,
needed to "repair" the range and, if the action requires a replica
addition (including replacements), the target of that addition. The
check is performed via the new `CheckRangeAction(..)` function as part
of a store's replicate queue, and can use either the default store pool
or an override store pool, so that potential states can be evaluated
prior to transitioning to them. As such, this adds support for
allocator action and target validation prior to decommission, in order
to support decommission pre-checks.

While this is similar to the replicate queue's `PlanOneChange(..)`, this
new check supports evaluation based on a range descriptor, rather than
an individual replica.

Depends on cockroachdb#91941.

Part of cockroachdb#91570.

Release note: None
This change exposes support via a store for checking the allocator
action and upreplication target (if applicable) for any range descriptor.
The range does not need to have a replica on the given store, nor is it
required to evaluate given the current state of the cluster (i.e. the
store's configured `StorePool`), as a node liveness override can be
provided in order to evaluate possible future states.

Depends on cockroachdb#92176.

Part of cockroachdb#91570.

Release note: None
@AlexTalks force-pushed the dpf_store_check_range branch from 27b9f38 to 91701d0 on December 19, 2022 23:43
@blathers-crl (bot) requested a review from irfansharif December 19, 2022 23:43
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Dec 19, 2022
This adds support for evaluating the decommission readiness of a
node (or set of nodes) by simulating their liveness as having the
DECOMMISSIONING status and utilizing the allocator to ensure that we are
able to perform any actions needed to repair each affected range. This
supports a "strict" mode, in which we expect all ranges to need only
replacement or removal due to the decommissioning status, and a more
permissive "non-strict" mode, which allows for other actions as long as
they do not encounter errors in finding a suitable allocation target.
The non-strict mode lets us permit situations where a range may need
more than one action to repair it, such as a range that needs to reach
its replication factor before the decommissioning replica can be
replaced, or a range that needs to finalize an atomic replication
change.

Depends on cockroachdb#92367.

Part of cockroachdb#91571

Release note: None
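A compact, self-contained sketch of the strict vs. non-strict rule described in this commit message; passesDecommissionPreCheck and rangeCheckResult are made-up names for illustration, not the actual implementation.

package main

import (
	"errors"
	"fmt"
)

// AllocatorAction is a simplified stand-in for the allocator's action enum.
type AllocatorAction int

const (
	ActionNoop AllocatorAction = iota
	ActionAddVoter                        // e.g. range is under-replicated
	ActionReplaceDecommissioningVoter     // replica sits on a decommissioning node
	ActionFinalizeAtomicReplicationChange // pending change must be finalized first
)

// rangeCheckResult is what a per-range pre-check might report.
type rangeCheckResult struct {
	action    AllocatorAction
	targetErr error // non-nil if no suitable allocation target was found
}

// passesDecommissionPreCheck models the strict/non-strict rule: in strict
// mode only decommission-driven replacements (or no action at all) are
// acceptable; in non-strict mode any action is acceptable as long as a
// suitable target could be found.
func passesDecommissionPreCheck(res rangeCheckResult, strict bool) bool {
	if res.targetErr != nil {
		// In either mode, failing to find a suitable allocation target blocks
		// the decommission.
		return false
	}
	if strict {
		// Strict mode: only actions caused by the decommissioning status itself
		// are acceptable.
		return res.action == ActionNoop || res.action == ActionReplaceDecommissioningVoter
	}
	// Non-strict mode: any needed action is fine since a target was found.
	return true
}

func main() {
	underReplicated := rangeCheckResult{action: ActionAddVoter}
	noTarget := rangeCheckResult{
		action:    ActionReplaceDecommissioningVoter,
		targetErr: errors.New("no suitable allocation target"),
	}

	fmt.Println(passesDecommissionPreCheck(underReplicated, true))  // false: strict mode rejects unrelated actions
	fmt.Println(passesDecommissionPreCheck(underReplicated, false)) // true: non-strict tolerates them
	fmt.Println(passesDecommissionPreCheck(noTarget, false))        // false: no target found
}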
@irfansharif (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/store.go line 3687 at r3 (raw file):
Check out what we populate in here:

// AllocationChangeReplicasOp represents an operation to execute a change
// replicas txn.
type AllocationChangeReplicasOp struct {
	usage   allocator.RangeUsageInfo
	lhStore roachpb.StoreID
	chgs    roachpb.ReplicationChanges

It contains the target replica when we're {adding,removing} a {voter,non-voter}, something that CheckRangeAction also now seems to do, but through a different code path. But also, as your comment outlines, not always -- not when the local store is not a leaseholder, since the replicate queue only works on leaseholder replicas. I see in #93758 we're not using the target replica at all, so do we need it? I assume we did in order to bubble up errors when attempting to call allocator.Allocate{Voter,NonVoter}WithStorePool. But is that something we're surfacing when evaluating this RPC for replicas on non-leaseholder stores? What's our plan for that? Not returning this unused target would make the code clearer.

Stepping back: What's the reasoning behind avoiding cluster-wide RPCs? Some envelope math suggests decommissioning happens infrequently, probably <1000 times/year for a cluster that can handle 100000x+ that in statements/year: ((1000*100000)/365)/24/60 ≈ 190 statements/min (≈ 3 statements/sec). So if it's for performance reasons, I suspect we can get away with cluster-wide RPCs given this infrequency. Is there another reason? For cluster-wide RPCs, we only need ~1 round of them, right? Each node can inform the caller what it thinks of its leaseholder replicas that happen to have a replica on the potentially decommissioning node. The caller is free to aggregate across the responses. If replica/lease membership has changed during this fanout, we have a smaller set of ranges to re-evaluate on the leaseholder nodes. We can reduce the fanout by only targeting nodes that hold the same ranges as the decommissioning node.

though perhaps some method of looking up a span config (regardless of whether a store has a replica to which it applies) will be necessary?

This will need an RPC BTW.
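For reference, the envelope math above can be re-run as a tiny program, using the same hypothetical inputs (<1000 decommissions/year and 100000x that in statements/year):

package main

import "fmt"

func main() {
	decommissionsPerYear := 1000.0
	statementsPerYear := decommissionsPerYear * 100000 // 1e8
	secondsPerYear := 365.0 * 24 * 60 * 60
	perSec := statementsPerYear / secondsPerYear
	fmt.Printf("%.0f statements/min, %.1f statements/sec\n", perSec*60, perSec)
	// Prints: 190 statements/min, 3.2 statements/sec
}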

@AlexTalks (Contributor, Author) commented:

Replaced by #94024.

@AlexTalks closed this Jan 7, 2023