-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: scatter with rf=1 doesn't randomize replica placement #124171
Comments
The distribution looks more reasonable with a small prototype (#124185) which directly relocates the range when processing a scatter request when the range only has 1 existing replica:
The distribution avoids the existing store entirely unfortunately, however. |
When a range with exactly 1 replica is scattered it could only add an additional replica to a valid store, without also removing the existing replica. This left ranges post-scatter over-replicated and not random -- as only the newly added replica would be randomly placed. The cause is the replicate queue falling back to adding a replica, instead of both adding and removing a replica in an atomic operation. Relocate the range, instead of processing via the replicate queue. The relocation carries out the multi-step add+remove. A remaining issue is that the existing replica will currently never be selected as a target when having an existing high replica count. Distribution without patch: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 1 | 1001 | ########## | 500 | ########## 5 | 291 | ### | 147 | ### 4 | 275 | ### | 137 | ### 3 | 229 | ### | 118 | ### 2 | 206 | ### | 99 | ## ``` Distribution with patch: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 2 | 326 | ########## | 326 | ########## 4 | 294 | ########## | 294 | ########## 3 | 193 | ###### | 193 | ###### 5 | 188 | ###### | 188 | ###### ``` Resolves: cockroachdb#124171 Release note: None
It might be possible to entirely lift the RF=1 rebalancing restriction (due to #74077), where we could just issue the change replicas operation as usual like: diff --git a/pkg/kv/kvserver/allocator/plan/util.go b/pkg/kv/kvserver/allocator/plan/util.go
index 33e066a04f5..974d9b45392 100644
--- a/pkg/kv/kvserver/allocator/plan/util.go
+++ b/pkg/kv/kvserver/allocator/plan/util.go
@@ -17,7 +17,6 @@ import (
"github.com/cockroachdb/cockroach/pkg/kv/kvserver/allocator/allocatorimpl"
"github.com/cockroachdb/cockroach/pkg/raft"
"github.com/cockroachdb/cockroach/pkg/roachpb"
- "github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/errors"
"github.com/cockroachdb/redact"
)
@@ -36,51 +35,6 @@ func ReplicationChangesForRebalance(
rebalanceTargetType allocatorimpl.TargetReplicaType,
) (chgs []kvpb.ReplicationChange, performingSwap bool, err error) {
rdesc, found := desc.GetReplicaDescriptor(addTarget.StoreID)
- if rebalanceTargetType == allocatorimpl.VoterTarget && numExistingVoters == 1 {
- // If there's only one replica, the removal target is the
- // leaseholder and this is unsupported and will fail. However,
- // this is also the only way to rebalance in a single-replica
- // range. If we try the atomic swap here, we'll fail doing
- // nothing, and so we stay locked into the current distribution
- // of replicas. (Note that maybeTransferLeaseAway above will not
- // have found a target, and so will have returned (false, nil).
- //
- // Do the best thing we can, which is carry out the addition
- // only, which should succeed, and the next time we touch this
- // range, we will have one more replica and hopefully it will
- // take the lease and remove the current leaseholder.
- //
- // It's possible that "rebalancing deadlock" can occur in other
- // scenarios, it's really impossible to tell from the code given
- // the constraints we support. However, the lease transfer often
- // does not happen spuriously, and we can't enter dangerous
- // configurations sporadically, so this code path is only hit
- // when we know it's necessary, picking the smaller of two evils.
- //
- // See https://github.com/cockroachdb/cockroach/issues/40333.
- log.KvDistribution.Infof(ctx, "can't swap replica due to lease; falling back to add")
-
- // Even when there is only 1 existing voter, there may be other replica
- // types in the range. Check if the add target already has a replica, if so
- // it must be a non-voter or the rebalance is invalid.
- if found && rdesc.Type == roachpb.NON_VOTER {
- // The receiving store already has a non-voting replica. Instead of just
- // adding a voter to the receiving store, we *must* promote the non-voting
- // replica to a voter.
- chgs = kvpb.ReplicationChangesForPromotion(addTarget)
- } else if !found {
- chgs = []kvpb.ReplicationChange{
- {ChangeType: roachpb.ADD_VOTER, Target: addTarget},
- }
- } else {
- return nil, false, errors.AssertionFailedf(
- "invalid rebalancing decision: trying to"+
- " move voter to a store that already has a replica %s for the range", rdesc,
- )
- }
- return chgs, false, err
- }
-
switch rebalanceTargetType {
case allocatorimpl.VoterTarget:
// Check if the target being added already has a non-voting replica. Testing the same reproduction yields a similar result to using relocate range, where the distribution is reasonable between 4/5 stores, but excludes the original leaseholder store entirely. |
With the queues re-enabled and running the repro again with the above patch, the distribution looks passable:
|
The allocator would add a voter, instead of both adding and removing the existing voter when rebalancing ranges with one replica. Removing the leaseholder replica was not possible prior to cockroachdb#74077, so the addition only was necessary. This restriction is no longer necessary. Allow rebalancing a one voter range between stores using joint configurations, where the lease will be transferred to the incoming voter store, from the outgoing demoting voter. Scattering ranges with one voter will now leave the range with exactly one voter, where previously both the leaseholder voter evaluating the scatter, and the new voter would be left. Before this patch, scattering 1000 ranges with RF=1 on a 5 store cluster: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 1 | 1001 | ########## | 500 | ########## 5 | 291 | ### | 147 | ### 4 | 275 | ### | 137 | ### 3 | 229 | ### | 118 | ### 2 | 206 | ### | 99 | ## ``` After: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 2 | 242 | ########## | 241 | ########## 4 | 227 | ########## | 227 | ########## 5 | 217 | ######### | 216 | ######### 3 | 209 | ######### | 208 | ######### 1 | 106 | ##### | 109 | ##### ``` Fixes: cockroachdb#108420 Fixes: cockroachdb#124171 Release note (bug fix): Scattering a range with replication factor=1, no longer erroneously up-replicates the range to two replicas. Leases will also no longer thrash between nodes when perturbed with replication factor=1.
The allocator would add a voter, instead of both adding and removing the existing voter when rebalancing ranges with one replica. Removing the leaseholder replica was not possible prior to cockroachdb#74077, so the addition only was necessary. This restriction is no longer necessary. Allow rebalancing a one voter range between stores using joint configurations, where the lease will be transferred to the incoming voter store, from the outgoing demoting voter. Scattering ranges with one voter will now leave the range with exactly one voter, where previously both the leaseholder voter evaluating the scatter, and the new voter would be left. Before this patch, scattering 1000 ranges with RF=1 on a 5 store cluster: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 1 | 1001 | ########## | 500 | ########## 5 | 291 | ### | 147 | ### 4 | 275 | ### | 137 | ### 3 | 229 | ### | 118 | ### 2 | 206 | ### | 99 | ## ``` After: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 2 | 242 | ########## | 241 | ########## 4 | 227 | ########## | 227 | ########## 5 | 217 | ######### | 216 | ######### 3 | 209 | ######### | 208 | ######### 1 | 106 | ##### | 109 | ##### ``` Fixes: cockroachdb#108420 Fixes: cockroachdb#124171 Release note (bug fix): Scattering a range with replication factor=1, no longer erroneously up-replicates the range to two replicas. Leases will also no longer thrash between nodes when perturbed with replication factor=1.
The allocator would add a voter, instead of both adding and removing the existing voter when rebalancing ranges with one replica. Removing the leaseholder replica was not possible prior to cockroachdb#74077, so the addition only was necessary. This restriction is no longer necessary. Allow rebalancing a one voter range between stores using joint configurations, where the lease will be transferred to the incoming voter store, from the outgoing demoting voter. Scattering ranges with one voter will now leave the range with exactly one voter, where previously both the leaseholder voter evaluating the scatter, and the new voter would be left. Before this patch, scattering 1000 ranges with RF=1 on a 5 store cluster: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 1 | 1001 | ########## | 500 | ########## 5 | 291 | ### | 147 | ### 4 | 275 | ### | 137 | ### 3 | 229 | ### | 118 | ### 2 | 206 | ### | 99 | ## ``` After: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 2 | 242 | ########## | 241 | ########## 4 | 227 | ########## | 227 | ########## 5 | 217 | ######### | 216 | ######### 3 | 209 | ######### | 208 | ######### 1 | 106 | ##### | 109 | ##### ``` Fixes: cockroachdb#108420 Fixes: cockroachdb#124171 Release note (bug fix): Scattering a range with replication factor=1, no longer erroneously up-replicates the range to two replicas. Leases will also no longer thrash between nodes when perturbed with replication factor=1.
The allocator would add a voter, instead of both adding and removing the existing voter when rebalancing ranges with one replica. Removing the leaseholder replica was not possible prior to cockroachdb#74077, so the addition only was necessary. This restriction is no longer necessary. Allow rebalancing a one voter range between stores using joint configurations, where the lease will be transferred to the incoming voter store, from the outgoing demoting voter. Scattering ranges with one voter will now leave the range with exactly one voter, where previously both the leaseholder voter evaluating the scatter, and the new voter would be left. Before this patch, scattering 1000 ranges with RF=1 on a 5 store cluster: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 1 | 1001 | ########## | 500 | ########## 5 | 291 | ### | 147 | ### 4 | 275 | ### | 137 | ### 3 | 229 | ### | 118 | ### 2 | 206 | ### | 99 | ## ``` After: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 2 | 242 | ########## | 241 | ########## 4 | 227 | ########## | 227 | ########## 5 | 217 | ######### | 216 | ######### 3 | 209 | ######### | 208 | ######### 1 | 106 | ##### | 109 | ##### ``` Fixes: cockroachdb#108420 Fixes: cockroachdb#124171 Release note (bug fix): Scattering a range with replication factor=1, no longer erroneously up-replicates the range to two replicas. Leases will also no longer thrash between nodes when perturbed with replication factor=1.
124284: kvserver: rebalance ranges with one voter using joint configurations r=nvanbenschoten a=kvoli The allocator would add a voter, instead of both adding and removing the existing voter when rebalancing ranges with one replica. Removing the leaseholder replica was not possible prior to #74077, so the addition only was necessary. This restriction is no longer necessary. Allow rebalancing a one voter range between stores using joint configurations, where the lease will be transferred to the incoming voter store, from the outgoing demoting voter. Scattering ranges with one voter will now leave the range with exactly one voter, where previously both the leaseholder voter evaluating the scatter, and the new voter would be left. Before this patch, scattering 1000 ranges with RF=1 on a 5 store cluster: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 1 | 1001 | ########## | 500 | ########## 5 | 291 | ### | 147 | ### 4 | 275 | ### | 137 | ### 3 | 229 | ### | 118 | ### 2 | 206 | ### | 99 | ## ``` After: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 2 | 242 | ########## | 241 | ########## 4 | 227 | ########## | 227 | ########## 5 | 217 | ######### | 216 | ######### 3 | 209 | ######### | 208 | ######### 1 | 106 | ##### | 109 | ##### ``` Fixes: #108420 Fixes: #124171 Release note (bug fix): Scattering a range with replication factor=1, no longer erroneously up-replicates the range to two replicas. Leases will also no longer thrash between nodes when perturbed with replication factor=1. 124444: packer: move `add-apt-repository` call to after `sleep` r=jlinder a=rickystewart Updates are still going on on the VM while this is happening. Epic: none Release note: None Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
The allocator would add a voter, instead of both adding and removing the existing voter when rebalancing ranges with one replica. Removing the leaseholder replica was not possible prior to #74077, so the addition only was necessary. This restriction is no longer necessary. Allow rebalancing a one voter range between stores using joint configurations, where the lease will be transferred to the incoming voter store, from the outgoing demoting voter. Scattering ranges with one voter will now leave the range with exactly one voter, where previously both the leaseholder voter evaluating the scatter, and the new voter would be left. Before this patch, scattering 1000 ranges with RF=1 on a 5 store cluster: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 1 | 1001 | ########## | 500 | ########## 5 | 291 | ### | 147 | ### 4 | 275 | ### | 137 | ### 3 | 229 | ### | 118 | ### 2 | 206 | ### | 99 | ## ``` After: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 2 | 242 | ########## | 241 | ########## 4 | 227 | ########## | 227 | ########## 5 | 217 | ######### | 216 | ######### 3 | 209 | ######### | 208 | ######### 1 | 106 | ##### | 109 | ##### ``` Fixes: #108420 Fixes: #124171 Release note (bug fix): Scattering a range with replication factor=1, no longer erroneously up-replicates the range to two replicas. Leases will also no longer thrash between nodes when perturbed with replication factor=1.
Will be closed on backport to 23.1, 23.2 and 24.1. |
The allocator would add a voter, instead of both adding and removing the existing voter when rebalancing ranges with one replica. Removing the leaseholder replica was not possible prior to cockroachdb#74077, so the addition only was necessary. This restriction is no longer necessary. Allow rebalancing a one voter range between stores using joint configurations, where the lease will be transferred to the incoming voter store, from the outgoing demoting voter. Scattering ranges with one voter will now leave the range with exactly one voter, where previously both the leaseholder voter evaluating the scatter, and the new voter would be left. Before this patch, scattering 1000 ranges with RF=1 on a 5 store cluster: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 1 | 1001 | ########## | 500 | ########## 5 | 291 | ### | 147 | ### 4 | 275 | ### | 137 | ### 3 | 229 | ### | 118 | ### 2 | 206 | ### | 99 | ## ``` After: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 2 | 242 | ########## | 241 | ########## 4 | 227 | ########## | 227 | ########## 5 | 217 | ######### | 216 | ######### 3 | 209 | ######### | 208 | ######### 1 | 106 | ##### | 109 | ##### ``` Fixes: cockroachdb#108420 Fixes: cockroachdb#124171 Release note (bug fix): Scattering a range with replication factor=1, no longer erroneously up-replicates the range to two replicas. Leases will also no longer thrash between nodes when perturbed with replication factor=1.
The allocator would add a voter, instead of both adding and removing the existing voter when rebalancing ranges with one replica. Removing the leaseholder replica was not possible prior to cockroachdb#74077, so the addition only was necessary. This restriction is no longer necessary. Allow rebalancing a one voter range between stores using joint configurations, where the lease will be transferred to the incoming voter store, from the outgoing demoting voter. Scattering ranges with one voter will now leave the range with exactly one voter, where previously both the leaseholder voter evaluating the scatter, and the new voter would be left. Before this patch, scattering 1000 ranges with RF=1 on a 5 store cluster: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 1 | 1001 | ########## | 500 | ########## 5 | 291 | ### | 147 | ### 4 | 275 | ### | 137 | ### 3 | 229 | ### | 118 | ### 2 | 206 | ### | 99 | ## ``` After: ``` store_id | replica_count | replica_distribution | lease_count | lease_distribution -----------+---------------+----------------------+-------------+--------------------- 2 | 242 | ########## | 241 | ########## 4 | 227 | ########## | 227 | ########## 5 | 217 | ######### | 216 | ######### 3 | 209 | ######### | 208 | ######### 1 | 106 | ##### | 109 | ##### ``` Fixes: cockroachdb#108420 Fixes: cockroachdb#124171 Release note (bug fix): Scattering a range with replication factor=1, no longer erroneously up-replicates the range to two replicas. Leases will also no longer thrash between nodes when perturbed with replication factor=1.
Describe the problem
When an
AdminScatterRequest
is issued against a range which has a replication factor of 1, the range to be scattered will add an additional replica before completing. This leaves the range with 2 replicas, one on the initial store and another on a random store.To Reproduce
Restore a database with RF=1 into a multi-node cluster.
A targeted reproduction using split, then scatter is shown below.
This results in the following lease and replica distribution after scattering:
Note how despite there being RF=1 and 1k ranges, there are 2k replicas, of which 1k remains on the original store (1). As a result of a replica always remaining on the replica where scatter was evaluated, that store also retains 1/2 the leases, leading to imbalanced lease counts which would affect restore processor planning.
Expected behavior
Scatter randomly places a replica and leaves the range with exactly 1 replica.
Environment:
Additional context
See #108420 for explanation why rebalancing a voter with RF=1 will first add a voter.
Jira issue: CRDB-38742
The text was updated successfully, but these errors were encountered: