
kvserver: transfer lease when acquiring lease outside preferences #106079

Closed

Conversation


@erikgrinaker erikgrinaker commented Jul 3, 2023

When a leaseholder is lost, any surviving replica may acquire the lease, even if it violates the lease preferences. There are two main reasons for this: we must first elect a new Raft leader, who then acquires the lease, and leader election is agnostic to lease preferences; and there may not be any surviving replicas that satisfy the lease preferences at all, so we don't want to keep the range unavailable while we figure that out (network timeouts, for example, can delay this by many seconds).

However, after acquiring a lease, we rely on the replicate queue to transfer the lease back to a replica that conforms with the preferences, which can take several minutes. In multi-region clusters, this can cause severe latency degradation if the lease is acquired in a remote region.

This patch detects lease preference violations when a replica acquires a new lease, and eagerly enqueues the replica in the replicate queue for transfer (if possible).
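A minimal standalone sketch of the idea, using hypothetical replica and replicateQueue types rather than the actual kvserver code (the onLeaseAcquired hook and violatesPreferences field are illustrative only):

package main

import (
	"context"
	"fmt"
)

// replica and replicateQueue are hypothetical stand-ins; only the control
// flow is meant to be illustrative.
type replica struct {
	rangeID             int
	violatesPreferences bool
}

type replicateQueue struct{ pending []*replica }

// MaybeAddAsync mimics eagerly handing a replica to the queue for processing.
func (q *replicateQueue) MaybeAddAsync(ctx context.Context, r *replica) {
	q.pending = append(q.pending, r)
}

// onLeaseAcquired is the hook run after a replica acquires a new lease. If the
// lease violates the range's lease preferences, the replica is enqueued
// immediately instead of waiting for the next replicate queue scanner pass.
func onLeaseAcquired(ctx context.Context, r *replica, q *replicateQueue) {
	if r.violatesPreferences {
		fmt.Printf("r%d: acquired lease violates preferences, enqueueing for transfer\n", r.rangeID)
		q.MaybeAddAsync(ctx, r)
	}
}

func main() {
	q := &replicateQueue{}
	onLeaseAcquired(context.Background(), &replica{rangeID: 588, violatesPreferences: true}, q)
	fmt.Println("queued replicas:", len(q.pending))
}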

Resolves #106100.
Epic: none

Release note (bug fix): When losing a leaseholder and using lease preferences, the lease can be acquired by any other replica (regardless of lease preferences) in order to restore availability as soon as possible. The new leaseholder will now immediately check if it violates the lease preferences, and attempt to transfer the lease to a replica that satisfies the preferences if possible.

@erikgrinaker erikgrinaker self-assigned this Jul 3, 2023

blathers-crl bot commented Jul 3, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the lease-preference-enqueue branch 4 times, most recently from 8b580e1 to 03ecf87 on July 4, 2023 13:46

erikgrinaker commented Jul 4, 2023

It turns out the replicate queue isn't doing anything when I enqueue these ranges, even though the current lease violates the preferences, and there is another replica that does satisfy the preferences.

I230704 14:03:17.081396 513 kv/kvserver/replica_proposal.go:395 ⋮ [T1,n5,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›,raft] 30256  new range lease repl=(n5,s5):4 seq=43 start=1688479397.072916182,0 epo=17 pro=1688479397.077098474,0 following repl=(n5,s5):4 seq=42 start=1688479397.072916182,0 exp=1688479403.072851248,0 pro=1688479397.072851248,0
I230704 14:03:17.081545 513 kv/kvserver/replica_proposal.go:559 ⋮ [T1,n5,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›,raft] 30261  acquired lease violates lease preferences, enqueueing for transfer [lease=repl=(n5,s5):4 seq=43 start=1688479397.072916182,0 epo=17 pro=1688479397.077098474,0 preferences=[{[‹+rack=0›]}]]
I230704 14:03:17.081591 513 kv/kvserver/replica_proposal.go:562 ⋮ [T1,n5,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›,raft] 30262  Add
I230704 14:03:17.236849 18410 kv/kvserver/queue.go:989 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30919  processing replica
I230704 14:03:17.236873 18410 kv/kvserver/queue.go:1023 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30920  processing...
I230704 14:03:17.237519 18410 kv/kvserver/replicate_queue.go:848 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30921  planned action=‹consider rebalance› op=plan.AllocationNoop
I230704 14:03:17.237542 18410 kv/kvserver/queue.go:1033 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30922  processing... done
I230704 14:03:17.237556 18410 kv/kvserver/queue.go:942 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30923  done 726.202µs

This is unexpected to me. ShouldPlanChange seems to expect the queue to transfer violating leases too, but it doesn't appear to actually do so.

// If the lease is valid, check to see if we should transfer it.
if canTransferLeaseFrom(ctx, repl) &&
	rp.allocator.ShouldTransferLease(
		ctx,
		rp.storePool,
		conf,
		voterReplicas,
		repl,
		repl.RangeUsageInfo(),
	) {
	log.KvDistribution.VEventf(ctx, 2, "lease transfer needed, enqueuing")
	return true, 0
}

@kvoli @andrewbaptist Can you help me figure out how this is all wired up? Where do we enforce lease preferences?


erikgrinaker commented Jul 4, 2023

Ok, so it'll fall through to attempting to shed leases here:

// No rebalance target was found, check whether we are able and should
// transfer the lease away to another store.
if !ok {
	if !canTransferLeaseFrom(ctx, repl) {
		return nil, stats, nil
	}
	return rp.shedLeaseTarget(
		ctx,
		repl,
		desc,
		conf,
		allocator.TransferLeaseOptions{
			Goal:                   allocator.FollowTheWorkload,
			ExcludeLeaseRepl:       false,
			CheckCandidateFullness: true,
		},
	), stats, nil
}

The reason it isn't transferring some of these leases is simply that we may not have received Raft leadership yet when we enqueue the replica (leadership will be transferred over from the old leaseholder), so the allocator excludes all other replicas as valid candidates because it can't determine whether they need a Raft snapshot:

candidates = append(validSnapshotCandidates, excludeReplicasInNeedOfSnapshots(
	ctx, status, leaseRepl.GetFirstIndex(), candidates)...)
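In other words, candidates are filtered on the leader's per-follower Raft progress: followers that are probing, waiting for a snapshot, or whose match index trails the leader's first log index are dropped. A simplified standalone sketch of such a filter, with hypothetical types rather than the real allocator signature:

package main

import "fmt"

// ProgressState mirrors the per-follower Raft tracker states a leader keeps.
type ProgressState int

const (
	StateReplicate ProgressState = iota // follower is caught up and replicating
	StateProbe                          // leader doesn't know the follower's position yet
	StateSnapshot                       // follower needs a Raft snapshot
)

type followerProgress struct {
	replicaID int
	state     ProgressState
	match     uint64 // highest log index known to be replicated on the follower
}

// excludeReplicasInNeedOfSnapshots drops candidates that may need a snapshot:
// anything probing, snapshotting, or matched below the leader's first index
// (entries before firstIndex are truncated and can only arrive via snapshot).
func excludeReplicasInNeedOfSnapshots(firstIndex uint64, candidates []followerProgress) []followerProgress {
	var ok []followerProgress
	for _, c := range candidates {
		if c.state == StateReplicate && c.match >= firstIndex {
			ok = append(ok, c)
		}
	}
	return ok
}

func main() {
	candidates := []followerProgress{
		{replicaID: 1, state: StateReplicate, match: 120},
		{replicaID: 2, state: StateProbe}, // we just became leader: no info yet
		{replicaID: 3, state: StateSnapshot, match: 50},
	}
	// Only replica 1 survives the filter; with every follower in StateProbe,
	// nothing would, which is exactly the situation described above.
	fmt.Println(excludeReplicasInNeedOfSnapshots(100, candidates))
}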

However, we do also enqueue it when we acquire leadership, so why isn't it transferring it then?

if becameLeader && r.store.replicateQueue != nil {
	r.store.replicateQueue.MaybeAddAsync(ctx, r, r.store.Clock().NowAsClockTimestamp())
}

@erikgrinaker
Contributor Author

However, we do also enqueue it when we acquire leadership, so why isn't it transferring it then?

First, we're hitting the MaybeAddAsync semaphore limit of 20.

if cfg.addOrMaybeAddSemSize == 0 {
	cfg.addOrMaybeAddSemSize = 20
}
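The limit is the usual buffered-channel-as-semaphore pattern in Go: each asynchronous add must acquire one of N slots, and once all slots are busy further adds are dropped. A standalone sketch of that pattern (not the actual queue code, whose exact behavior when full may differ):

package main

import (
	"fmt"
	"sync"
	"time"
)

// asyncAdder bounds how many asynchronous enqueue attempts may run at once,
// using a buffered channel as a counting semaphore.
type asyncAdder struct {
	sem chan struct{} // capacity = max concurrent adds
	wg  sync.WaitGroup
}

func newAsyncAdder(size int) *asyncAdder {
	return &asyncAdder{sem: make(chan struct{}, size)}
}

// maybeAddAsync runs fn in a goroutine if a semaphore slot is free and
// otherwise drops the request, mirroring a best-effort non-blocking add.
func (a *asyncAdder) maybeAddAsync(fn func()) bool {
	select {
	case a.sem <- struct{}{}: // acquire a slot
	default:
		return false // all slots busy: request dropped
	}
	a.wg.Add(1)
	go func() {
		defer a.wg.Done()
		defer func() { <-a.sem }() // release the slot
		fn()
	}()
	return true
}

func main() {
	a := newAsyncAdder(20)
	dropped := 0
	for i := 0; i < 100; i++ {
		if !a.maybeAddAsync(func() { time.Sleep(10 * time.Millisecond) }) {
			dropped++
		}
	}
	a.wg.Wait()
	// The tight loop outruns the sleeping workers, so typically ~20 adds are
	// accepted and the rest are dropped.
	fmt.Println("accepted:", 100-dropped, "dropped:", dropped)
}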

After increasing that, the MaybeAdd is failing because we just became leader, and haven't received any info from our followers yet, so they're all in StateProbe.

I suppose one approach here would be to just keep requeueing these replicas as long as their lease violates the preferences.

@erikgrinaker
Contributor Author

Just to confirm, when I disable the excludeReplicasInNeedOfSnapshots checks, leases immediately move back to their preferred regions with this patch.

@kvoli kvoli left a comment

@kvoli @andrewbaptist Can you help me figure out how this is all wired up? Where do we enforce lease preferences?

I think you found it. The lease is only checked for transfer in the ConsiderRebalance action, and only if there were no rebalance opportunities, effectively making lease preference enforcement a lower priority than rebalancing.

I suppose one approach here would be to just keep requeueing these replicas as long as their lease violates the preferences.

As we discussed earlier, we could add a check in PlanOneChange which returns a "leaseholder not leader" error marked as a purgatory error:

type PurgatoryError interface {

The replica will then be retried every 1 minute by default:

// replicateQueuePurgatoryCheckInterval is the interval at which replicas in
// the replicate queue purgatory are re-attempted. Note that these replicas
// may be re-attempted more frequently by the replicateQueue in case there are
// gossip updates that might affect allocation decisions.
replicateQueuePurgatoryCheckInterval = 1 * time.Minute
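Conceptually, the purgatory mechanism amounts to this: a processing failure marked as a purgatory error parks the replica, and a ticker re-attempts everything parked at a fixed interval. A simplified standalone sketch with hypothetical types, using a shortened interval for the demo (the real default quoted above is 1 minute):

package main

import (
	"errors"
	"fmt"
	"time"
)

// purgatoryError marks failures that should park the replica for periodic
// retries rather than dropping it from the queue. The marker method is a
// hypothetical stand-in for the real PurgatoryError interface.
type purgatoryError struct{ msg string }

func (e *purgatoryError) Error() string         { return e.msg }
func (e *purgatoryError) purgatoryErrorMarker() {}

type queue struct {
	purgatory map[int]error // rangeID -> last purgatory error
	process   func(rangeID int) error
}

// maybeProcess runs the processing function; purgatory errors park the range,
// success removes it.
func (q *queue) maybeProcess(rangeID int) {
	err := q.process(rangeID)
	if err == nil {
		delete(q.purgatory, rangeID)
		return
	}
	var pe *purgatoryError
	if errors.As(err, &pe) {
		q.purgatory[rangeID] = err // retried on the next purgatory tick
	}
}

// retryLoop re-attempts every parked range on each tick.
func (q *queue) retryLoop(interval time.Duration, ticks int) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for i := 0; i < ticks; i++ {
		<-t.C
		for rangeID := range q.purgatory {
			q.maybeProcess(rangeID)
		}
	}
}

func main() {
	attempts := 0
	q := &queue{
		purgatory: map[int]error{},
		process: func(rangeID int) error {
			attempts++
			if attempts < 3 {
				return &purgatoryError{"not raft leader yet"} // transient: park and retry
			}
			return nil // lease transfer succeeded
		},
	}
	q.maybeProcess(588)                 // initial attempt fails, range is parked
	q.retryLoop(10*time.Millisecond, 5) // retried until it succeeds
	fmt.Println("attempts:", attempts, "still in purgatory:", len(q.purgatory))
}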

Reviewed 7 of 7 files at r1, 7 of 7 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


-- commits line 2 at r1:
Should this commit be here?


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

	replicateQueueTimerDuration = 0 // zero duration to process replication greedily

	replicateQueuePriorityHigh = 1000 // see AllocatorAction.Priority

How did you settle on this priority? Is it deliberately the same as removing a dead voter?

case AllocatorReplaceDecommissioningVoter:
	return 5000
case AllocatorRemoveDeadVoter:
	return 1000
case AllocatorRemoveDecommissioningVoter:
	return 900

For context - resolving a violated constraint with the right number of voters+non-voters is done at AllocatorConsiderRebalance priority level, 0.

Previously, `ConjunctionsCheck` took a store descriptor as an input.
However, it only needed to know the store/node attributes and locality.
Some upcoming callers (lease acquisition) can't easily construct a full
store descriptor since the locking order would cause deadlocks.

This patch changes it to only take the attributes and locality instead
of the entire store descriptor, and renames it to `CheckConjunction()`.
It also adds `CheckStoreConjunction()` as a convenience method that
takes a store descriptor, and migrates all existing callers.

Epic: none
Release note: None
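In effect, the refactor narrows the check's input to exactly the fields it reads. A sketch with simplified types and matching logic (not the actual signatures or constraint semantics), where CheckConjunction takes only attributes and locality and CheckStoreConjunction is the convenience wrapper that unpacks a descriptor:

package main

import "fmt"

// StoreDescriptor is a simplified stand-in; the real descriptor carries much
// more (capacity, node membership, ...), none of which the check needs.
type StoreDescriptor struct {
	Attrs    []string
	Locality map[string]string
}

// CheckConjunction needs only attributes and locality, so callers that can't
// build a full store descriptor (e.g. during lease acquisition) can use it.
func CheckConjunction(attrs []string, locality map[string]string, required []string) bool {
	for _, want := range required {
		if !matches(attrs, locality, want) {
			return false
		}
	}
	return true
}

// CheckStoreConjunction is the convenience wrapper for callers that do hold a
// full descriptor, mirroring the migration described above.
func CheckStoreConjunction(store StoreDescriptor, required []string) bool {
	return CheckConjunction(store.Attrs, store.Locality, required)
}

func matches(attrs []string, locality map[string]string, want string) bool {
	for _, a := range attrs {
		if a == want {
			return true
		}
	}
	for k, v := range locality {
		if k+"="+v == want {
			return true
		}
	}
	return false
}

func main() {
	s := StoreDescriptor{Attrs: []string{"ssd"}, Locality: map[string]string{"region": "us-east1"}}
	fmt.Println(CheckStoreConjunction(s, []string{"region=us-east1", "ssd"})) // true
}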
This patch places replicas in the replicate queue purgatory when
it has a lease violating the lease preferences and it's unable to find a
suitable target. This causes the replica to be retried more often.

This will only trigger when replicas are eagerly enqueued (typically
when we acquire a new lease that violates preferences), since we
otherwise don't attempt to enqueue replicas when they don't have a valid
lease transfer target.

Epic: none
Release note: None
@erikgrinaker erikgrinaker force-pushed the lease-preference-enqueue branch from 03ecf87 to bb49030 on July 5, 2023 20:10
@erikgrinaker erikgrinaker left a comment

Placing the replicas in purgatory worked great, thanks for the tip! It actually succeeded in immediately moving all 1000 leases back to the preferred locations, because purgatory eagerly retries replicas when we add to it. Previously it would only manage about 800 leases or so; the remaining ~200 failed because they weren't the Raft leader yet.

I added a commit. Does the overall direction here make sense to you? If so, I'll clean this up and add some tests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @kvoli)


-- commits line 2 at r1:

Previously, kvoli (Austen) wrote…

Should this commit be here?

No, this was just added while testing because of #106097. Removed it now, since it repros more easily with ALTER RANGE RELOCATE anyway.


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, kvoli (Austen) wrote…

How did you settle on this priority? Is it deliberately the same as removing a dead voter?

case AllocatorReplaceDecommissioningVoter:
	return 5000
case AllocatorRemoveDeadVoter:
	return 1000
case AllocatorRemoveDecommissioningVoter:
	return 900

For context - resolving a violated constraint with the right number of voters+non-voters is done at AllocatorConsiderRebalance priority level, 0.

It was mostly arbitrary. I just wanted to make sure it didn't end up at the bottom of the pile, since this is pretty important.

@kvoli kvoli left a comment

Overall direction makes sense to me.

It may be worthwhile looking into the overhead of checking lease preferences during acquisition, and also if there's no "live" stores which satisfy the preference, so every replica ends up in purgatory. I've rarely seen more than 1-3 lease preferences per range, but the conjunction checks doing string comparisons might be noticeable?

Left a question on the priority.

Reviewed 10 of 10 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

It was mostly arbitrary. I just wanted to make sure it didn't end up at the bottom of the pile, since this is pretty important.

During voter up-replication (decommissioning, general upreplication, dead node), lease preferences may still not be enforced because they are sitting behind ranges which need to be up-replicated.

Is this fine? It is certainly better than before.

@erikgrinaker erikgrinaker left a comment

overhead of checking lease preferences during acquisition

I think this is negligible considering the other work we do during acquisition.

if there's no "live" stores which satisfy the preference, so every replica ends up in purgatory

Yeah, this is the bit I'm worried about. Consider e.g. someone setting a bogus +foo preference, which will put all replicas in purgatory forever. I may extend the check to at least see if the preferences can be satisfied at all (assuming all nodes are live), and maybe also to omit it if they can't be satisfied currently because all valid targets are unavailable. In the latter case I think it would be beneficial to put it in purgatory, to recover as soon as the nodes come back online, but it's certainly worth checking the overhead here -- and maybe even skip that bit in the backport.
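A sketch of what such a guard could look like, with hypothetical store and preference types (the real check would go through the allocator and store pool): it distinguishes preferences no store could ever satisfy, such as a bogus +foo, from preferences that are only unsatisfiable right now because the matching stores aren't live:

package main

import "fmt"

type store struct {
	id    int
	live  bool
	attrs map[string]bool // flattened attributes/locality tiers, e.g. "rack=0"
}

// satisfies reports whether a store matches every required attribute of a
// single lease preference (a conjunction).
func satisfies(s store, preference []string) bool {
	for _, req := range preference {
		if !s.attrs[req] {
			return false
		}
	}
	return true
}

// preferenceSatisfiable reports whether the preference could ever be satisfied
// (by any store in the cluster) and whether it is satisfiable right now (by a
// live store).
func preferenceSatisfiable(stores []store, preference []string) (ever, now bool) {
	for _, s := range stores {
		if satisfies(s, preference) {
			ever = true
			if s.live {
				now = true
			}
		}
	}
	return ever, now
}

func main() {
	stores := []store{
		{id: 1, live: false, attrs: map[string]bool{"rack=0": true}},
		{id: 2, live: true, attrs: map[string]bool{"rack=1": true}},
	}
	fmt.Println(preferenceSatisfiable(stores, []string{"rack=0"})) // true false: park and retry until the rack=0 store returns
	fmt.Println(preferenceSatisfiable(stores, []string{"foo"}))    // false false: bogus preference, skip purgatory entirely
}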

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, kvoli (Austen) wrote…

During voter up-replication (decommissioning, general upreplication, dead node), lease preferences may still not be enforced because they are sitting behind ranges which need to be up-replicated.

Is this fine? It is certainly better than before.

Yeah, I feel like we may want to move this to the top, considering lease transfers are cheap. Wdyt? On the other hand, we certainly don't want to starve out upreplication, since it leaves us vulnerable to quorum loss.

@kvoli kvoli left a comment

I think this is negligible considering the other work we do during acquisition.

Good to know.

I may extend the check to at least see if the preferences can be satisfied at all (assuming all nodes are live), and maybe also to omit it if they can't be satisfied currently because all valid targets are unavailable.

That is a good idea. Another case is where no existing voter store satisfies the preference, but some other store does. There is no logic that will rebalance replicas in order to then transfer the lease to satisfy a preference, so purgatory wouldn't be too helpful in that case.

// replicas that meet lease preferences (among the `existing` replicas).
func (a Allocator) PreferredLeaseholders(
	storePool storepool.AllocatorStorePool,

Perhaps we could scope the purgatory criteria to just when there are replicas in need of a snapshot?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yeah, I feel like we may want to move this to the top, considering lease transfers are cheap. Wdyt? On the other hand, we certainly don't want to starve out upreplication, since it leaves us vulnerable to quorum loss.

It should be cheap and quick normally, but this feels like a change that requires more extensive testing if pursued. I don't think we can reason that it will be fine without trying to break it.

Do you plan to backport this part of the change?

@erikgrinaker erikgrinaker left a comment

Perhaps we could scope the purgatory criteria to just when there are replicas in need of a snapshot?

I started with that, but the plumbing got a bit annoying, and I figured we'd want to eagerly try to get them back to the preferred regions as soon as they recovered anyway. But it might make sense for a backport, and we can be more eager for 23.2.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, kvoli (Austen) wrote…

It should be cheap and quick normally, but this feels like a change that requires more extensive testing if pursued. I don't think we can reason that it will be fine without trying to break it.

Do you plan to backport this part of the change?

I was, but we can also use priority 0 if we feel like that's safer, and bump the priority for 23.2. Probably better.

@kvoli kvoli left a comment

Commented on the line where the purgatory error is created.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I was, but we can also use priority 0 if we feel like that's safer, and bump the priority for 23.2. Probably better.

Using priority 0 SGTM. That is the priority which is currently assigned to lease preference violations.


pkg/kv/kvserver/allocator/plan/replicate.go line 952 at r6 (raw file):

		// leadership, which prevents us from finding appropriate lease targets
		// since we can't determine if any are behind.
		if repl.LeaseViolatesPreferences(ctx) {

Continuing the discussion about purgatory criteria here.

An alternative could be:

Checking repl.LeaseViolatesPreferences and most of allocator.leaseholderShouldMoveDueToPreferences:

// leaseholderShouldMoveDueToPreferences returns true if the current leaseholder
// is in violation of lease preferences _that can otherwise be satisfied_ by
// some existing replica.

We don't want to include the filtering on replicas:

Suspect replicas:

// Exclude suspect/draining/dead stores.
candidates, _ := storePool.LiveAndDeadReplicas(
	allExistingReplicas, false, /* includeSuspectAndDrainingStores */
)

Replicas in need of a snapshot:

preferred = excludeReplicasInNeedOfSnapshots(
	ctx, leaseRepl.RaftStatus(), leaseRepl.GetFirstIndex(), preferred)
if len(preferred) == 0 {
	return false
}

Excluding draining/dead stores but not suspect stores is doable but annoying using the storepool.

There's only the option to exclude both suspect and draining, but not just draining. Perhaps it would be fine to include draining stores in the criteria or ignore both.

So a purgatory error is returned when the lease violates the preferences and there is an existing voter replica on a store which is not dead/suspect/draining/unknown (or just dead/draining/unknown, see above) and which satisfies the preferences.
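Put together, the proposed criterion might look roughly like this (a sketch with hypothetical types, leaving out the store pool plumbing discussed above):

package main

import "fmt"

type storeHealth int

const (
	healthy storeHealth = iota
	suspect
	draining
	dead
	unknown
)

type voter struct {
	storeID          int
	health           storeHealth
	meetsPreferences bool
}

// shouldReturnPurgatoryError reports whether a replica whose lease violates
// the preferences should be parked in purgatory: only if some existing voter
// sits on a healthy store that satisfies the preferences (treating suspect
// stores as ineligible; the alternative above would allow them).
func shouldReturnPurgatoryError(leaseViolatesPreferences bool, voters []voter) bool {
	if !leaseViolatesPreferences {
		return false
	}
	for _, v := range voters {
		if v.meetsPreferences && v.health == healthy {
			return true
		}
	}
	return false
}

func main() {
	voters := []voter{
		{storeID: 1, health: healthy, meetsPreferences: false}, // current leaseholder
		{storeID: 2, health: suspect, meetsPreferences: true},  // preferred, but suspect
		{storeID: 3, health: healthy, meetsPreferences: true},  // viable target: park and retry
	}
	fmt.Println(shouldReturnPurgatoryError(true, voters)) // true
}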

We could add an option to allocator.ShouldTransferLease which disables checking for replicas in need of a snapshot (not ideal but seems simple):

func (a *Allocator) ShouldTransferLease(

@erikgrinaker
Contributor Author

Closing in favor of #107507.

@erikgrinaker erikgrinaker deleted the lease-preference-enqueue branch November 14, 2023 10:38