
kvserver: transfer lease when acquiring lease outside preferences #106079

Closed

Conversation


@erikgrinaker erikgrinaker commented Jul 3, 2023

When a leaseholder is lost, any surviving replica may acquire the lease, even if it violates the lease preferences. There are two main reasons for this: we must first elect a new Raft leader, who then acquires the lease, and leader election is agnostic to lease preferences; and there may not be any surviving replicas that satisfy the lease preferences at all, so we don't want to keep the range unavailable while we figure that out (network timeouts, for example, can delay this by many seconds).

However, after acquiring a lease, we rely on the replicate queue to transfer the lease back to a replica that conforms with the preferences, which can take several minutes. In multi-region clusters, this can cause severe latency degradation if the lease is acquired in a remote region.

This patch detects lease preference violations when a replica acquires a new lease, and eagerly enqueues the replica in the replicate queue for transfer (if possible).
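A minimal standalone sketch of the idea, using hypothetical replica and replicateQueue types rather than the actual kvserver code (the onLeaseAcquired hook and violatesPreferences field are illustrative only):

package main

import (
	"context"
	"fmt"
)

// replica and replicateQueue are hypothetical stand-ins; only the control
// flow is meant to be illustrative.
type replica struct {
	rangeID             int
	violatesPreferences bool
}

type replicateQueue struct{ pending []*replica }

// MaybeAddAsync mimics eagerly handing a replica to the queue for processing.
func (q *replicateQueue) MaybeAddAsync(ctx context.Context, r *replica) {
	q.pending = append(q.pending, r)
}

// onLeaseAcquired is the hook run after a replica acquires a new lease. If the
// lease violates the range's lease preferences, the replica is enqueued
// immediately instead of waiting for the next replicate queue scanner pass.
func onLeaseAcquired(ctx context.Context, r *replica, q *replicateQueue) {
	if r.violatesPreferences {
		fmt.Printf("r%d: acquired lease violates preferences, enqueueing for transfer\n", r.rangeID)
		q.MaybeAddAsync(ctx, r)
	}
}

func main() {
	q := &replicateQueue{}
	onLeaseAcquired(context.Background(), &replica{rangeID: 588, violatesPreferences: true}, q)
	fmt.Println("queued replicas:", len(q.pending))
}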

Resolves #106100.
Epic: none

Release note (bug fix): When losing a leaseholder and using lease preferences, the lease can be acquired by any other replica (regardless of lease preferences) in order to restore availability as soon as possible. The new leaseholder will now immediately check if it violates the lease preferences, and attempt to transfer the lease to a replica that satisfies the preferences if possible.

@erikgrinaker erikgrinaker self-assigned this Jul 3, 2023

blathers-crl bot commented Jul 3, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the lease-preference-enqueue branch 4 times, most recently from 8b580e1 to 03ecf87 on July 4, 2023 13:46

erikgrinaker commented Jul 4, 2023

It turns out the replicate queue isn't doing anything when I enqueue these ranges, even though the current lease violates the preferences, and there is another replica that does satisfy the preferences.

I230704 14:03:17.081396 513 kv/kvserver/replica_proposal.go:395 ⋮ [T1,n5,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›,raft] 30256  new range lease repl=(n5,s5):4 seq=43 start=1688479397.072916182,0 epo=17 pro=1688479397.077098474,0 following repl=(n5,s5):4 seq=42 start=1688479397.072916182,0 exp=1688479403.072851248,0 pro=1688479397.072851248,0
I230704 14:03:17.081545 513 kv/kvserver/replica_proposal.go:559 ⋮ [T1,n5,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›,raft] 30261  acquired lease violates lease preferences, enqueueing for transfer [lease=repl=(n5,s5):4 seq=43 start=1688479397.072916182,0 epo=17 pro=1688479397.077098474,0 preferences=[{[‹+rack=0›]}]]
I230704 14:03:17.081591 513 kv/kvserver/replica_proposal.go:562 ⋮ [T1,n5,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›,raft] 30262  Add
I230704 14:03:17.236849 18410 kv/kvserver/queue.go:989 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30919  processing replica
I230704 14:03:17.236873 18410 kv/kvserver/queue.go:1023 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30920  processing...
I230704 14:03:17.237519 18410 kv/kvserver/replicate_queue.go:848 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30921  planned action=‹consider rebalance› op=plan.AllocationNoop
I230704 14:03:17.237542 18410 kv/kvserver/queue.go:1033 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30922  processing... done
I230704 14:03:17.237556 18410 kv/kvserver/queue.go:942 ⋮ [T1,n5,replicate,s5,r588/4:‹/Table/106/1/70{48830…-67259…}›] 30923  done 726.202µs

This is unexpected to me. ShouldPlanChange seems to expect the queue to transfer violating leases too, but it doesn't appear to actually do so.

// If the lease is valid, check to see if we should transfer it.
if canTransferLeaseFrom(ctx, repl) &&
	rp.allocator.ShouldTransferLease(
		ctx,
		rp.storePool,
		conf,
		voterReplicas,
		repl,
		repl.RangeUsageInfo(),
	) {
	log.KvDistribution.VEventf(ctx, 2, "lease transfer needed, enqueuing")
	return true, 0
}

@kvoli @andrewbaptist Can you help me figure out how this is all wired up? Where do we enforce lease preferences?


erikgrinaker commented Jul 4, 2023

Ok, so it'll fall through to attempting to shed leases here:

// No rebalance target was found, check whether we are able and should
// transfer the lease away to another store.
if !ok {
	if !canTransferLeaseFrom(ctx, repl) {
		return nil, stats, nil
	}
	return rp.shedLeaseTarget(
		ctx,
		repl,
		desc,
		conf,
		allocator.TransferLeaseOptions{
			Goal:                   allocator.FollowTheWorkload,
			ExcludeLeaseRepl:       false,
			CheckCandidateFullness: true,
		},
	), stats, nil
}

The reason it isn't transferring some of these leases is simply that we may not have received Raft leadership yet when we enqueue the replica (leadership will be transferred over from the old leaseholder), so the allocator excludes all other replicas as valid candidates because it can't determine whether they need a Raft snapshot:

candidates = append(validSnapshotCandidates, excludeReplicasInNeedOfSnapshots(
	ctx, status, leaseRepl.GetFirstIndex(), candidates)...)
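In other words, candidates are filtered on the leader's per-follower Raft progress: followers that are probing, waiting for a snapshot, or whose match index trails the leader's first log index are dropped. A simplified standalone sketch of such a filter, with hypothetical types rather than the real allocator signature:

package main

import "fmt"

// ProgressState mirrors the per-follower Raft tracker states a leader keeps.
type ProgressState int

const (
	StateReplicate ProgressState = iota // follower is caught up and replicating
	StateProbe                          // leader doesn't know the follower's position yet
	StateSnapshot                       // follower needs a Raft snapshot
)

type followerProgress struct {
	replicaID int
	state     ProgressState
	match     uint64 // highest log index known to be replicated on the follower
}

// excludeReplicasInNeedOfSnapshots drops candidates that may need a snapshot:
// anything probing, snapshotting, or matched below the leader's first index
// (entries before firstIndex are truncated and can only arrive via snapshot).
func excludeReplicasInNeedOfSnapshots(firstIndex uint64, candidates []followerProgress) []followerProgress {
	var ok []followerProgress
	for _, c := range candidates {
		if c.state == StateReplicate && c.match >= firstIndex {
			ok = append(ok, c)
		}
	}
	return ok
}

func main() {
	candidates := []followerProgress{
		{replicaID: 1, state: StateReplicate, match: 120},
		{replicaID: 2, state: StateProbe}, // we just became leader: no info yet
		{replicaID: 3, state: StateSnapshot, match: 50},
	}
	// Only replica 1 survives the filter; with every follower in StateProbe,
	// nothing would, which is exactly the situation described above.
	fmt.Println(excludeReplicasInNeedOfSnapshots(100, candidates))
}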

However, we do also enqueue it when we acquire leadership, so why isn't it transferring it then?

if becameLeader && r.store.replicateQueue != nil {
	r.store.replicateQueue.MaybeAddAsync(ctx, r, r.store.Clock().NowAsClockTimestamp())
}

@erikgrinaker
Contributor Author

However, we do also enqueue it when we acquire leadership, so why isn't it transferring it then?

First, we're hitting the MaybeAddAsync semaphore limit of 20.

if cfg.addOrMaybeAddSemSize == 0 {
	cfg.addOrMaybeAddSemSize = 20
}
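The limit is the usual buffered-channel-as-semaphore pattern in Go: each asynchronous add must acquire one of N slots, and once all slots are busy further adds are dropped. A standalone sketch of that pattern (not the actual queue code, whose exact behavior when full may differ):

package main

import (
	"fmt"
	"sync"
	"time"
)

// asyncAdder bounds how many asynchronous enqueue attempts may run at once,
// using a buffered channel as a counting semaphore.
type asyncAdder struct {
	sem chan struct{} // capacity = max concurrent adds
	wg  sync.WaitGroup
}

func newAsyncAdder(size int) *asyncAdder {
	return &asyncAdder{sem: make(chan struct{}, size)}
}

// maybeAddAsync runs fn in a goroutine if a semaphore slot is free and
// otherwise drops the request, mirroring a best-effort non-blocking add.
func (a *asyncAdder) maybeAddAsync(fn func()) bool {
	select {
	case a.sem <- struct{}{}: // acquire a slot
	default:
		return false // all slots busy: request dropped
	}
	a.wg.Add(1)
	go func() {
		defer a.wg.Done()
		defer func() { <-a.sem }() // release the slot
		fn()
	}()
	return true
}

func main() {
	a := newAsyncAdder(20)
	dropped := 0
	for i := 0; i < 100; i++ {
		if !a.maybeAddAsync(func() { time.Sleep(10 * time.Millisecond) }) {
			dropped++
		}
	}
	a.wg.Wait()
	// The tight loop outruns the sleeping workers, so typically ~20 adds are
	// accepted and the rest are dropped.
	fmt.Println("accepted:", 100-dropped, "dropped:", dropped)
}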

After increasing that, the MaybeAdd is failing because we just became leader, and haven't received any info from our followers yet, so they're all in StateProbe.

I suppose one approach here would be to just keep requeueing these replicas as long as their lease violates the preferences.

@erikgrinaker
Contributor Author

Just to confirm, when I disable the excludeReplicasInNeedOfSnapshots checks, leases immediately move back to their preferred regions with this patch.

@kvoli kvoli left a comment

@kvoli @andrewbaptist Can you help me figure out how this is all wired up? Where do we enforce lease preferences?

I think you found it. The lease is only checked for transfer in the ConsiderRebalance action, and only if there were no rebalance opportunities, effectively making lease preference enforcement a lower priority than rebalancing.

I suppose one approach here would be to just keep requeueing these replicas as long as their lease violates the preferences.

As we discussed earlier, we could add a check in PlanOneChange which returns a "leaseholder not leader" error marked as a purgatory error:

type PurgatoryError interface {

The replica will then be retried every 1 minute by default:

// replicateQueuePurgatoryCheckInterval is the interval at which replicas in
// the replicate queue purgatory are re-attempted. Note that these replicas
// may be re-attempted more frequently by the replicateQueue in case there are
// gossip updates that might affect allocation decisions.
replicateQueuePurgatoryCheckInterval = 1 * time.Minute
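Conceptually, the purgatory mechanism amounts to this: a processing failure marked as a purgatory error parks the replica, and a ticker re-attempts everything parked at a fixed interval. A simplified standalone sketch with hypothetical types, using a shortened interval for the demo (the real default quoted above is 1 minute):

package main

import (
	"errors"
	"fmt"
	"time"
)

// purgatoryError marks failures that should park the replica for periodic
// retries rather than dropping it from the queue. The marker method is a
// hypothetical stand-in for the real PurgatoryError interface.
type purgatoryError struct{ msg string }

func (e *purgatoryError) Error() string         { return e.msg }
func (e *purgatoryError) purgatoryErrorMarker() {}

type queue struct {
	purgatory map[int]error // rangeID -> last purgatory error
	process   func(rangeID int) error
}

// maybeProcess runs the processing function; purgatory errors park the range,
// success removes it.
func (q *queue) maybeProcess(rangeID int) {
	err := q.process(rangeID)
	if err == nil {
		delete(q.purgatory, rangeID)
		return
	}
	var pe *purgatoryError
	if errors.As(err, &pe) {
		q.purgatory[rangeID] = err // retried on the next purgatory tick
	}
}

// retryLoop re-attempts every parked range on each tick.
func (q *queue) retryLoop(interval time.Duration, ticks int) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for i := 0; i < ticks; i++ {
		<-t.C
		for rangeID := range q.purgatory {
			q.maybeProcess(rangeID)
		}
	}
}

func main() {
	attempts := 0
	q := &queue{
		purgatory: map[int]error{},
		process: func(rangeID int) error {
			attempts++
			if attempts < 3 {
				return &purgatoryError{"not raft leader yet"} // transient: park and retry
			}
			return nil // lease transfer succeeded
		},
	}
	q.maybeProcess(588)                 // initial attempt fails, range is parked
	q.retryLoop(10*time.Millisecond, 5) // retried until it succeeds
	fmt.Println("attempts:", attempts, "still in purgatory:", len(q.purgatory))
}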

Reviewed 7 of 7 files at r1, 7 of 7 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


-- commits line 2 at r1:
Should this commit be here?


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

	replicateQueueTimerDuration = 0 // zero duration to process replication greedily

	replicateQueuePriorityHigh = 1000 // see AllocatorAction.Priority

How did you settle on this priority? Is it deliberately the same as removing a dead voter?

case AllocatorReplaceDecommissioningVoter:
	return 5000
case AllocatorRemoveDeadVoter:
	return 1000
case AllocatorRemoveDecommissioningVoter:
	return 900

For context - resolving a violated constraint with the right number of voters+non-voters is done at AllocatorConsiderRebalance priority level, 0.

Previously, `ConjunctionsCheck` took a store descriptor as an input.
However, it only needed to know the store/node attributes and locality.
Some upcoming callers (lease acquisition) can't easily construct a full
store descriptor since the locking order would cause deadlocks.

This patch changes it to only take the attributes and locality instead
of the entire store descriptor, and renames it to `CheckConjunction()`.
It also adds `CheckStoreConjunction()` as a convenience method that
takes a store descriptor, and migrates all existing callers.

Epic: none
Release note: None
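In effect, the refactor narrows the check's input to exactly the fields it reads. A sketch with simplified types and matching logic (not the actual signatures or constraint semantics), where CheckConjunction takes only attributes and locality and CheckStoreConjunction is the convenience wrapper that unpacks a descriptor:

package main

import "fmt"

// StoreDescriptor is a simplified stand-in; the real descriptor carries much
// more (capacity, node membership, ...), none of which the check needs.
type StoreDescriptor struct {
	Attrs    []string
	Locality map[string]string
}

// CheckConjunction needs only attributes and locality, so callers that can't
// build a full store descriptor (e.g. during lease acquisition) can use it.
func CheckConjunction(attrs []string, locality map[string]string, required []string) bool {
	for _, want := range required {
		if !matches(attrs, locality, want) {
			return false
		}
	}
	return true
}

// CheckStoreConjunction is the convenience wrapper for callers that do hold a
// full descriptor, mirroring the migration described above.
func CheckStoreConjunction(store StoreDescriptor, required []string) bool {
	return CheckConjunction(store.Attrs, store.Locality, required)
}

func matches(attrs []string, locality map[string]string, want string) bool {
	for _, a := range attrs {
		if a == want {
			return true
		}
	}
	for k, v := range locality {
		if k+"="+v == want {
			return true
		}
	}
	return false
}

func main() {
	s := StoreDescriptor{Attrs: []string{"ssd"}, Locality: map[string]string{"region": "us-east1"}}
	fmt.Println(CheckStoreConjunction(s, []string{"region=us-east1", "ssd"})) // true
}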
This patch places replicas in the replicate queue purgatory when
it has a lease violating the lease preferences and it's unable to find a
suitable target. This causes the replica to be retried more often.

This will only trigger when replicas are eagerly enqueued (typically
when we acquire a new lease that violates preferences), since we
otherwise don't attempt to enqueue replicas when they don't have a valid
lease transfer target.

Epic: none
Release note: None
@erikgrinaker erikgrinaker force-pushed the lease-preference-enqueue branch from 03ecf87 to bb49030 on July 5, 2023 20:10
@erikgrinaker erikgrinaker left a comment

Placing the replicas in purgatory worked great, thanks for the tip! It actually succeeded in immediately moving all 1000 leases back to the preferred locations, because purgatory eagerly retries replicas when we add to it. Previously it would only manage about 800 leases or so; the remaining ~200 failed because they weren't the Raft leader yet.

I added a commit. Does the overall direction here make sense to you? If so, I'll clean this up and add some tests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @kvoli)


-- commits line 2 at r1:

Previously, kvoli (Austen) wrote…

Should this commit be here?

No, this was just added while testing because of #106097. Removed it now, since it repros more easily with ALTER RANGE RELOCATE anyway.


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, kvoli (Austen) wrote…

How did you settle on this priority? Is it deliberately the same as removing a dead voter?

case AllocatorReplaceDecommissioningVoter:
	return 5000
case AllocatorRemoveDeadVoter:
	return 1000
case AllocatorRemoveDecommissioningVoter:
	return 900

For context - resolving a violated constraint with the right number of voters+non-voters is done at AllocatorConsiderRebalance priority level, 0.

It was mostly arbitrary. I just wanted to make sure it didn't end up at the bottom of the pile, since this is pretty important.

@kvoli kvoli left a comment

Overall direction makes sense to me.

It may be worthwhile looking into the overhead of checking lease preferences during acquisition, and also if there's no "live" stores which satisfy the preference, so every replica ends up in purgatory. I've rarely seen more than 1-3 lease preferences per range, but the conjunction checks doing string comparisons might be noticeable?

Left a question on the priority.

Reviewed 10 of 10 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

It was mostly arbitrary. I just wanted to make sure it didn't end up at the bottom of the pile, since this is pretty important.

During voter up-replication (decommissioning, general upreplication, dead node), lease preferences may still not be enforced because they are sitting behind ranges which need to be up-replicated.

Is this fine? It is certainly better than before.

@erikgrinaker erikgrinaker left a comment

overhead of checking lease preferences during acquisition

I think this is negligible considering the other work we do during acquisition.

if there's no "live" stores which satisfy the preference, so every replica ends up in purgatory

Yeah, this is the bit I'm worried about. Consider e.g. someone setting a bogus +foo preference, which will put all replicas in purgatory forever. I may extend the check to at least see if the preferences can be satisfied at all (assuming all nodes are live), and maybe also to omit it if they can't be satisfied currently because all valid targets are unavailable. In the latter case I think it would be beneficial to put it in purgatory, to recover as soon as the nodes come back online, but it's certainly worth checking the overhead here -- and maybe even skip that bit in the backport.
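A sketch of what such a guard could look like, with hypothetical store and preference types (the real check would go through the allocator and store pool): it distinguishes preferences no store could ever satisfy, such as a bogus +foo, from preferences that are only unsatisfiable right now because the matching stores aren't live:

package main

import "fmt"

type store struct {
	id    int
	live  bool
	attrs map[string]bool // flattened attributes/locality tiers, e.g. "rack=0"
}

// satisfies reports whether a store matches every required attribute of a
// single lease preference (a conjunction).
func satisfies(s store, preference []string) bool {
	for _, req := range preference {
		if !s.attrs[req] {
			return false
		}
	}
	return true
}

// preferenceSatisfiable reports whether the preference could ever be satisfied
// (by any store in the cluster) and whether it is satisfiable right now (by a
// live store).
func preferenceSatisfiable(stores []store, preference []string) (ever, now bool) {
	for _, s := range stores {
		if satisfies(s, preference) {
			ever = true
			if s.live {
				now = true
			}
		}
	}
	return ever, now
}

func main() {
	stores := []store{
		{id: 1, live: false, attrs: map[string]bool{"rack=0": true}},
		{id: 2, live: true, attrs: map[string]bool{"rack=1": true}},
	}
	fmt.Println(preferenceSatisfiable(stores, []string{"rack=0"})) // true false: park and retry until the rack=0 store returns
	fmt.Println(preferenceSatisfiable(stores, []string{"foo"}))    // false false: bogus preference, skip purgatory entirely
}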

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, kvoli (Austen) wrote…

During voter up-replication (decommissioning, general upreplication, dead node), lease preferences may still not be enforced because they are sitting behind ranges which need to be up-replicated.

Is this fine? It is certainly better than before.

Yeah, I feel like we may want to move this to the top, considering lease transfers are cheap. Wdyt? On the other hand, we certainly don't want to starve out upreplication, since it leaves us vulnerable to quorum loss.

@kvoli kvoli left a comment

I think this is negligible considering the other work we do during acquisition.

Good to know.

I may extend the check to at least see if the preferences can be satisfied at all (assuming all nodes are live), and maybe also to omit it if they can't be satisfied currently because all valid targets are unavailable.

That is a good idea. Another case is where no existing voter store satisfies the preference, but some other store does. There is no logic that will rebalance replicas in order to then transfer the lease to satisfy a preference, so purgatory wouldn't be too helpful in that case.

// replicas that meet lease preferences (among the `existing` replicas).
func (a Allocator) PreferredLeaseholders(
	storePool storepool.AllocatorStorePool,

Perhaps we could scope the purgatory criteria to just when there are replicas in need of a snapshot?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yeah, I feel like we may want to move this to the top, considering lease transfers are cheap. Wdyt? On the other hand, we certainly don't want to starve out upreplication, since it leaves us vulnerable to quorum loss.

It should be cheap and quick normally, but this feels like a change that requires more extensive testing if pursued. I don't think we can reason that it will be fine without trying to break it.

Do you plan to backport this part of the change?

@erikgrinaker erikgrinaker left a comment

Perhaps we could scope the purgatory criteria to just when there are replicas in need of a snapshot?

I started with that, but the plumbing got a bit annoying, and I figured we'd want to eagerly try to get them back to the preferred regions as soon as they recovered anyway. But it might make sense for a backport, and we can be more eager for 23.2.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, kvoli (Austen) wrote…

It should be cheap and quick normally, but this feels like a change that requires more extensive testing if pursued. I don't think we can reason that it will be fine without trying to break it.

Do you plan to backport this part of the change?

I was, but we can also use priority 0 if we feel like that's safer, and bump the priority for 23.2. Probably better.

@kvoli kvoli left a comment

Commented on the line where the purgatory error is created.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replicate_queue.go line 84 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I was, but we can also use priority 0 if we feel like that's safer, and bump the priority for 23.2. Probably better.

Using priority 0 SGTM. That is the priority which is currently assigned to lease preference violations.


pkg/kv/kvserver/allocator/plan/replicate.go line 952 at r6 (raw file):

		// leadership, which prevents us from finding appropriate lease targets
		// since we can't determine if any are behind.
		if repl.LeaseViolatesPreferences(ctx) {

Continuing the discussion about purgatory criteria here.

An alternative could be:

Checking repl.LeaseViolatesPreferences and most of allocator.leaseholderShouldMoveDueToPreferences:

// leaseholderShouldMoveDueToPreferences returns true if the current leaseholder
// is in violation of lease preferences _that can otherwise be satisfied_ by
// some existing replica.

We don't want to include the filtering on replicas:

Suspect replicas:

// Exclude suspect/draining/dead stores.
candidates, _ := storePool.LiveAndDeadReplicas(
	allExistingReplicas, false, /* includeSuspectAndDrainingStores */
)

Replicas in need of a snapshot:

preferred = excludeReplicasInNeedOfSnapshots(
	ctx, leaseRepl.RaftStatus(), leaseRepl.GetFirstIndex(), preferred)
if len(preferred) == 0 {
	return false
}

Excluding draining/dead stores but not suspect stores is doable but annoying using the storepool.

There's only the option to exclude both suspect and draining, but not just draining. Perhaps it would be fine to include draining stores in the criteria or ignore both.

So a purgatory error is returned when the lease violates the preferences and there is an existing voter replica on a store which is not dead/suspect/draining/unknown (or just dead/draining/unknown, see above) and which satisfies the preferences.
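Put together, the proposed criterion might look roughly like this (a sketch with hypothetical types, leaving out the store pool plumbing discussed above):

package main

import "fmt"

type storeHealth int

const (
	healthy storeHealth = iota
	suspect
	draining
	dead
	unknown
)

type voter struct {
	storeID          int
	health           storeHealth
	meetsPreferences bool
}

// shouldReturnPurgatoryError reports whether a replica whose lease violates
// the preferences should be parked in purgatory: only if some existing voter
// sits on a healthy store that satisfies the preferences (treating suspect
// stores as ineligible; the alternative above would allow them).
func shouldReturnPurgatoryError(leaseViolatesPreferences bool, voters []voter) bool {
	if !leaseViolatesPreferences {
		return false
	}
	for _, v := range voters {
		if v.meetsPreferences && v.health == healthy {
			return true
		}
	}
	return false
}

func main() {
	voters := []voter{
		{storeID: 1, health: healthy, meetsPreferences: false}, // current leaseholder
		{storeID: 2, health: suspect, meetsPreferences: true},  // preferred, but suspect
		{storeID: 3, health: healthy, meetsPreferences: true},  // viable target: park and retry
	}
	fmt.Println(shouldReturnPurgatoryError(true, voters)) // true
}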

We could add an option to allocator.ShouldTransferLease which disables checking for replicas in need of a snapshot (not ideal but seems simple):

func (a *Allocator) ShouldTransferLease(

@erikgrinaker
Contributor Author

Closing in favor of #107507.

@erikgrinaker erikgrinaker deleted the lease-preference-enqueue branch November 14, 2023 10:38