kvserver: transfer lease when acquiring lease outside preferences #106079
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 8b580e1 to 03ecf87.
It turns out the replicate queue isn't doing anything when I enqueue these ranges, even though the current lease violates the preferences, and there is another replica that does satisfy the preferences.
This is unexpected to me. See cockroach/pkg/kv/kvserver/allocator/plan/replicate.go, lines 199 to 211 in db9239c.
@kvoli @andrewbaptist Can you help me figure out how this is all wired up? Where do we enforce lease preferences?
Ok, so it'll fall through to attempting to shed leases here: cockroach/pkg/kv/kvserver/allocator/plan/replicate.go, lines 865 to 883 in db9239c.
The reason it isn't transferring the lease for some of these replicas is simply that we may not have received Raft leadership yet when we enqueue the replica (leadership will be transferred over from the old leaseholder), so the allocator omits all other replicas as valid candidates, because it can't determine whether they'd need a Raft snapshot: cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go, lines 2033 to 2034 in e6e86a8.
However, we do also enqueue it when we acquire leadership, so why isn't it transferring it then? cockroach/pkg/kv/kvserver/replica_raft.go, lines 1044 to 1046 in 0a1f8ff.
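To make that failure mode concrete, here is a minimal sketch, in terms of the etcd/raft status rather than CockroachDB's actual allocator code, of the kind of guard that leaves the candidate list empty until this replica has picked up Raft leadership. The function, its shape, and the import paths are illustrative assumptions.

```go
import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// Sketch only (not CockroachDB's code): why every other replica is dropped as
// a lease-transfer candidate while this replica is not yet the Raft leader.
func transferCandidates(status *raft.Status, replicaIDs []uint64) []uint64 {
	// Without leadership we have no follower Progress, so we cannot tell
	// whether a candidate would first need a Raft snapshot; the allocator
	// therefore treats every other replica as ineligible.
	if status == nil || status.RaftState != raft.StateLeader {
		return nil
	}
	var out []uint64
	for _, id := range replicaIDs {
		if pr, ok := status.Progress[id]; ok && pr.State == tracker.StateReplicate {
			out = append(out, id)
		}
	}
	return out
}
```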
First, we're hitting the limit at cockroach/pkg/kv/kvserver/queue.go, lines 467 to 469 in 2b91c38. After increasing that, […]. I suppose one approach here would be to just keep requeueing these replicas as long as their lease violates the preferences.
Just to confirm, when I disable the […]
@kvoli @andrewbaptist Can you help me figure out how this is all wired up? Where do we enforce lease preferences?
I think you found it. The lease is only checked for transfer in the ConsiderRebalance action, and only if there were no rebalance opportunities, effectively making lease preference enforcement a lower priority than rebalancing.
I suppose one approach here would be to just keep requeueing these replicas as long as their lease violates the preferences.
As we discussed earlier, we could add a check in PlanOneChange which returns a leaseholder-not-leader error marked as a purgatory error:
cockroach/pkg/kv/kvserver/queue.go, line 119 in 972c379: `type PurgatoryError interface {`
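For illustration, a minimal sketch of what such an error could look like. The error type is hypothetical, and the `PurgatoryErrorMarker()` method name is an assumption about the interface's shape; the actual patch may differ.

```go
// Hypothetical sketch, not the actual patch: an error the planner could return
// when the lease violates the preferences but no transfer target is usable yet.
type leasePreferencesPurgatoryError struct{}

func (leasePreferencesPurgatoryError) Error() string {
	return "lease violates preferences; no usable transfer target yet"
}

// Implementing the queue's PurgatoryError marker interface (method name
// assumed) is what parks the replica in purgatory for retry instead of
// dropping it from the replicate queue.
func (leasePreferencesPurgatoryError) PurgatoryErrorMarker() {}
```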
The replica will then be retried every 1 minute by default:
cockroach/pkg/kv/kvserver/replicate_queue.go, lines 74 to 78 in 972c379:

```go
// replicateQueuePurgatoryCheckInterval is the interval at which replicas in
// the replicate queue purgatory are re-attempted. Note that these replicas
// may be re-attempted more frequently by the replicateQueue in case there are
// gossip updates that might affect allocation decisions.
replicateQueuePurgatoryCheckInterval = 1 * time.Minute
```
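For readers unfamiliar with purgatory, the retry behavior described above amounts to roughly this loop. This is a generic Go sketch, not the queue's actual implementation, and the function and channel names are made up.

```go
import (
	"context"
	"time"
)

// Generic sketch of a purgatory-style retry loop: items are re-attempted on a
// fixed interval, or earlier when an update signal (e.g. gossip) fires.
// Illustrative only; the real replicate queue purgatory differs in detail.
func purgatoryLoop(ctx context.Context, interval time.Duration, updateCh <-chan struct{}, retry func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			retry(ctx) // periodic re-attempt (1 minute by default, per the snippet above)
		case <-updateCh:
			retry(ctx) // eager re-attempt when relevant state changes
		case <-ctx.Done():
			return
		}
	}
}
```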
Reviewed 7 of 7 files at r1, 7 of 7 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
-- commits
line 2 at r1:
Should this commit be here?
pkg/kv/kvserver/replicate_queue.go
line 84 at r3 (raw file):
```go
replicateQueueTimerDuration = 0    // zero duration to process replication greedily
replicateQueuePriorityHigh  = 1000 // see AllocatorAction.Priority
```
How did you settle on this priority? Is it deliberately the same as removing a dead voter?
cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go, lines 254 to 259 in 972c379:

```go
case AllocatorReplaceDecommissioningVoter:
	return 5000
case AllocatorRemoveDeadVoter:
	return 1000
case AllocatorRemoveDecommissioningVoter:
	return 900
```
For context: resolving a violated constraint with the right number of voters+non-voters is done at the AllocatorConsiderRebalance priority level, 0.
Previously, `ConjunctionsCheck` took a store descriptor as an input. However, it only needed to know the store/node attributes and locality. Some upcoming callers (lease acquisition) can't easily construct a full store descriptor, since the locking order would cause deadlocks.

This patch changes it to only take the attributes and locality instead of the entire store descriptor, and renames it to `CheckConjunction()`. It also adds `CheckStoreConjunction()` as a convenience method that takes a store descriptor, and migrates all existing callers.

Epic: none

Release note: None
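The shape of that refactor, sketched from the description above. Signatures are paraphrased rather than copied from the patch, and `checkConstraint` is a stub standing in for the real per-constraint matching logic.

```go
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// checkConstraint is a stand-in for the real per-constraint matching logic
// (attribute and locality-tier comparison); its details are elided here.
func checkConstraint(
	storeAttrs, nodeAttrs roachpb.Attributes, nodeLocality roachpb.Locality, c roachpb.Constraint,
) bool {
	// Elided: compare c.Key/c.Value against the locality tiers and attributes,
	// honoring c.Type (required vs. prohibited).
	_, _, _, _ = storeAttrs, nodeAttrs, nodeLocality, c
	return true
}

// CheckConjunction reports whether the given store/node attributes and node
// locality satisfy every constraint in the conjunction (paraphrased signature).
func CheckConjunction(
	storeAttrs, nodeAttrs roachpb.Attributes,
	nodeLocality roachpb.Locality,
	constraints []roachpb.Constraint,
) bool {
	for _, c := range constraints {
		if !checkConstraint(storeAttrs, nodeAttrs, nodeLocality, c) {
			return false
		}
	}
	return true
}

// CheckStoreConjunction is the convenience wrapper that keeps the old
// store-descriptor shape for existing callers.
func CheckStoreConjunction(sd roachpb.StoreDescriptor, constraints []roachpb.Constraint) bool {
	return CheckConjunction(sd.Attrs, sd.Node.Attrs, sd.Node.Locality, constraints)
}
```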
When a leaseholder is lost, any surviving replica may acquire the lease, even if it violates lease preferences. There are two main reasons for this: we need to elect a new Raft leader, who will acquire the lease, and leader election is agnostic to lease preferences; and there may not be any surviving replicas that satisfy the lease preferences at all, so we don't want to keep the range unavailable while we figure this out (network timeouts can delay this for many seconds, for example).

However, after acquiring a lease, we rely on the replicate queue to transfer the lease back to a replica that conforms with the preferences, which can take several minutes. In multi-region clusters, this can cause severe latency degradation if the lease is acquired in a remote region.

This patch detects lease preference violations when a replica acquires a new lease, and eagerly enqueues it in the replicate queue for transfer (if possible).

Epic: none

Release note (bug fix): When losing a leaseholder and using lease preferences, the lease can be acquired by any other replica (regardless of lease preferences) in order to restore availability as soon as possible. The new leaseholder will now immediately check if it violates the lease preferences, and attempt to transfer the lease to a replica that satisfies the preferences if possible.
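In code terms, the hook this describes is roughly the following sketch. The interfaces are stand-ins for the replica and the replicate queue, the `AddAsync` method and the priority constant are taken from the discussion in this review, and none of it is the literal diff.

```go
import "context"

// leaseChecker and enqueuer are stand-ins for the replica and replicate queue.
type leaseChecker interface {
	LeaseViolatesPreferences(ctx context.Context) bool
}

type enqueuer interface {
	AddAsync(ctx context.Context, repl leaseChecker, priority float64)
}

const replicateQueuePriorityHigh = 1000 // value from r3; the review discusses lowering this to 0

// maybeEnqueueAfterLeaseAcquisition sketches the post-acquisition hook: if the
// freshly acquired lease violates the preferences, enqueue the replica eagerly
// so the transfer back to a preferred replica is attempted promptly, rather
// than waiting for the next replica scanner pass.
func maybeEnqueueAfterLeaseAcquisition(ctx context.Context, repl leaseChecker, q enqueuer) {
	if !repl.LeaseViolatesPreferences(ctx) {
		return
	}
	q.AddAsync(ctx, repl, replicateQueuePriorityHigh)
}
```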
This patch places a replica in the replicate queue purgatory when it has a lease that violates the lease preferences and it's unable to find a suitable target. This causes the replica to be retried more often. It will only trigger when replicas are eagerly enqueued (typically when we acquire a new lease that violates preferences), since we otherwise don't attempt to enqueue replicas when they don't have a valid lease transfer target.

Epic: none

Release note: None
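A paraphrased sketch of where that decision could sit in the planner, reusing the hypothetical `leasePreferencesPurgatoryError` from the earlier sketch; the allocator call is reduced to a callback and this is not the literal patch.

```go
import "context"

// planLeaseTransfer paraphrases the decision described above: transfer if a
// preferred target was found, otherwise park the replica in purgatory when its
// lease violates the preferences. Not the literal patch.
func planLeaseTransfer(
	ctx context.Context,
	repl interface {
		LeaseViolatesPreferences(ctx context.Context) bool
	},
	findPreferredTarget func(context.Context) (target string, ok bool), // stand-in for the allocator call
) (op string, err error) {
	if target, ok := findPreferredTarget(ctx); ok {
		return "transfer lease to " + target, nil
	}
	if repl.LeaseViolatesPreferences(ctx) {
		// No usable target right now (e.g. we're not Raft leader yet, or the
		// preferred stores are unavailable): park in purgatory and retry,
		// rather than giving up until the next scanner cycle.
		return "", leasePreferencesPurgatoryError{}
	}
	return "", nil
}
```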
Force-pushed from 03ecf87 to bb49030.
Placing the replicas in purgatory worked great, thanks for the tip! It actually succeeded in immediately moving all 1000 leases back to the preferred locations, because purgatory eagerly retries replicas when we add to it. Previously it would only manage about 800 leases or so; the remaining 200 failed because those replicas weren't the Raft leader yet.
I added a commit. Does the overall direction here make sense to you? If so, I'll clean this up and add some tests.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @kvoli)
Previously, kvoli (Austen) wrote…
Should this commit be here?
No, this was just added while testing because of #106097. Removed it now, since it repros more easily with ALTER RANGE RELOCATE anyway.
pkg/kv/kvserver/replicate_queue.go
line 84 at r3 (raw file):
Previously, kvoli (Austen) wrote…
How did you settle on this priority? Is it deliberately the same as removing a dead voter?
cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go
Lines 254 to 259 in 972c379
```go
case AllocatorReplaceDecommissioningVoter:
	return 5000
case AllocatorRemoveDeadVoter:
	return 1000
case AllocatorRemoveDecommissioningVoter:
	return 900
```
For context: resolving a violated constraint with the right number of voters+non-voters is done at the AllocatorConsiderRebalance priority level, 0.
It was mostly arbitrary. I just wanted to make sure it didn't end up at the bottom of the pile, since this is pretty important.
Overall direction makes sense to me.
It may be worthwhile looking into the overhead of checking lease preferences during acquisition, and also into the case where there are no "live" stores which satisfy the preference, so that every replica ends up in purgatory. I've rarely seen more than 1-3 lease preferences per range, but the conjunction checks do string comparisons, which might be noticeable?
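If the overhead question ever needs a number, a throwaway micro-benchmark along these lines would settle it. The `checkConjunction` below is a hypothetical stand-in doing the same kind of map lookups and string comparisons; it is not code from the tree.

```go
package constraint_test // hypothetical benchmark, not in the tree

import "testing"

// checkConjunction stands in for the real conjunction check; it does the same
// flavor of string comparisons the overhead concern is about.
func checkConjunction(locality map[string]string, prefs [][2]string) bool {
	for _, p := range prefs {
		if locality[p[0]] != p[1] {
			return false
		}
	}
	return true
}

func BenchmarkCheckConjunction(b *testing.B) {
	locality := map[string]string{"region": "us-east1", "zone": "us-east1-b"}
	prefs := [][2]string{{"region", "us-east1"}, {"zone", "us-east1-b"}}
	for i := 0; i < b.N; i++ {
		_ = checkConjunction(locality, prefs)
	}
}
```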
Left a question on the priority.
Reviewed 10 of 10 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/replicate_queue.go
line 84 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
It was mostly arbitrarily. I just wanted to make sure it didn't end up at the bottom of the pile, since this is pretty important.
During voter up-replication (decommissioning, general upreplication, dead node), lease preferences may still not be enforced because they are sitting behind ranges which need to be up-replicated.
Is this fine? It is certainly better than before.
overhead of checking lease preferences during acquisition
I think this is negligible considering the other work we do during acquisition.
if there's no "live" stores which satisfy the preference, so every replica ends up in purgatory
Yeah, this is the bit I'm worried about. Consider e.g. someone setting a bogus +foo preference, which will put all replicas in purgatory forever. I may extend the check to at least see if the preferences can be satisfied at all (assuming all nodes are live), and maybe also omit it if they can't currently be satisfied because all valid targets are unavailable. In the latter case I think it would be beneficial to put it in purgatory, to recover as soon as the nodes come back online, but it's certainly worth checking the overhead here -- and maybe even skipping that bit in the backport.
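A sketch of that guard, assuming access to all store descriptors and the `CheckStoreConjunction` helper sketched earlier; the function name and shape are illustrative.

```go
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// Illustrative sketch: only treat the violation as actionable (and
// purgatory-worthy) if at least one store in the cluster could satisfy some
// preference, assuming every node were live. This guards against a bogus
// "+foo" preference pinning replicas in purgatory forever.
func preferencesSatisfiableAtAll(
	prefs []roachpb.LeasePreference, allStores []roachpb.StoreDescriptor,
) bool {
	for _, pref := range prefs {
		for _, sd := range allStores {
			if CheckStoreConjunction(sd, pref.Constraints) { // helper sketched earlier
				return true
			}
		}
	}
	return false
}
```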
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/replicate_queue.go
line 84 at r3 (raw file):
Previously, kvoli (Austen) wrote…
During voter up-replication (decommissioning, general upreplication, dead node), lease preferences may still not be enforced because they are sitting behind ranges which need to be up-replicated.
Is this fine? It is certainly better than before.
Yeah, I feel like we may want to move this to the top, considering lease transfers are cheap. Wdyt? On the other hand, we certainly don't want to starve out upreplication, since it leaves us vulnerable to quorum loss.
I think this is negligible considering the other work we do during acquisition.
Good to know.
I may extend the check to at least see if the preferences can be satisfied at all (assuming all nodes are live), and maybe also to omit it if they can't be satisfied currently because all valid targets are unavailable.
That is a good idea. Another case is where there are no existing voter stores which satisfy the preference, but there are other stores which do. There is no logic that will rebalance replicas in order to then transfer the lease to satisfy a preference, so purgatory wouldn't be too helpful in that case.
cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go, lines 2811 to 2813 in 17814d8:

```go
// replicas that meet lease preferences (among the `existing` replicas).
func (a Allocator) PreferredLeaseholders(
	storePool storepool.AllocatorStorePool,
```
Perhaps we could scope the purgatory criteria to just when there are replicas in need of a snapshot?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/replicate_queue.go
line 84 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yeah, I feel like we may want to move this to the top, considering lease transfers are cheap. Wdyt? On the other hand, we certainly don't want to starve out upreplication, since it leaves us vulnerable to quorum loss.
It should be cheap and quick normally, but this feels like a change that requires more extensive testing if pursued. I don't think we can reason that it will be fine without trying to break it.
Do you plan to backport this part of the change?
Perhaps we could scope the purgatory criteria to just when there are replicas in need of a snapshot?
I started with that, but the plumbing got a bit annoying, and I figured we'd want to eagerly try to get them back to the preferred regions as soon as they recovered anyway. But it might make sense for a backport, and we can be more eager for 23.2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/replicate_queue.go
line 84 at r3 (raw file):
Previously, kvoli (Austen) wrote…
It should be cheap and quick normally, but this feels like a change that requires more extensive testing if pursued. I don't think we can reason that it will be fine without trying to break it.
Do you plan to backport this part of the change?
I was, but we can also use priority 0 if we feel like that's safer, and bump the priority for 23.2. Probably better.
Commented on the line where the purgatory error is created.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/replicate_queue.go
line 84 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I was, but we can also use priority 0 if we feel like that's safer, and bump the priority for 23.2. Probably better.
Using priority 0 SGTM. That is the priority currently assigned to a lease preference violation: `return true, 0`
pkg/kv/kvserver/allocator/plan/replicate.go
line 952 at r6 (raw file):
```go
// leadership, which prevents us from finding appropriate lease targets
// since we can't determine if any are behind.
if repl.LeaseViolatesPreferences(ctx) {
```
Continuing the discussion about purgatory criteria here.
An alternative could be checking repl.LeaseViolatesPreferences and most of allocator.leaseholderShouldMoveDueToPreferences:
cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go, lines 2133 to 2136 in 938e484:

```go
// leaseholderShouldMoveDueToPreferences returns true if the current leaseholder
// is in violation of lease preferences _that can otherwise be satisfied_ by
// some existing replica.
```
We don't want to include the filtering on replicas:
Suspect replicas:
cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go, lines 2166 to 2169 in 938e484:

```go
// Exclude suspect/draining/dead stores.
candidates, _ := storePool.LiveAndDeadReplicas(
	allExistingReplicas, false, /* includeSuspectAndDrainingStores */
)
```
Replicas in need of a snapshot:
cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go, lines 2173 to 2177 in 938e484:

```go
preferred = excludeReplicasInNeedOfSnapshots(
	ctx, leaseRepl.RaftStatus(), leaseRepl.GetFirstIndex(), preferred)
if len(preferred) == 0 {
	return false
}
```
Excluding draining/dead stores but not suspect stores is doable but annoying using the storepool.
There's only the option to exclude both suspect and draining, but not just draining. Perhaps it would be fine to include draining stores in the criteria or ignore both.
So a purgatory error is returned when the lease violates the preferences and there is an existing voter replica on a store which is not dead/suspect/draining/unknown (or just dead/draining/unknown, see above) and satisfies the preferences.
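Spelled out as a predicate, the criteria above look roughly like this. The two callbacks are hypothetical stand-ins for the preference check and the store-pool dead/suspect/draining/unknown checks discussed in this thread.

```go
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// Sketch of the purgatory criteria summarized above; not actual planner code.
func shouldReturnPurgatoryError(
	leaseViolatesPreferences bool,
	voters []roachpb.ReplicaDescriptor,
	satisfiesPreferences func(roachpb.StoreID) bool,
	storeIsUsable func(roachpb.StoreID) bool,
) bool {
	if !leaseViolatesPreferences {
		return false
	}
	for _, v := range voters {
		// There is somewhere reasonable for the lease to go, so it's worth
		// parking the replica in purgatory and retrying until it gets there.
		if satisfiesPreferences(v.StoreID) && storeIsUsable(v.StoreID) {
			return true
		}
	}
	return false
}
```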
We could add an option to allocator.ShouldTransferLease which disables checking for replicas in need of a snapshot (not ideal, but seems simple): `func (a *Allocator) ShouldTransferLease(`
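One way to express such a knob without touching every caller is a small options struct or functional option. A sketch under that assumption; ShouldTransferLease's real signature and wiring are not reproduced here, and the names below are illustrative.

```go
// Sketch of the suggested option; not the allocator's actual API.
type transferLeaseOptions struct {
	excludeReplicasInNeedOfSnapshot bool
}

type TransferLeaseOption func(*transferLeaseOptions)

// WithoutSnapshotCheck keeps candidates we cannot (yet) prove are caught up,
// e.g. while the new leaseholder hasn't acquired Raft leadership and has no
// follower progress to inspect.
func WithoutSnapshotCheck() TransferLeaseOption {
	return func(o *transferLeaseOptions) { o.excludeReplicasInNeedOfSnapshot = false }
}

func defaultTransferLeaseOptions() transferLeaseOptions {
	return transferLeaseOptions{excludeReplicasInNeedOfSnapshot: true}
}
```

Callers that only care about preference enforcement on acquisition could pass WithoutSnapshotCheck(), while every existing call site keeps the default behavior.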
Closing in favor of #107507.
When a leaseholder is lost, any surviving replica may acquire the lease, even if it violates lease preferences. There are two main reasons for this: we need to elect a new Raft leader, who will acquire the lease, and leader election is agnostic to lease preferences; and there may not be any surviving replicas that satisfy the lease preferences at all, so we don't want to keep the range unavailable while we figure this out (network timeouts can delay this for many seconds, for example).
However, after acquiring a lease, we rely on the replicate queue to transfer the lease back to a replica that conforms with the preferences, which can take several minutes. In multi-region clusters, this can cause severe latency degradation if the lease is acquired in a remote region.
This patch will detect lease preference violations when a replica acquires a new lease, and eagerly enqueue it in the replicate queue for transfer (if possible).
Resolves #106100.
Epic: none
Release note (bug fix): When losing a leaseholder and using lease preferences, the lease can be acquired by any other replica (regardless of lease preferences) in order to restore availability as soon as possible. The new leaseholder will now immediately check if it violates the lease preferences, and attempt to transfer the lease to a replica that satisfies the preferences if possible.