release-23.1: kvserver: transfer lease when acquiring lease outside preferences #107625
Conversation
Force-pushed from fcf9888 to 4ac5754.
Rebased on
With #107655 backported to release 23.1, the initial preference conformance speed is similar to master. Should be good to take a look!
Can we let this sit on master for a few days before backporting? This type of change (reordering of the priority within the replicate_queue) often has subtle and unexpected consequences, and letting it bake first will help.
LGTM, but agree with letting this bake on master for a bit.
Sounds good. I'll hold off merging.
Baked on master for 5 days. TYFTRs!
Force-pushed from 4ac5754 to 7ded956.
TYFTRs! Rebased on
Previously, there were no e2e tests which assert on lease preference conformance. This commit adds three `lease-preferences` tests, which assert that lease preferences are conformed to. The time taken to conform is also reported in the test logs.

The first test, `.../partial-first-preference-down`, takes down one of the two preferred-locality nodes. The second test, `.../full-first-preference-down`, takes down both preferred-locality nodes. The third test, `.../manual-violating-transfer`, manually transfers leases to a locality that violates the configured lease preferences.

Informs: cockroachdb#106100

Epic: none

Release note: None
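The conformance-time reporting could be approximated with a small polling helper like the sketch below. This is a generic, illustrative Go program, not the actual roachtest code; `waitForLeasePreferenceConformance` and `checkConformance` are hypothetical names, and a real test would query the cluster rather than use the fake check in `main`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForLeasePreferenceConformance polls a caller-supplied conformance check
// until it reports true and returns how long that took. The tests described
// above log this duration so regressions in conformance speed are visible.
func waitForLeasePreferenceConformance(
	ctx context.Context,
	checkConformance func(context.Context) (bool, error),
	interval time.Duration,
) (time.Duration, error) {
	start := time.Now()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := checkConformance(ctx)
		if err != nil {
			return 0, err
		}
		if ok {
			return time.Since(start), nil
		}
		select {
		case <-ctx.Done():
			return 0, errors.New("timed out waiting for lease preference conformance")
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// A fake conformance check that succeeds after roughly one second, purely
	// for illustration; a real test would inspect the cluster instead.
	conformAt := time.Now().Add(time.Second)
	took, err := waitForLeasePreferenceConformance(ctx, func(context.Context) (bool, error) {
		return time.Now().After(conformAt), nil
	}, 100*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("lease preferences conformed in %s\n", took)
}
```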
When a leaseholder is lost, any surviving replica may acquire the lease, even if it violates the lease preferences. There are two main reasons for this: we need to elect a new Raft leader, who will acquire the lease regardless of lease preferences, and there may not be any surviving replicas that satisfy the preferences at all, so we don't want to keep the range unavailable while we figure this out (network timeouts, for example, can delay this for many seconds).

However, after acquiring a lease, we rely on the replicate queue to transfer the lease back to a replica that conforms with the preferences, which can take several minutes. In multi-region clusters, this can cause severe latency degradation if the lease is acquired in a remote region.

This patch detects lease preference violations when a replica acquires a new lease and eagerly enqueues the replica in the replicate queue for transfer (if possible). The behavior is enabled by default and can be turned off via the cluster setting `kv.lease.check_preferences_on_acquisition.enabled`.

Epic: none

Release note (bug fix): When losing a leaseholder and using lease preferences, the lease can be acquired by any other replica (regardless of lease preferences) in order to restore availability as soon as possible. The new leaseholder will now immediately check if it violates the lease preferences, and attempt to transfer the lease to a replica that satisfies the preferences if possible.

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
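For illustration, here is a minimal, self-contained Go sketch of the acquisition-time check described above. The types and function names (`preference`, `replicaDesc`, `leaseViolatesPreferences`, `onLeaseAcquired`) are hypothetical stand-ins and do not mirror the actual CockroachDB identifiers; only the referenced cluster setting name comes from the patch itself.

```go
package main

import "fmt"

// preference is a simplified stand-in for a lease preference: a set of
// locality constraints the leaseholder should satisfy.
type preference struct {
	constraints map[string]string // e.g. {"region": "us-east1"}
}

// replicaDesc is a simplified stand-in for a replica and its locality.
type replicaDesc struct {
	storeID  int
	locality map[string]string
}

// satisfies reports whether the replica's locality matches every constraint.
func (p preference) satisfies(r replicaDesc) bool {
	for k, v := range p.constraints {
		if r.locality[k] != v {
			return false
		}
	}
	return true
}

// leaseViolatesPreferences reports whether the new leaseholder satisfies none
// of the configured preferences. No preferences means nothing is violated.
func leaseViolatesPreferences(leaseholder replicaDesc, prefs []preference) bool {
	if len(prefs) == 0 {
		return false
	}
	for _, p := range prefs {
		if p.satisfies(leaseholder) {
			return false
		}
	}
	return true
}

// onLeaseAcquired mimics the eager enqueue: if the freshly acquired lease
// violates the preferences (and the check is enabled, as governed in the
// patch by kv.lease.check_preferences_on_acquisition.enabled), hand the
// replica to the replicate queue instead of waiting for the next scanner
// cycle.
func onLeaseAcquired(leaseholder replicaDesc, prefs []preference, checkEnabled bool, enqueue func(replicaDesc)) {
	if checkEnabled && leaseViolatesPreferences(leaseholder, prefs) {
		enqueue(leaseholder)
	}
}

func main() {
	prefs := []preference{{constraints: map[string]string{"region": "us-east1"}}}
	newLeaseholder := replicaDesc{storeID: 3, locality: map[string]string{"region": "europe-west1"}}
	onLeaseAcquired(newLeaseholder, prefs, true /* checkEnabled */, func(r replicaDesc) {
		fmt.Printf("store %d enqueued for lease transfer\n", r.storeID)
	})
}
```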
This patch places a replica in the replicate queue purgatory when it holds a lease that violates the lease preferences and no suitable transfer target can be found. This causes the replica to be retried more often. This will only trigger when replicas are eagerly enqueued (typically when we acquire a new lease that violates preferences), since we otherwise don't attempt to enqueue replicas when they don't have a valid lease transfer target.

This patch also enables requeuing replicas after a successful rebalance, when the lease violates preferences.

Epic: none

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
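Continuing the simplified types from the previous sketch, the purgatory behavior could look roughly like the following. `noTargetPurgatoryError` and `processForLeaseTransfer` are made-up names for illustration; the real replicate queue code differs.

```go
// noTargetPurgatoryError is an illustrative stand-in for the purgatory error:
// returning it from processing parks the replica in purgatory, where it is
// retried more often than the regular scanner cadence would allow.
type noTargetPurgatoryError struct{ rangeID int }

func (e noTargetPurgatoryError) Error() string {
	return fmt.Sprintf("r%d: lease violates preferences but no transfer target found", e.rangeID)
}

// processForLeaseTransfer attempts to move the lease to a preference-conforming
// replica. If no candidate currently satisfies any preference, it signals
// purgatory instead of silently waiting for the next scanner cycle.
func processForLeaseTransfer(
	rangeID int, candidates []replicaDesc, prefs []preference, transfer func(replicaDesc) error,
) error {
	for _, c := range candidates {
		for _, p := range prefs {
			if p.satisfies(c) {
				return transfer(c)
			}
		}
	}
	return noTargetPurgatoryError{rangeID: rangeID}
}
```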
In cockroachdb#107507, we began eagerly enqueuing into the replicate queue when a replica acquired a new lease which violated the lease preferences. Lease preferences were only considered violated when the lease itself was valid. In cockroachdb#107507, we saw that it is uncommon, but possible, for an invalid lease to be acquired, violate the lease preferences, and not be enqueued as a result. The end result was a leaseholder violating the applied lease preferences, which would not be resolved until the next scanner cycle.

Update the eager enqueue check to only check that the replica is the incoming leaseholder when applying the lease, and that the replica violates the applied lease preferences. The check now applies on any lease acquisition, where previously it only occurred when the leaseholder changed.

Note the purgatory error introduced in cockroachdb#107507 still checks that the lease is valid and owned by the store before proceeding. The lease must be valid and owned by the store for a change to be planned, so while it is possible for the lease to become invalid somewhere in between planning, the replica will still be enqueued once it applies a valid lease, so purgatory is unnecessary.

Fixes: cockroachdb#107862

Release note: None
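A hedged sketch of the adjusted check, again reusing the simplified types above. `leaseAppliedState` and `shouldEagerlyEnqueue` are hypothetical names; the point is only that the check no longer gates on prior lease validity or on the leaseholder having changed.

```go
// leaseAppliedState is an illustrative stand-in for what a replica knows when
// it applies a lease, whether or not the previous lease was still valid.
type leaseAppliedState struct {
	isIncomingLeaseholder bool // this store owns the newly applied lease
	prevLeaseValid        bool // the old check also gated on this; no longer required
	leaseholderChanged    bool // the old check also gated on this; no longer required
}

// shouldEagerlyEnqueue is the adjusted check: enqueue on any lease acquisition
// where this replica is the incoming leaseholder and the applied lease
// violates the preferences, regardless of whether the previous lease was
// valid or whether the leaseholder changed.
func shouldEagerlyEnqueue(s leaseAppliedState, leaseholder replicaDesc, prefs []preference) bool {
	return s.isIncomingLeaseholder && leaseViolatesPreferences(leaseholder, prefs)
}
```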
Force-pushed from 7ded956 to dc05a48.
I cherry-picked #107967 on top.
LGTM
I'm going to merge this tomorrow, barring any test failures overnight on master.
I would let this bake a little longer on master still, since there were new changes. These changes are a little hard to track down, and there is no rush since we aren't going to pull this into a 23.1 release for a few more days. Maybe merge on Monday? The impact of these types of changes is hard to predict, as there is often a lot of assumed behavior in both tests and existing code.
Backport 3/3 commits from #107507.
Backport 1/1 commits from #107967.
/cc @cockroachdb/release
When a leaseholder is lost, any surviving replica may acquire the lease, even if it violates the lease preferences. There are two main reasons for this: we need to elect a new Raft leader, who will acquire the lease regardless of lease preferences, and there may not be any surviving replicas that satisfy the preferences at all, so we don't want to keep the range unavailable while we figure this out (network timeouts, for example, can delay this for many seconds).
However, after acquiring a lease, we rely on the replicate queue to transfer the lease back to a replica that conforms with the preferences, which can take several minutes. In multi-region clusters, this can cause severe latency degradation if the lease is acquired in a remote region.
This PR detects lease preference violations when a replica acquires a new lease and eagerly enqueues it in the replicate queue for transfer (if possible). If the initial transfer attempt fails, the replica is sent to purgatory and retried soon after.
Additionally, two roachtests are added for lease preference conformance timing. The first test, .../partial-first-preference-down, takes down one of the preferred locality nodes (1/2). The second test, .../full-first-preference-down, takes down both the preferred locality nodes (2/2).
Resolves #106100.
Epic: none
Release note (bug fix): When losing a leaseholder and using lease preferences, the lease can be acquired by any other replica (regardless of lease preferences) in order to restore availability as soon as possible. The new leaseholder will now immediately check if it violates the lease preferences, and attempt to transfer the lease to a replica that satisfies the preferences if possible.
Release justification: Significantly improves lease preference conformance speed in the common case. User ask.