
release-23.1: kvserver: transfer lease when acquiring lease outside preferences #107625

Merged 4 commits into cockroachdb:release-23.1 from backport23.1-107507 on Aug 7, 2023

Conversation

@kvoli (Collaborator) commented Jul 26, 2023

Backport 3/3 commits from #107507.
Backport 1/1 commits from #107967.

/cc @cockroachdb/release


When a leaseholder is lost, any surviving replica may acquire the lease, even if it violates lease preferences. There are two main reasons for this: we need to elect a new Raft leader who will acquire the lease, which is agnostic to lease preferences, and there may not even be any surviving replicas that satisfy the lease preferences at all, so we don't want to keep the range unavailable while we try to figure this out (considering e.g. network timeouts can delay this for many seconds).

However, after acquiring a lease, we rely on the replicate queue to transfer the lease back to a replica that conforms with the preferences, which can take several minutes. In multi-region clusters, this can cause severe latency degradation if the lease is acquired in a remote region.

This PR detects lease preference violations when a replica acquires a new lease, and eagerly enqueues it in the replicate queue for transfer (if possible). If that first transfer attempt fails, the replica is sent to purgatory and retried soon after.
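As a rough, self-contained sketch of that flow (illustrative only — these are simplified stand-in types and names, not CockroachDB's actual kvserver code):

```go
package main

import "fmt"

// Simplified stand-ins for the real kvserver types: an illustrative model of
// the flow described above, not CockroachDB's actual implementation.
type LeasePreference struct{ RequiredRegion string }

type Replica struct {
	Region      string
	HoldsLease  bool
	Preferences []LeasePreference
}

type ReplicateQueue struct{ pending []*Replica }

// Enqueue models eagerly adding a replica for a lease-transfer attempt.
func (q *ReplicateQueue) Enqueue(r *Replica) { q.pending = append(q.pending, r) }

// violatesPreferences reports whether the leaseholder sits outside every
// configured preference (no preferences means nothing can be violated).
func violatesPreferences(r *Replica) bool {
	if len(r.Preferences) == 0 {
		return false
	}
	for _, p := range r.Preferences {
		if r.Region == p.RequiredRegion {
			return false
		}
	}
	return true
}

// onLeaseApplied mirrors the behavior added by this PR: when a replica applies
// a newly acquired lease and violates the preferences, enqueue it immediately
// instead of waiting for the next replicate queue scanner cycle.
func onLeaseApplied(r *Replica, q *ReplicateQueue) {
	if r.HoldsLease && violatesPreferences(r) {
		q.Enqueue(r)
	}
}

func main() {
	q := &ReplicateQueue{}
	r := &Replica{
		Region:      "us-west",
		HoldsLease:  true,
		Preferences: []LeasePreference{{RequiredRegion: "us-east"}},
	}
	onLeaseApplied(r, q)
	fmt.Println("replicas queued for lease transfer:", len(q.pending))
}
```

The real implementation hooks this check into lease application and reuses the existing replicate queue machinery; the toy above only captures the decision and the eager enqueue.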

Additionally, two roachtests are added for lease preference conformance timing. The first test, .../partial-first-preference-down, takes down one of the preferred locality nodes (1/2). The second test, .../full-first-preference-down, takes down both the preferred locality nodes (2/2).

Resolves #106100.
Epic: none

Release note (bug fix): When losing a leaseholder and using lease preferences, the lease can be acquired by any other replica (regardless of lease preferences) in order to restore availability as soon as possible. The new leaseholder will now immediately check if it violates the lease preferences, and attempt to transfer the lease to a replica that satisfies the preferences if possible.


Release justification: Increases lease preference conformance significantly in the common case. User ask.

@kvoli kvoli self-assigned this Jul 26, 2023
@cockroach-teamcity (Member) commented: This change is Reviewable

@kvoli kvoli changed the title from "kvserver: transfer lease when acquiring lease outside preferences" to "release-23.1: kvserver: transfer lease when acquiring lease outside preferences" on Jul 26, 2023
@kvoli kvoli force-pushed the backport23.1-107507 branch 3 times, most recently from fcf9888 to 4ac5754 on July 26, 2023 17:07
@kvoli (Collaborator, Author) commented Jul 26, 2023

Rebased on release-23.1 to pick up dependency backport #107622, which merged.

@kvoli kvoli marked this pull request as ready for review July 26, 2023 19:09
@kvoli kvoli requested review from a team as code owners July 26, 2023 19:09
@kvoli kvoli requested review from a team, herkolategan, renatolabs, erikgrinaker and andrewbaptist and removed request for a team July 26, 2023 19:09
@kvoli (Collaborator, Author) commented Jul 26, 2023

With #107655 backported to release 23.1, the initial preference conformance speed is similar to master.

Should be good to take a look!

@andrewbaptist (Collaborator) commented:

Can we let this sit on master for a few days before backporting? This type of change (reordering of the priority within the replicate_queue) often has subtle and unexpected consequences, and letting it bake first will help.

@erikgrinaker previously approved these changes Jul 27, 2023

@erikgrinaker (Contributor) left a comment:

LGTM, but agree with letting this bake on master for a bit.

@kvoli (Collaborator, Author) commented Jul 27, 2023

Sounds good. I'll hold off merging.

@kvoli (Collaborator, Author) commented Jul 31, 2023

Baked on master for 5 days.

TYFTRs!

@kvoli kvoli force-pushed the backport23.1-107507 branch from 4ac5754 to 7ded956 on July 31, 2023 13:59
@kvoli (Collaborator, Author) commented Jul 31, 2023

TYFTRs!

Rebased on release-23.1. I'll merge once green.

@kvoli (Collaborator, Author) commented Jul 31, 2023

lease-preferences/full-first-preference-down failed on master last night (#107862). I'm going to hold off merging until the failure is understood.

@kvoli (Collaborator, Author) commented Aug 1, 2023

We understand why #107862 failed. #107967 is a candidate fix.

kvoli and others added 4 commits August 3, 2023 13:17
Previously, there were no e2e tests which assert on lease preference
conformance. This commit adds three `lease-preferences` tests, which
assert that lease preferences are conformed to. The time taken to
conform is also reported in the test logs.

The first test, `.../partial-first-preference-down`, takes down one of
the preferred locality nodes (1/2).

The second test, `.../full-first-preference-down`, takes down both the
preferred locality nodes (2/2).

The third test, `.../manual-violating-transfer`, manually transfers
leases to a locality which violates the configured leases preferences.

Informs: cockroachdb#106100
Epic: none

Release note: None
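
Conceptually, the conformance assertion these tests make boils down to polling leaseholder localities until every lease sits in a preferred locality, and reporting how long that took. A minimal self-contained sketch (not the actual roachtest code; `leaseholderRegions` is a stub standing in for querying the cluster under test):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// leaseholderRegions is a stub standing in for querying the cluster for the
// locality of each range's leaseholder; the real tests read this from the
// cluster under test.
func leaseholderRegions() []string {
	return []string{"us-east", "us-east"} // stubbed as already conforming
}

// waitForConformance polls until every lease sits in the preferred region or
// the timeout elapses, returning how long conformance took.
func waitForConformance(preferred string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	for time.Since(start) < timeout {
		conformed := true
		for _, region := range leaseholderRegions() {
			if region != preferred {
				conformed = false
				break
			}
		}
		if conformed {
			return time.Since(start), nil
		}
		time.Sleep(time.Second)
	}
	return 0, errors.New("lease preferences did not conform before the timeout")
}

func main() {
	took, err := waitForConformance("us-east", 5*time.Minute)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("lease preference conformance took", took)
}
```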
When a leaseholder is lost, any surviving replica may acquire the lease,
even if it violates lease preferences. There are two main reasons for
this: we need to elect a new Raft leader who will acquire the lease,
which is agnostic to lease preferences, and there may not even be any
surviving replicas that satisfy the lease preferences at all, so we
don't want to keep the range unavailable while we try to figure this out
(considering e.g. network timeouts can delay this for many seconds).

However, after acquiring a lease, we rely on the replicate queue to
transfer the lease back to a replica that conforms with the preferences,
which can take several minutes. In multi-region clusters, this can cause
severe latency degradation if the lease is acquired in a remote region.

This patch will detect lease preference violations when a replica
acquires a new lease, and eagerly enqueue it in the replicate queue for
transfer (if possible).

This behavior is on by default and can be turned off via the cluster
setting `kv.lease.check_preferences_on_acquisition.enabled`.
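
A minimal sketch of that gate, modeled with a plain variable rather than the real cluster settings machinery (illustrative only):

```go
package main

import "fmt"

// checkPreferencesOnAcquisition stands in for the cluster setting
// kv.lease.check_preferences_on_acquisition.enabled (default: true); the real
// code reads it through the settings machinery rather than a package variable.
var checkPreferencesOnAcquisition = true

// shouldEagerlyEnqueue applies the kill switch before the preference check.
func shouldEagerlyEnqueue(violatesPreferences bool) bool {
	if !checkPreferencesOnAcquisition {
		return false // feature disabled: fall back to the periodic scanner only
	}
	return violatesPreferences
}

func main() {
	fmt.Println("enqueue while enabled:", shouldEagerlyEnqueue(true))
	checkPreferencesOnAcquisition = false
	fmt.Println("enqueue while disabled:", shouldEagerlyEnqueue(true))
}
```

On a live cluster, the setting itself would be toggled with `SET CLUSTER SETTING kv.lease.check_preferences_on_acquisition.enabled = false;`.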

Epic: none
Release note (bug fix): When losing a leaseholder and using lease
preferences, the lease can be acquired by any other replica (regardless
of lease preferences) in order to restore availability as soon as
possible. The new leaseholder will now immediately check if it violates
the lease preferences, and attempt to transfer the lease to a replica
that satisfies the preferences if possible.

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
This patch places a replica in the replicate queue purgatory when it has
a lease that violates the lease preferences and it is unable to find a
suitable transfer target. This causes the replica to be retried more often.

This will only trigger when replicas are eagerly enqueued (typically
when we acquire a new lease that violates preferences), since we
otherwise don't attempt to enqueue replicas when they don't have a valid
lease transfer target.

This patch also enables requeuing replicas after a successful rebalance
when the lease violates preferences.
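
A toy model of the purgatory behavior described above (illustrative only; `park` and `retry` are made-up names, and the real queue retries purgatory on a timer rather than a manual call):

```go
package main

import (
	"fmt"
	"time"
)

// replicaID and purgatory are toy stand-ins for the real replicate queue
// purgatory: replicas that cannot be processed right now are parked and
// retried on a short interval rather than waiting for the next scanner cycle.
type replicaID int

type purgatory struct {
	parked map[replicaID]error
}

// park records why a replica could not be processed, e.g. no suitable lease
// transfer target satisfying the preferences was available.
func (p *purgatory) park(id replicaID, reason error) {
	if p.parked == nil {
		p.parked = map[replicaID]error{}
	}
	p.parked[id] = reason
}

// retry re-processes every parked replica; successes leave purgatory.
func (p *purgatory) retry(process func(replicaID) error) {
	for id := range p.parked {
		if err := process(id); err == nil {
			delete(p.parked, id)
		}
	}
}

func main() {
	p := &purgatory{}
	p.park(7, fmt.Errorf("no suitable lease transfer target"))
	// The real queue retries purgatory on a timer; a short sleep stands in for
	// that faster-than-scanner cadence here.
	time.Sleep(10 * time.Millisecond)
	p.retry(func(replicaID) error { return nil }) // transfer succeeds this time
	fmt.Println("replicas still in purgatory:", len(p.parked))
}
```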

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
In cockroachdb#107507, we began eagerly enqueuing a replica into the
replicate queue when it acquired a new lease which violated the lease
preferences. Lease preferences were only considered violated when the
lease itself was valid. In cockroachdb#107507, we saw that it is
uncommon, but possible, for an invalid lease to be acquired, violate the
lease preferences, and not be enqueued as a result. The end result was a
leaseholder violating the applied lease preferences, which would not be
resolved until the next scanner cycle.

Update the eager enqueue check to only verify that the replica is the
incoming leaseholder when applying the lease, and that the replica
violates the applied lease preferences. The check now applies on any
lease acquisition, where previously it only fired when the leaseholder
changed.

Note that the purgatory error introduced in cockroachdb#107507 still
checks that the lease is valid and owned by the store before proceeding.
A change can only be planned when the lease is valid and owned by the
store, so while it is possible for the lease to become invalid somewhere
in between planning attempts, the replica will be enqueued again once it
applies a valid lease, making purgatory unnecessary in that case.
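
To make the changed condition concrete, here is a simplified before/after sketch of the enqueue check (illustrative booleans only, not the real code):

```go
package main

import "fmt"

// leaseState captures the inputs to the enqueue decision (illustrative only).
type leaseState struct {
	isIncomingLeaseholder bool // this replica is applying the lease for itself
	leaseValid            bool // lease considered valid at evaluation time
	violatesPreferences   bool // leaseholder sits outside the configured preferences
}

// oldShouldEnqueue models the pre-fix behavior: an invalid-but-applied lease
// never triggered the eager enqueue, which is the gap seen in cockroachdb#107862.
func oldShouldEnqueue(s leaseState) bool {
	return s.isIncomingLeaseholder && s.leaseValid && s.violatesPreferences
}

// newShouldEnqueue models the behavior after this commit: validity is no longer
// required, so any lease acquisition that violates preferences is enqueued.
func newShouldEnqueue(s leaseState) bool {
	return s.isIncomingLeaseholder && s.violatesPreferences
}

func main() {
	s := leaseState{isIncomingLeaseholder: true, leaseValid: false, violatesPreferences: true}
	fmt.Println("old check enqueues:", oldShouldEnqueue(s)) // false: the missed case
	fmt.Println("new check enqueues:", newShouldEnqueue(s)) // true
}
```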

Fixes: cockroachdb#107862
Release note: None
@kvoli kvoli force-pushed the backport23.1-107507 branch from 7ded956 to dc05a48 on August 3, 2023 13:28
@kvoli kvoli requested a review from erikgrinaker August 3, 2023 13:35
@kvoli (Collaborator, Author) commented Aug 3, 2023

I cherry-picked #107967 on top.

@erikgrinaker (Contributor) left a comment:

LGTM

@kvoli (Collaborator, Author) commented Aug 3, 2023

I'm going to merge this tomorrow, barring any test failures overnight on master.

@andrewbaptist (Collaborator) left a comment:

:lgtm: I would let this bake a little longer on master still, since there were new changes. These changes are a little hard to track down, and there is no rush since we aren't going to pull this into a 23.1 release for a few more days. Maybe merge on Monday? The impact of these types of changes is hard to predict, as there is often a lot of assumed behavior in both tests and existing code.

@kvoli (Collaborator, Author) commented Aug 4, 2023

> :lgtm: I would let this bake a little longer on master still, since there were new changes. These changes are a little hard to track down, and there is no rush since we aren't going to pull this into a 23.1 release for a few more days. Maybe merge on Monday? The impact of these types of changes is hard to predict, as there is often a lot of assumed behavior in both tests and existing code.

I'm thinking there is some benefit to getting more bake time on 23.1 as well, but I want to take the cautious approach, so merging on Monday SGTM!

Thanks for the reviews, and helping this along. Much appreciated!

@kvoli kvoli merged commit 4bc18b6 into cockroachdb:release-23.1 Aug 7, 2023