release-22.2.0: kv: reacquire proscribed leases on drain, then transfer #90202

Merged


@blathers-crl blathers-crl bot commented Oct 18, 2022

Backport 1/1 commits from #90106 on behalf of @nvanbenschoten.

/cc @cockroachdb/release


Fixes #83372.
Fixes #90022.
Fixes #89963.
Fixes #89962.

This commit instructs stores to reacquire proscribed leases when draining in order to subsequently transfer them away. This addresses a source of flakiness in `transfer-lease` roachtests where some leases would not be transferred away before the drain completed. This could result in range unavailability for up to 9 seconds while other replicas waited out the lease's expiration. This is because only the previous leaseholder knows that a proscribed lease is invalid. All other replicas still consider the lease to be valid.
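To illustrate the asymmetry described above, here is a minimal sketch of how the same lease can look proscribed to its holder and valid to everyone else. This is not the code in this PR: `Lease`, `Replica`, `MinProposedAt`, and `Status` are hypothetical names, and timestamps are plain integers rather than HLC timestamps.

```go
package main

// LeaseStatus is a simplified, hypothetical view of a lease from one replica.
type LeaseStatus int

const (
	Valid LeaseStatus = iota
	Proscribed
	Expired
)

// Lease is a stripped-down, hypothetical lease record.
type Lease struct {
	Holder     int   // store ID of the leaseholder
	ProposedAt int64 // logical time at which the lease was proposed
	Expiration int64 // logical time at which the lease expires
}

// Replica is a hypothetical replica. MinProposedAt is local state only:
// the replica refuses to use its own leases proposed before this time.
type Replica struct {
	StoreID       int
	MinProposedAt int64
}

// Status reports how this replica views the lease at time now.
// Only the holder consults MinProposedAt, so only the holder can
// see the lease as proscribed. Other replicas wait for expiration.
func (r Replica) Status(l Lease, now int64) LeaseStatus {
	if now >= l.Expiration {
		return Expired
	}
	if r.StoreID == l.Holder && l.ProposedAt < r.MinProposedAt {
		return Proscribed
	}
	return Valid
}
```

In this model, a follower keeps redirecting requests to the old holder until `Expiration` passes, which corresponds to the unavailability window mentioned above.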

This failure mode was always present if a lease transfer failed during a drain. However, it became more likely with 034611b. With that change, we began rejecting lease transfers that were deemed to be "unsafe" more frequently. 034611b introduced a best-effort, graceful version of this check and an airtight, ungraceful version of the check. The former performs the check before revoking the outgoing leaseholder's lease while the latter performs the check after revoking the outgoing leaseholder's lease. In rare cases, it was possible to hit the airtight, ungraceful check and cause the lease to be proscribed. See #83261 (comment) for more details on how this led to test flakiness in the transfer-lease roachtest suite.
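The interaction between the two checks and the drain-time fix can be sketched as a small state machine. This is a simplified model under stated assumptions, not the actual implementation: `rangeLease`, `transfer`, `drainLease`, and the two boolean check results are hypothetical, and "reacquire" is reduced to flipping the state back to valid.

```go
package main

import "errors"

type leaseState int

const (
	leaseValid leaseState = iota
	leaseProscribed
	leaseTransferred
)

// rangeLease is a hypothetical holder-side view of one range's lease.
type rangeLease struct {
	state leaseState
}

var errUnsafeTransfer = errors.New("lease transfer target deemed unsafe")

// transfer models the two checks. safeBeforeRevoke stands in for the
// best-effort, graceful check; safeAfterRevoke stands in for the
// airtight, ungraceful check that runs after the lease is revoked.
func transfer(l *rangeLease, safeBeforeRevoke, safeAfterRevoke bool) error {
	if l.state != leaseValid {
		return errors.New("cannot transfer a lease that is not valid")
	}
	if !safeBeforeRevoke {
		// Graceful rejection: the lease is untouched and still usable.
		return errUnsafeTransfer
	}
	// Revoke the outgoing lease before handing it over.
	l.state = leaseProscribed
	if !safeAfterRevoke {
		// Ungraceful rejection: the lease stays proscribed.
		return errUnsafeTransfer
	}
	l.state = leaseTransferred
	return nil
}

// drainLease models the behavior described in this PR: a proscribed
// lease is first reacquired so that it can then be transferred away.
func drainLease(l *rangeLease, safeBeforeRevoke, safeAfterRevoke bool) error {
	if l.state == leaseProscribed {
		l.state = leaseValid // reacquire (assumed to succeed in this sketch)
	}
	return transfer(l, safeBeforeRevoke, safeAfterRevoke)
}
```

Without the reacquire step in `drainLease`, a lease left in `leaseProscribed` by a failed transfer could not be moved before the drain completed.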

Release notes: None.

Release justification: Avoids GA-blocking roachtest failures.


@blathers-crl blathers-crl bot requested a review from a team as a code owner October 18, 2022 22:01
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-22.2.0-90106 branch from d5860f4 to 06bbac9 Compare October 18, 2022 22:01

blathers-crl bot commented Oct 18, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.

If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.

  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Oct 18, 2022
@cockroach-teamcity

This change is Reviewable

@nvanbenschoten nvanbenschoten merged commit 0ac7e4e into release-22.2.0 Oct 21, 2022
@nvanbenschoten nvanbenschoten deleted the blathers/backport-release-22.2.0-90106 branch October 21, 2022 18:59