kvserver: always transfer expiration-based leases #85629

Merged

Conversation


@irfansharif irfansharif commented Aug 4, 2022

Fixes #81764.

In addition to ranges that unconditionally require expiration-based
leases (node liveness and earlier), we now also use them during lease
transfers for all other ranges. After acquiring such expiration-based
leases, the leaseholders are expected to soon upgrade them to the more
efficient epoch-based ones. By transferring an expiration-based lease,
we can limit the effect of an ill-advised lease transfer since the incoming
leaseholder needs to recognize itself as such within a few
seconds; if it doesn't (we accidentally sent the lease to a replica in
need of a snapshot), the lease is up for grabs. If we simply transferred
epoch-based leases, it would be possible for the new leaseholder in need
of a snapshot to maintain its lease if the node it was on is able to
heartbeat its liveness record.

Release note: None.
Release justification: stability work
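The transfer-then-upgrade flow described above can be modeled with a small, self-contained sketch. The constant names echo roachpb.LeaseExpiration/LeaseEpoch, but everything here is simplified and illustrative, not CockroachDB's actual API:

```go
package main

import "fmt"

// LeaseType models the two lease flavors discussed above. Names are
// illustrative; the real types live in pkg/roachpb.
type LeaseType int

const (
	LeaseExpiration LeaseType = iota // short-lived, must be actively renewed
	LeaseEpoch                       // tied to node-liveness heartbeats
)

func (t LeaseType) String() string {
	if t == LeaseExpiration {
		return "expiration"
	}
	return "epoch"
}

// transferLeaseType models the rule this PR introduces: lease transfers
// always hand out an expiration-based lease, regardless of the range.
func transferLeaseType() LeaseType { return LeaseExpiration }

// upgradedLeaseType models the post-transfer step on the recipient:
// ranges that don't require expiration-based leases (everything after
// the node-liveness range) promote to the more efficient epoch lease.
func upgradedLeaseType(requiresExpirationLease bool) LeaseType {
	if requiresExpirationLease {
		return LeaseExpiration
	}
	return LeaseEpoch
}

func main() {
	fmt.Printf("transferred lease: %s\n", transferLeaseType())
	fmt.Printf("after upgrade (regular range): %s\n", upgradedLeaseType(false))
	fmt.Printf("after upgrade (liveness range): %s\n", upgradedLeaseType(true))
}
```

The point of the two-step dance: an expiration lease left unrenewed simply lapses, so a botched transfer self-corrects within seconds instead of being propped up by liveness heartbeats.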

@irfansharif irfansharif requested a review from a team as a code owner August 4, 2022 18:12

@irfansharif irfansharif marked this pull request as draft August 4, 2022 18:13
@irfansharif irfansharif force-pushed the 220802.expiration-based-lease-transfer branch from 1fa26f1 to ee3d18f on August 5, 2022 19:54
@irfansharif irfansharif changed the title [wip] kvserver: always transfer expiration based leases … kvserver: always transfer expiration-based leases Aug 5, 2022
@irfansharif irfansharif marked this pull request as ready for review August 5, 2022 19:54
@irfansharif irfansharif force-pushed the 220802.expiration-based-lease-transfer branch 4 times, most recently from 62a6a38 to b47f808 on August 6, 2022 04:10
@nvanbenschoten nvanbenschoten left a comment

Reviewed 5 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


pkg/kv/kvserver/client_split_test.go line 2751 at r1 (raw file):

	// Increment the manual clock and do a write to increase the qps above zero.
	manualClock.Increment(int64(replicastats.MinStatsDuration))
	testutils.SucceedsSoon(t, func() error {

Why is this now needed?


pkg/kv/kvserver/replica_proposal.go line 351 at r1 (raw file):

	// in serializability violations.
	r.mu.state.Lease = newLease
	requiresExpirationBasedLease := r.requiresExpiringLeaseRLocked()

It's odd that this is checking an attribute of the range instead of just checking whether the newly acquired lease is expiration-based or not. Could this be newLease.Type() == roachpb.LeaseExpiration? Doing that would enqueue the lease to be renewed.


pkg/kv/kvserver/replica_range_lease.go line 258 at r1 (raw file):

		// more efficient epoch-based ones. But by transferring an
		// expiration-based lease, we can limit the effect of an ill-advised
		// lease transfer since the in coming leaseholder needs to recognize

s/in coming/incoming/


pkg/kv/kvserver/replica_range_lease.go line 260 at r1 (raw file):

		// lease transfer since the in coming leaseholder needs to recognize
		// itself as such within a few seconds; if it doesn't (we accidentally
		// sent the lease to a replica in need of a snapshot), the lease is up

"in need of a snapshot or far behind on its log"


pkg/kv/kvserver/replica_range_lease.go line 262 at r1 (raw file):

		// sent the lease to a replica in need of a snapshot), the lease is up
		// for grabs. If we simply transferred epoch based leases, it's possible
		// for the new leaseholder in need of a snapshot to maintain its lease

s/in need of a snapshot/that is delayed in applying the lease transfer/


pkg/kv/kvserver/replica_range_lease.go line 1083 at r1 (raw file):

) error {
	var leaseRenewal time.Duration
	if r.leaseStatusAtRLocked(ctx, now).Lease.Type() == roachpb.LeaseExpiration {

I don't think we want to call leaseStatusAtRLocked again. We could pass the lease in from the caller, but it may be invalid.

I think we should actually keep the logic as it was. This is a sanity check that avoids an infinite lease acquisition loop if a request is so far in the future that it wouldn't be satisfiable even after the current lease was renewed. We know that after a lease transfer, a !requiresExpiringLeaseRLocked() range will upgrade to an epoch-based lease on renewal.


pkg/kv/kvserver/replica_rangefeed_test.go line 1287 at r1 (raw file):

	// Run up the clock to upgrade the expiration based lease to an epoch based
	// one.

Consider mentioning that this is needed because the test intends to expire the lease by pausing node liveness.

@irfansharif irfansharif force-pushed the 220802.expiration-based-lease-transfer branch from b47f808 to b1a0828 on August 10, 2022 21:49
@irfansharif irfansharif left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/client_split_test.go line 2751 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why is this now needed?

Flaky now with the following; the proposal itself observes a lease upgrade underneath it:

    client_split_test.go:2753: [NotLeaseHolderError] stale proposal: command was proposed under lease #2 but is being applied under lease: repl=(n2,s2):2 seq=3 start=1660158045.473318000,1 epo=1 pro=1660158050.473837000,0; r52: replica not lease holder; current lease is repl=(n2,s2):2 seq=3 start=1660158045.473318000,1 epo=1 pro=1660158050.473837000,0

pkg/kv/kvserver/replica_proposal.go line 351 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's odd that this is checking an attribute of the range instead of just checking whether the newly acquired lease is expiration-based or not. Could this be newLease.Type() == roachpb.LeaseExpiration? Doing that would enqueue the lease to be renewed

Wasn't sure here, figured if we didn't, we'd still renew (+ upgrade) on the first request the replica received. But I've added this tracking (and subsequent untracking once the lease does in fact change to an epoch based one). I also added a check in TestLeaseTransfersUseExpirationLeasesAndBumpToEpochBasedOnes to make sure the untracking (and consequently, tracking) occurs after a renewal attempt. PTAL.


pkg/kv/kvserver/replica_range_lease.go line 1083 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think we want to call leaseStatusAtRLocked again. We could pass the lease in from the caller, but it may be invalid.

I think we should actually keep the logic as it was. This is a sanity check that avoids an infinite lease acquisition loop if a request is so far in the future that it wouldn't be satisfiable even after the current lease was renewed. We know that after a lease transfer, a !requiresExpiringLeaseRLocked() range will upgrade to an epoch-based lease on renewal.

Reverted. I understand what this check is for, so keeping it as is is fine, but could you explain why we don't like peeking into the current lease type -- i.e. why's leaseStatusAtRLocked inappropriate here? (It didn't look that expensive.)

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 6 files at r1, 5 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


pkg/kv/kvserver/client_split_test.go line 2751 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Flaky now with the following; the proposal itself observes a lease upgrade underneath it:

    client_split_test.go:2753: [NotLeaseHolderError] stale proposal: command was proposed under lease #2 but is being applied under lease: repl=(n2,s2):2 seq=3 start=1660158045.473318000,1 epo=1 pro=1660158050.473837000,0; r52: replica not lease holder; current lease is repl=(n2,s2):2 seq=3 start=1660158045.473318000,1 epo=1 pro=1660158050.473837000,0

Ah, I see. Because we're passing through the store's TestSender and not a real DistSender, NotLeaseHolderErrors aren't retried.


pkg/kv/kvserver/replica_range_lease.go line 1083 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Reverted. I understand what this check is for, so keeping it as is is fine, but could you explain why we don't like peeking into the current lease type -- i.e. why's leaseStatusAtRLocked inappropriate here? (It didn't look that expensive.)

I was thinking that it would require an additional mutex acquisition and risk inconsistency with the previous call, but I see now that we already hold the RLock.

The other reason is that the lease might not even be valid at this point, so we can't trust the lease status.


pkg/kv/kvserver/store.go line 2250 at r2 (raw file):

					}
					s.renewableLeases.Delete(k)
				} else if leaseStatus.Lease.Type() == roachpb.LeaseEpoch {

Good catch!

Now I'm confused about what happens when a range in this map is destroyed and removed from a node. Is it also removed from the map? I don't see where that happens. Do we rely on the next call to redirectOnOrAcquireLease returning an error and that causing us to hit the s.renewableLeases.Delete(k) path above?

Maybe we don't see issues with this in practice because a leaseholder will never remove itself, and we'll usually hit a NotLeaseHolderError above when we periodically try to renew the lease before we're removed.

@irfansharif irfansharif left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/store.go line 2250 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Good catch!

Now I'm confused about what happens when a range in this map is destroyed and removed from a node. Is it also removed from the map? I don't see where that happens. Do we rely on the next call to redirectOnOrAcquireLease returning an error and that causing us to hit the s.renewableLeases.Delete(k) path above?

Maybe we don't see issues with this in practice because a leaseholder will never remove itself, and we'll usually hit a NotLeaseHolderError above when we periodically try to renew the lease before we're removed.

This all seems very fragile (but it has been this way since '18, so maybe not). What do you want to do here? With this PR specifically, we're increasing the likelihood that this map contains a replica that's been destroyed and removed from the store. So we could just not track these replicas-temporarily-using-expiration-leases in this map for now, until it's made more robust. We're relying on a timely incoming request to renew + upgrade the lease, I think. What happens if no such request comes in? Is the lease up for grabs until the next request is routed ~somewhere?

@irfansharif irfansharif commented

(Gentle nudge, @nvanbenschoten)

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


pkg/kv/kvserver/store.go line 2250 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This all seems very fragile (but it has been this way since '18, so maybe not). What do you want to do here? With this PR specifically, we're increasing the likelihood that this map contains a replica that's been destroyed and removed from the store. So we could just not track these replicas-temporarily-using-expiration-leases in this map for now, until it's made more robust. We're relying on a timely incoming request to renew + upgrade the lease, I think. What happens if no such request comes in? Is the lease up for grabs until the next request is routed ~somewhere?

I think we can add s.renewableLeases.Delete(rangeID) to Store.unlinkReplicaByRangeIDLocked and then feel better about this. We should probably do that.

However, I am now a little concerned about what happens when this map gets too large (e.g. after a flood of new leases). Could that prevent a timely extension of an important lease like node liveness?

What do you think about skipping this map and promoting the lease from expiration to epoch immediately upon its reception? For instance, we could call maybeExtendLeaseAsync (or a new maybeExtendLeaseAsyncLocked) directly in leasePostApplyLocked when we have an expiration-based lease but !requiresExpiringLeaseRLocked.
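The cleanup proposed here can be sketched with a self-contained sync.Map model of the store's renewal tracking. The field and method names below are simplified stand-ins; the real renewableLeases map stores replica pointers keyed by range ID:

```go
package main

import (
	"fmt"
	"sync"
)

// store models the store-level map discussed above: range IDs whose
// (expiration-based) leases a background loop periodically renews.
// sync.Map mirrors the real field's concurrent-access pattern.
type store struct {
	renewableLeases sync.Map // rangeID (int64) -> struct{}
}

// trackForRenewal registers a range with the renewal loop.
func (s *store) trackForRenewal(rangeID int64) {
	s.renewableLeases.Store(rangeID, struct{}{})
}

// unlinkReplica models the proposed fix: when a replica is destroyed
// and removed from the store, drop it from the renewal map so the
// background loop never touches a dead replica.
func (s *store) unlinkReplica(rangeID int64) {
	s.renewableLeases.Delete(rangeID)
}

// tracked reports whether the renewal loop would still visit the range.
func (s *store) tracked(rangeID int64) bool {
	_, ok := s.renewableLeases.Load(rangeID)
	return ok
}

func main() {
	s := &store{}
	s.trackForRenewal(42)
	fmt.Println(s.tracked(42)) // true
	s.unlinkReplica(42)        // replica destroyed: clean up tracking
	fmt.Println(s.tracked(42)) // false
}
```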

@irfansharif irfansharif force-pushed the 220802.expiration-based-lease-transfer branch from b1a0828 to e521bc2 on August 19, 2022 16:07
@irfansharif irfansharif left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/store.go line 2250 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we can add s.renewableLeases.Delete(rangeID) to Store.unlinkReplicaByRangeIDLocked and then feel better about this. We should probably do that.

However, I am now a little concerned about what happens when this map gets too large (e.g. after a flood of new leases). Could that prevent a timely extension of an important lease like node liveness?

What do you think about skipping this map and promoting the lease from expiration to epoch immediately upon its reception? For instance, we could call maybeExtendLeaseAsync (or a new maybeExtendLeaseAsyncLocked) directly in leasePostApplyLocked when we have an expiration-based lease but !requiresExpiringLeaseRLocked.

That makes more sense; I'd prefer not mucking with this map, for all the reasons listed above. Upgrading the lease immediately after applying it, instead of waiting for the first request to the new leaseholder, also makes sense, so I've done that now -- PTAL.

I'll send the renewableLeases map cleanup in a follow-up PR. It's trivial, but I'm being paranoid about getting just this one change in to start. That map holds at most a few keys today, since only ranges at or before liveness are tracked in there, so it's not an immediate concern.

@irfansharif irfansharif force-pushed the 220802.expiration-based-lease-transfer branch from e521bc2 to 5dd866d on August 19, 2022 16:19
@nvanbenschoten nvanbenschoten left a comment

Reviewed 8 of 8 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/kv/kvserver/replica_proposal.go line 361 at r3 (raw file):

	// before it expires.
	if leaseChangingHands && iAmTheLeaseHolder && expirationBasedLease && r.ownsValidLeaseRLocked(ctx, now) {
		r.store.renewableLeases.Store(int64(r.RangeID), unsafe.Pointer(r))

Did we want to get rid of this?


pkg/kv/kvserver/replica_proposal.go line 368 at r3 (raw file):

	}

	if hasExpirationBasedLease && !requiresExpirationBasedLease {

Should this also have an && r.ownsValidLeaseRLocked(ctx, now) condition, to guard against the case where the lease is applied after it has expired?

Also, does it need an iAmTheLeaseHolder condition? The other two branches have that in addition to ownsValidLeaseRLocked, though that feels redundant.

@irfansharif irfansharif force-pushed the 220802.expiration-based-lease-transfer branch from 5dd866d to 463258d on August 19, 2022 17:37
@irfansharif irfansharif left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_proposal.go line 361 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did we want to get rid of this?

Gah, no. Reverted. Sorry for being sloppy.


pkg/kv/kvserver/replica_proposal.go line 368 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this also have an && r.ownsValidLeaseRLocked(ctx, now) condition, to guard against the case where the lease is applied after it has expired?

Also, does it need an iAmTheLeaseHolder condition? The other two branches have that in addition to ownsValidLeaseRLocked, though that feels redundant.

Added the expiration check + a comment. The iAmTheLeaseHolder condition felt redundant for that same reason. I'll keep the conditional as is, but I've added a test-only assertion to check whether it is in fact redundant.

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/kv/kvserver/replica_proposal.go line 371 at r4 (raw file):

	}

	if hasExpirationBasedLease && !requiresExpirationBasedLease && r.ownsValidLeaseRLocked(ctx, now) {

nit: depending on whether you want another pass through CI, it might be easier to understand this structure if it was written as:

if leaseChangingHands && iAmTheLeaseHolder && hasExpirationBasedLease && r.ownsValidLeaseRLocked(ctx, now) {
    if requiresExpirationBasedLease {
        r.store.renewableLeases.Store(int64(r.RangeID), unsafe.Pointer(r))
        ...
    } else {
        ...
        r.maybeExtendLeaseAsyncLocked(ctx, st)
    }
}

@irfansharif irfansharif force-pushed the 220802.expiration-based-lease-transfer branch from 463258d to fa1b387 on August 20, 2022 05:13
@irfansharif irfansharif left a comment

TFTR! Will bors on green.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_proposal.go line 371 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: depending on whether you want another pass through CI, it might be easier to understand this structure if it was written as:

if leaseChangingHands && iAmTheLeaseHolder && hasExpirationBasedLease && r.ownsValidLeaseRLocked(ctx, now) {
    if requiresExpirationBasedLease {
        r.store.renewableLeases.Store(int64(r.RangeID), unsafe.Pointer(r))
        ...
    } else {
        ...
        r.maybeExtendLeaseAsyncLocked(ctx, st)
    }
}

Done. This skewed with a test Erik added in #84865; it needed to be updated to similarly run up a clock to upgrade a transferred expiration lease + increment liveness epochs in a more robust way.

Fixes cockroachdb#81764.

In addition to ranges that unconditionally require expiration-based
leases (node liveness and earlier), we also use them during lease
transfers for all other ranges. After acquiring such expiration-based
leases, the leaseholders are expected to soon upgrade them to the more
efficient epoch-based ones. By transferring an expiration-based lease,
we can limit the effect of an ill-advised lease transfer since the
incoming leaseholder needs to recognize itself as such within a few
seconds; if it doesn't (we accidentally sent the lease to a replica in
need of a snapshot), the lease is up for grabs. If we simply transferred
epoch-based leases, it would be possible for the new leaseholder in need
of a snapshot to maintain its lease if the node it was on is able to
heartbeat its liveness record.

Release note: None
Release justification: low risk, high benefit change
@irfansharif irfansharif commented

bors r+


craig bot commented Aug 20, 2022

Build succeeded:

@craig craig bot merged commit 0c9ac43 into cockroachdb:master Aug 20, 2022
@irfansharif irfansharif deleted the 220802.expiration-based-lease-transfer branch August 20, 2022 23:48
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Aug 23, 2022
Accidentally included as part of cockroachdb#85629.

Release note: None
Release justification: logging fix
tbg added a commit to tbg/cockroach that referenced this pull request Aug 23, 2022
Randomly saw this. It was accidentally introduced in cockroachdb#85629.

Release justification: low-risk logging fix
Release note: None
craig bot pushed a commit that referenced this pull request Aug 23, 2022
86644: kvserver: remove accidentally committed debug logging r=erikgrinaker a=tbg

Randomly saw this. It was accidentally introduced in #85629.

Release justification: low-risk logging fix
Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 20, 2022
In cockroachdb#85629 we changed our lease transfer protocol to only ever
transfer expiration-based leases, and have recipients later upgrade
them to the more efficient epoch-based ones. This was done to limit
the effects of ill-advised lease transfers, since the incoming
leaseholder would need to recognize itself as such within a few
seconds -- so we wanted this upgrade to happen after having received
the lease.

In cockroachdb#83261 however we noticed that the upgrade was not immediate -- we
were waiting until the current lease's expiration was within its renewal
duration -- 4.5s. When the lease was eventually renewed the upgrade did
happen, but it was not immediate. We fix this here and remove the manual
clock advancing that the supporting test had, which masked this issue.
It now demonstrates that we're no longer relying on upgrades happening
as part of the (slow) renewal process.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 20, 2022
craig bot pushed a commit that referenced this pull request Sep 20, 2022
85859: pgwire: Add support for cursors with special characters r=knz,rafiss a=rimadeodhar

In CockroachDB and Postgres, it is possible to declare
cursors with special characters enclosed within double
quotes, for e.g. "1-2-3". Currently, we store the name
as an unescaped string which causes problems during the
pgwire DESCRIBE step for looking up the cursor. We should
be storing using the tree.Name datatype for the cursor name
while storing and looking up cursors. This PR updates the code
to start using tree.Name instead of raw strings for handling
cursor names. This fixes the issue where the pgwire DESCRIBE
step fails while attempting to look up cursors with names
containing special characters.

Resolves #84261

Release note (bug fix): The pgwire DESCRIBE step no longer
fails with an error while attempting to look up cursors
declared with names containing special characters.

86492: bulk: perform meta lookup on range cache miss during index backfill r=dt a=nvanbenschoten

Fixes #84290.

This commit addresses the sustained slowdown described in #84290 by replacing
the call in `SSTBatcher.flushIfNeeded` to `RangeCache.GetCached` with a call to
`RangeCache.Lookup`. The former method checks the cache but returns no range
descriptor if the cache currently has no overlapping key. This is possible if
the descriptor was recently evicted because it was stale. The later method
performs a meta lookup if the cache currently has no overlapping key, so it
should always return _some_ range descriptor.

There's a question of whether we should be logging a warning but proceeding if
this meta lookup fails. For now, I've decided not to change that behavior.

Release justification: None. Don't merge yet.

87885: kv: remove broken attempt to reject lease acquisitions on draining nodes r=nvanbenschoten a=nvanbenschoten

Related to #83261.

This commit removes "protection" that avoided lease acquisitions on draining nodes. This protection had already been effectively disabled by acc1ad1, which allowed Raft leaders to bypass the check. As the comment here (added in 5ffaa9e) explained, Raft followers are already unable to acquire the lease. If leaders bypass the check and follower (and candidates) don't need it, the check is useless, so we remove it.

The commit also removes `TestReplicaDrainLease`, which was supposed to test this behavior. We remove the test not because it started failing after the change, but because it did not. It must not have been testing anything real anymore after acc1ad1.

Release justification: low risk change related to release blocker.

Release note: None

88205: kvserver: (partially) deflake transfer-leases/drain-other-node r=irfansharif a=irfansharif

In #85629 we changed our lease transfer protocol to only ever transfer expiration-based leases, and have recipients later upgrade them to the more efficient epoch-based ones. This was done to limit the effects of ill-advised lease transfers, since the incoming leaseholder would need to recognize itself as such within a few seconds -- so we wanted this upgrade to happen after having received the lease.

In #83261 however we noticed that the upgrade was not immediate -- we were waiting until the current lease's expiration was within its renewal duration -- 4.5s. When the lease was eventually renewed the upgrade did happen, but it was not immediate. We fix this here and remove the manual clock advancing that the supporting test had, which masked this issue. It now demonstrates that we're no longer relying on upgrades happening as part of the (slow) renewal process.

Release note: None

Co-authored-by: rimadeodhar <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Sep 21, 2022
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this pull request Apr 19, 2023
Starting a lease as an epoch lease is risky. The problem is the lease
may be marked as transferred to the new node, but the node can take time
to realize it is the leaseholder. This was partially addressed as part
of cockroachdb#85629; however, there were still cases where this can occur.
Specifically in the cases of non-cooperative lease acquisition. While
we don't have an example of this causing a problem because the range is
likely idle at this point, it is certainly possible to make a contrived
case where this happens.

Epic: none

Release note: None
Successfully merging this pull request may close these issues:

kv: always transfer expiration-based leases during lease transfers