-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable #98116
kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable #98116
Conversation
Would you speculate that this is PR is good improvement regardless? I.e. even if we can’t observe that it’s causal for a particular issue, we believe this is a good change. |
15e3672
to
e96ce3a
Compare
I think this is a good improvement regardless, but I would still like to see a demonstration that this resolves the referenced issue. I have that running now, and we should know in about 24h. If that succeeds, I'll mark this PR as "ready for review". |
I killed the test after about 18 hours because it was OOMing due to manifest size issues. I believe DR is aware of this issue. See https://cockroachlabs.slack.com/archives/C2C5FKPPB/p1678239941563699?thread_ts=1678148317.150499&cid=C2C5FKPPB. I'll open this for review @shralex, as there's evidence that it resolved the issue. |
pkg/kv/kvserver/markers.go
Outdated
@@ -106,3 +107,14 @@ var errMarkLeaseTransferRejectedBecauseTargetMayNeedSnapshot = errors.New( | |||
func IsLeaseTransferRejectedBecauseTargetMayNeedSnapshotError(err error) bool { | |||
return errors.Is(err, errMarkLeaseTransferRejectedBecauseTargetMayNeedSnapshot) | |||
} | |||
|
|||
// IsLeaseTransferRejectedBecauseTargetCannotReceiveLease detects whether an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it a bit more concise, e.g.,
IsLeaseTransferRejectedBecauseTargetCannotReceiveLease returns true if err (assumed to have been emitted by the current leaseholder when processing a lease transfer request) indicates that the target replica is not qualified to receive the lease. This could be because the current leaseholder has an outdated view of the target replica's state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/kv/kvserver/markers.go
Outdated
// that the lease transfer failed because the current leaseholder determined | ||
// that the lease transfer target cannot is not qualified to receive the lease. | ||
// This can be an indication that the current leaseholder has an outdated view | ||
// of the lease transfer target's state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be an indication that the current leaseholder has an outdated view
What could be the other reasons for this error ? In particular, is there a situation where the error indicates a problem that isn't transient, so we shouldn't be retrying ? for example, could it be that the state of the current leaseholder is up-to-date and the target replica was already removed ? or that ErrReplicaCannotHoldLease legitimately represents an error ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of any case where the error should not lead to the replication change being retried. Note that this doesn't necessarily mean that the lease transfer should be retried, but that maybeLeaveAtomicChangeReplicas
should be called again and the caller should recompute whether a lease transfer is needed. If the caller again finds a VOTER_INCOMING in the descriptor, it should try the lease transfer again.
…etryable Fixes cockroachdb#96746. Fixes cockroachdb#100379. This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation: 1. issue AdminSplit 2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas) 3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4(a). lease transfer request sent to replica that has not yet applied the replication change that added the voter_incoming to the range 4(b). lease transfer request delayed and delivered after voter_incoming has been transferred the lease, added to the range, then removed from the range. In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds. Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with a `unable to find store` error.
e96ce3a
to
c3519fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 9 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
bors r=shralex |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from c3519fe to blathers/backport-release-22.2-98116: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Fixes #96746.
Fixes #100379.
This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in
Replica.executeAdminCommandWithDescriptor
.This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:
replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has
been transferred the lease, added to the range, then removed from the
range.
In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds.
Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with a
unable to find store
error.