Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable #98116

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Mar 7, 2023

Fixes #96746.
Fixes #100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in Replica.executeAdminCommandWithDescriptor.

This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:

  1. issue AdminSplit
  2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
  3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4(a). lease transfer request sent to replica that has not yet applied the
    replication change that added the voter_incoming to the range
    4(b). lease transfer request delayed and delivered after voter_incoming has
    been transferred the lease, added to the range, then removed from the
    range.

In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with a unable to find store error.

@nvanbenschoten nvanbenschoten requested a review from shralex March 7, 2023 03:17
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@shermanCRL
Copy link
Contributor

Would you speculate that this is PR is good improvement regardless? I.e. even if we can’t observe that it’s causal for a particular issue, we believe this is a good change.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/leaseTransferOnJointErr branch 3 times, most recently from 15e3672 to e96ce3a Compare March 7, 2023 21:41
@nvanbenschoten
Copy link
Member Author

I think this is a good improvement regardless, but I would still like to see a demonstration that this resolves the referenced issue. I have that running now, and we should know in about 24h. If that succeeds, I'll mark this PR as "ready for review".

@nvanbenschoten
Copy link
Member Author

I killed the test after about 18 hours because it was OOMing due to manifest size issues. I believe DR is aware of this issue. See https://cockroachlabs.slack.com/archives/C2C5FKPPB/p1678239941563699?thread_ts=1678148317.150499&cid=C2C5FKPPB.

I'll open this for review @shralex, as there's evidence that it resolved the issue.

@nvanbenschoten nvanbenschoten marked this pull request as ready for review March 8, 2023 14:34
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner March 8, 2023 14:34
@nvanbenschoten nvanbenschoten requested a review from a team March 8, 2023 14:34
@@ -106,3 +107,14 @@ var errMarkLeaseTransferRejectedBecauseTargetMayNeedSnapshot = errors.New(
func IsLeaseTransferRejectedBecauseTargetMayNeedSnapshotError(err error) bool {
return errors.Is(err, errMarkLeaseTransferRejectedBecauseTargetMayNeedSnapshot)
}

// IsLeaseTransferRejectedBecauseTargetCannotReceiveLease detects whether an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it a bit more concise, e.g.,

IsLeaseTransferRejectedBecauseTargetCannotReceiveLease returns true if err (assumed to have been emitted by the current leaseholder when processing a lease transfer request) indicates that the target replica is not qualified to receive the lease. This could be because the current leaseholder has an outdated view of the target replica's state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// that the lease transfer failed because the current leaseholder determined
// that the lease transfer target cannot is not qualified to receive the lease.
// This can be an indication that the current leaseholder has an outdated view
// of the lease transfer target's state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be an indication that the current leaseholder has an outdated view

What could be the other reasons for this error ? In particular, is there a situation where the error indicates a problem that isn't transient, so we shouldn't be retrying ? for example, could it be that the state of the current leaseholder is up-to-date and the target replica was already removed ? or that ErrReplicaCannotHoldLease legitimately represents an error ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of any case where the error should not lead to the replication change being retried. Note that this doesn't necessarily mean that the lease transfer should be retried, but that maybeLeaveAtomicChangeReplicas should be called again and the caller should recompute whether a lease transfer is needed. If the caller again finds a VOTER_INCOMING in the descriptor, it should try the lease transfer again.

…etryable

Fixes cockroachdb#96746.
Fixes cockroachdb#100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease
to be retryable replication change errors when thrown by lease transfer
requests. In doing so, these errors will be retried by the retry loop in
`Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral
replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the
      replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has
      been transferred the lease, added to the range, then removed from the
      range.

In either case, retrying the AdminSplit operation on these errors will
ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs
to fail with a `unable to find store` error.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/leaseTransferOnJointErr branch from e96ce3a to c3519fe Compare April 4, 2023 18:40
@nvanbenschoten nvanbenschoten added backport-22.2.x backport-23.1.x Flags PRs that need to be backported to 23.1 labels Apr 4, 2023
Copy link
Contributor

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 9 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@nvanbenschoten
Copy link
Member Author

bors r=shralex

@craig
Copy link
Contributor

craig bot commented Apr 7, 2023

Build succeeded:

@blathers-crl
Copy link

blathers-crl bot commented Apr 7, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from c3519fe to blathers/backport-release-22.2-98116: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: restore/tpce/32TB/aws/nodes=15/cpus=16 failed backupccl: restore fails when issuing an AdminSplit
4 participants