-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot #85405
kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot #85405
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)
-- commits
line 13 at r1:
Could be nice to have a test of this scenario or perhaps a simpler one (but I'll leave it to you to decide if one is needed)
pkg/kv/kvserver/replica_command.go
line 564 at r1 (raw file):
// On seeing a retryable replication change or an AmbiguousResultError, // retry the command with the updated descriptor. if !IsRetriableReplicationChangeError(lastErr) &&
Just wanted to double check whether the change for ConditionFailedError is safe to do here ? is it a retriable error ?
Code quote:
On seeing a ConditionFailedError or an AmbiguousResultError
@nvanbenschoten friendly ping on this! Any reason not to merge this before the 22.2 branchcut and backport to 22.1? |
…shot Informs cockroachdb#84635 Informs cockroachdb#84162 Fixes cockroachdb#85449. Fixes cockroachdb#83174. This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot` status to be a form of a retriable replication change error. It then hooks `Replica.executeAdminCommandWithDescriptor` up to consult this status in its retry loop. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation: 1. issue AdminSplit 2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas) 3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4. can’t transfer lease because doing so is deemed to be potentially unsafe Release note: None Release justification: Low risk.
db2d653
to
7778dd8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shralex I added a unit test here, along with a second commit that adds some improved testing through kvnemesis.
I also found that this resolves #85449 and #83174.
Please take another look.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @shralex)
Previously, shralex wrote…
Could be nice to have a test of this scenario or perhaps a simpler one (but I'll leave it to you to decide if one is needed)
Done.
pkg/kv/kvserver/replica_command.go
line 564 at r1 (raw file):
Previously, shralex wrote…
Just wanted to double check whether the change for ConditionFailedError is safe to do here ? is it a retriable error ?
The ConditionFailedError case is now considered by IsRetriableReplicationChangeError
. So this has the same effect for those types of errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the test, a cool use of filters!
Reviewed 1 of 2 files at r1, 4 of 4 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)
pkg/kv/kvnemesis/validator.go
line 498 at r3 (raw file):
if err := errorFromResult(t.Result); err != nil { ignore = kvserver.IsLeaseTransferRejectedBecauseTargetMayNeedSnapshotError(err) || // Only VOTER_FULL replicas can currently hold a range lease.
This explanation seems no longer correct.
…rror Related to cockroachdb#87121. This commit adds the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot` error status to the set of expected error types that can be returned to a `TransferLeaseOperation`. This commit, along with the previous, deflakes `TestKVNemesisMultiNode` when run with the following diff: ``` diff --git a/pkg/kv/kvserver/raftutil/util.go b/pkg/kv/kvserver/raftutil/util.go @@ -181,5 +183,8 @@ func ReplicaMayNeedSnapshot( // 2. that we do not perform a log truncation between now and when our action // goes into effect. In practice, this means serializing with Raft log // truncation operations using latching. + if rand.Intn(10) == 0 { + return ReplicaStateProbe + } return NoSnapshotNeeded } ```
7778dd8
to
6ee6501
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r=shralex
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @shralex)
pkg/kv/kvnemesis/validator.go
line 498 at r3 (raw file):
Previously, shralex wrote…
This explanation seems no longer correct.
Good catch. Done.
Build succeeded: |
Informs #84635.
Informs #84162.
Fixes #85449.
Fixes #83174.
This commit considers the
LeaseTransferRejectedBecauseTargetMayNeedSnapshot
status to be a form of a retriable replication change error. It then hooks
Replica.executeAdminCommandWithDescriptor
up to consult this status in itsretry loop.
This avoids spurious errors when a split gets blocked behind a lateral replica
move like we see in the following situation:
Release note: None
Release justification: Low risk, resolves flaky test.