Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot #85405

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Aug 1, 2022

Informs #84635.
Informs #84162.
Fixes #85449.
Fixes #83174.

This commit considers the LeaseTransferRejectedBecauseTargetMayNeedSnapshot
status to be a form of a retriable replication change error. It then hooks
Replica.executeAdminCommandWithDescriptor up to consult this status in its
retry loop.

This avoids spurious errors when a split gets blocked behind a lateral replica
move like we see in the following situation:

  1. issue AdminSplit
  2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
  3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
  4. can’t transfer lease because doing so is deemed to be potentially unsafe

Release note: None

Release justification: Low risk, resolves flaky test.

@nvanbenschoten nvanbenschoten requested a review from shralex August 1, 2022 14:23
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten marked this pull request as ready for review August 1, 2022 14:23
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner August 1, 2022 14:23
Copy link
Contributor

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)


-- commits line 13 at r1:
Could be nice to have a test of this scenario or perhaps a simpler one (but I'll leave it to you to decide if one is needed)


pkg/kv/kvserver/replica_command.go line 564 at r1 (raw file):

		// On seeing a retryable replication change or an AmbiguousResultError,
		// retry the command with the updated descriptor.
		if !IsRetriableReplicationChangeError(lastErr) &&

Just wanted to double check whether the change for ConditionFailedError is safe to do here ? is it a retriable error ?

Code quote:

On seeing a ConditionFailedError or an AmbiguousResultError

@msbutler
Copy link
Collaborator

msbutler commented Aug 24, 2022

@nvanbenschoten friendly ping on this! Any reason not to merge this before the 22.2 branchcut and backport to 22.1?

…shot

Informs cockroachdb#84635
Informs cockroachdb#84162
Fixes cockroachdb#85449.
Fixes cockroachdb#83174.

This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot`
status to be a form of a retriable replication change error. It then hooks
`Replica.executeAdminCommandWithDescriptor` up to consult this status in its
retry loop.

This avoids spurious errors when a split gets blocked behind a lateral replica
move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4. can’t transfer lease because doing so is deemed to be potentially unsafe

Release note: None

Release justification: Low risk.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/retrySplitAfterUnsafeLease branch from db2d653 to 7778dd8 Compare August 30, 2022 18:31
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shralex I added a unit test here, along with a second commit that adds some improved testing through kvnemesis.

I also found that this resolves #85449 and #83174.

Please take another look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shralex)


-- commits line 13 at r1:

Previously, shralex wrote…

Could be nice to have a test of this scenario or perhaps a simpler one (but I'll leave it to you to decide if one is needed)

Done.


pkg/kv/kvserver/replica_command.go line 564 at r1 (raw file):

Previously, shralex wrote…

Just wanted to double check whether the change for ConditionFailedError is safe to do here ? is it a retriable error ?

The ConditionFailedError case is now considered by IsRetriableReplicationChangeError. So this has the same effect for those types of errors.

Copy link
Contributor

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test, a cool use of filters!

Reviewed 1 of 2 files at r1, 4 of 4 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)


pkg/kv/kvnemesis/validator.go line 498 at r3 (raw file):

		if err := errorFromResult(t.Result); err != nil {
			ignore = kvserver.IsLeaseTransferRejectedBecauseTargetMayNeedSnapshotError(err) ||
				// Only VOTER_FULL replicas can currently hold a range lease.

This explanation seems no longer correct.

…rror

Related to cockroachdb#87121.

This commit adds the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot`
error status to the set of expected error types that can be returned to a
`TransferLeaseOperation`.

This commit, along with the previous, deflakes `TestKVNemesisMultiNode`
when run with the following diff:
```
diff --git a/pkg/kv/kvserver/raftutil/util.go b/pkg/kv/kvserver/raftutil/util.go
@@ -181,5 +183,8 @@ func ReplicaMayNeedSnapshot(
        // 2. that we do not perform a log truncation between now and when our action
        //    goes into effect. In practice, this means serializing with Raft log
        //    truncation operations using latching.
+       if rand.Intn(10) == 0 {
+               return ReplicaStateProbe
+       }
        return NoSnapshotNeeded
 }
```
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/retrySplitAfterUnsafeLease branch from 7778dd8 to 6ee6501 Compare August 31, 2022 17:18
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r=shralex

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shralex)


pkg/kv/kvnemesis/validator.go line 498 at r3 (raw file):

Previously, shralex wrote…

This explanation seems no longer correct.

Good catch. Done.

@craig
Copy link
Contributor

craig bot commented Aug 31, 2022

Build succeeded:

@craig craig bot merged commit da00c3a into cockroachdb:master Aug 31, 2022
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/retrySplitAfterUnsafeLease branch September 8, 2022 17:56
@msbutler

This comment was marked as resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: schemachange/during/kv failed roachtest: kv95/enc=false/nodes=4/ssds=8 failed
4 participants