Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: TestSnapshotsToDrainingNodes spins on "retryable" snapshot error for 45s #87337

Closed
nvanbenschoten opened this issue Sep 2, 2022 · 3 comments · Fixed by #75248
Closed
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. C-test-failure Broken test (automatically or manually discovered).

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Sep 2, 2022

In #77951, we saw that TestSnapshotsToDrainingNodes was one of the slowest tests in pkg/kv/kvserver. I saw the same thing in a recent CI run, where the test was so slow that it timed out and flaked.

Digging in, I noticed that the "store is draining" errors are marked with the errMarkSnapshotError marker. This causes it to be classified as a retryable error by IsRetriableReplicationChangeError. As a result, when TestCluster.changeReplicas sees this error, it spins in SucceedsSoonError for 45s (DefaultSucceedsSoonDuration):

if err := testutils.SucceedsSoonError(func() error {
tc.t.Helper()
var beforeDesc roachpb.RangeDescriptor
if err := tc.Servers[0].DB().GetProto(
ctx, keys.RangeDescriptorKey(startKey), &beforeDesc,
); err != nil {
return errors.Wrap(err, "range descriptor lookup error")
}
var err error
desc, err = tc.Servers[0].DB().AdminChangeReplicas(
ctx, startKey.AsRawKey(), beforeDesc, roachpb.MakeReplicationChanges(changeType, targets...),
)
if kvserver.IsRetriableReplicationChangeError(err) {
tc.t.Logf("encountered retriable replication change error: %v", err)
return err
}
// Don't return blindly - if this isn't an error we think is related to a
// replication error that we can retry, save the error to the outer scope
// and return nil.
returnErr = err
return nil
}); err != nil {
returnErr = err
}

I don't quite know what the right solution is here. We should look into this and find out. Is it ok for the "store is draining" error to be considered retryable? Do we need a form of permanent snapshot errors instead of considering all to be transient? Could this cause real consequences elsewhere (e.g. in the replicateQueue, where we consult isSnapshotError). If not, does the test need to be adjusted to avoid spinning?

Jira issue: CRDB-19280

@nvanbenschoten nvanbenschoten added C-test-failure Broken test (automatically or manually discovered). A-kv-distribution Relating to rebalancing and leasing. T-kv KV Team labels Sep 2, 2022
@aayushshah15
Copy link
Contributor

This is supposed to be fixed by #75248. @tbg any chance you might be able to push that over the finish line soon?

@tbg
Copy link
Member

tbg commented Sep 5, 2022

Urgh, yes, moved it into the Sep milestone and should get to it. Thanks for the heads up.

@blathers-crl
Copy link

blathers-crl bot commented Sep 12, 2022

cc @cockroachdb/replication

@craig craig bot closed this as completed in d2171af Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-test-failure Broken test (automatically or manually discovered).
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants