kv: `TestSnapshotsToDrainingNodes` spins on "retryable" snapshot error for 45s #87337

nvanbenschoten · 2022-09-02T16:35:52Z

In #77951, we saw that TestSnapshotsToDrainingNodes was one of the slowest tests in pkg/kv/kvserver. I saw the same thing in a recent CI run, where the test was so slow that it timed out and flaked.

Digging in, I noticed that the "store is draining" errors are marked with the errMarkSnapshotError marker. This causes them to be classified as retryable by IsRetriableReplicationChangeError. As a result, when TestCluster.changeReplicas sees this error, it spins in SucceedsSoonError for 45s (DefaultSucceedsSoonDuration):

cockroach/pkg/testutils/testcluster/testcluster.go, lines 736 to 759 in 49b6501:

    if err := testutils.SucceedsSoonError(func() error {
    	tc.t.Helper()
    	var beforeDesc roachpb.RangeDescriptor
    	if err := tc.Servers[0].DB().GetProto(
    		ctx, keys.RangeDescriptorKey(startKey), &beforeDesc,
    	); err != nil {
    		return errors.Wrap(err, "range descriptor lookup error")
    	}
    	var err error
    	desc, err = tc.Servers[0].DB().AdminChangeReplicas(
    		ctx, startKey.AsRawKey(), beforeDesc, roachpb.MakeReplicationChanges(changeType, targets...),
    	)
    	if kvserver.IsRetriableReplicationChangeError(err) {
    		tc.t.Logf("encountered retriable replication change error: %v", err)
    		return err
    	}
    	// Don't return blindly - if this isn't an error we think is related to a
    	// replication error that we can retry, save the error to the outer scope
    	// and return nil.
    	returnErr = err
    	return nil
    }); err != nil {
    	returnErr = err
    }

I don't quite know what the right solution is here. We should look into this and find out. Is it ok for the "store is draining" error to be considered retryable? Do we need a form of permanent snapshot errors instead of considering all to be transient? Could this cause real consequences elsewhere (e.g. in the replicateQueue, where we consult isSnapshotError)? If not, does the test need to be adjusted to avoid spinning?
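As one possible shape for the "permanent snapshot error" idea, here is a rough sketch of a retry loop that gives up as soon as it sees an error marked permanent. Everything in it is an assumption for illustration: `errPermanent`, `permanent`, `retryUntil`, and `demo` are hypothetical names, the timings are shortened so it runs quickly, and it is not how SucceedsSoonError or the replicateQueue is implemented.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errPermanent is a hypothetical marker for errors that retrying cannot fix.
var errPermanent = errors.New("permanent")

// permanent wraps err so that errors.Is(err, errPermanent) reports true.
func permanent(err error) error { return fmt.Errorf("%w: %v", errPermanent, err) }

// retryUntil calls fn until it returns nil, returns a permanent error, or the
// next attempt would start after the timeout. It reports how many attempts ran.
func retryUntil(timeout, backoff time.Duration, fn func() error) (attempts int, err error) {
	deadline := time.Now().Add(timeout)
	for {
		attempts++
		err = fn()
		if err == nil || errors.Is(err, errPermanent) {
			return attempts, err // success, or no point in retrying
		}
		if time.Now().Add(backoff).After(deadline) {
			return attempts, err // out of time
		}
		time.Sleep(backoff)
	}
}

// demo runs the loop against an always-transient and an always-permanent
// failure and returns the attempt counts for each.
func demo() (transientAttempts, permanentAttempts int) {
	transient := func() error { return errors.New("snapshot connection reset") }
	draining := func() error { return permanent(errors.New("store is draining")) }
	transientAttempts, _ = retryUntil(50*time.Millisecond, 5*time.Millisecond, transient)
	permanentAttempts, _ = retryUntil(50*time.Millisecond, 5*time.Millisecond, draining)
	return transientAttempts, permanentAttempts
}

func main() {
	t, p := demo()
	fmt.Printf("transient error: %d attempts; permanent error: %d attempt(s)\n", t, p)
}
```

With this kind of split, the draining case returns after a single attempt instead of consuming the whole retry budget, while other snapshot failures keep the existing retry behavior. Whether "store is draining" should actually count as permanent is exactly the open question above.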

Jira issue: CRDB-19280


aayushshah15 · 2022-09-02T17:35:08Z

This is supposed to be fixed by #75248. @tbg any chance you might be able to push that over the finish line soon?

tbg · 2022-09-05T12:45:36Z

Urgh, yes, moved it into the Sep milestone and should get to it. Thanks for the heads up.

blathers-crl · 2022-09-12T16:42:50Z

cc @cockroachdb/replication

nvanbenschoten added labels C-test-failure (Broken test, automatically or manually discovered), A-kv-distribution (Relating to rebalancing and leasing), and T-kv (KV Team) Sep 2, 2022

nvanbenschoten assigned aayushshah15 Sep 2, 2022

tbg mentioned this issue Sep 5, 2022

kvserver: use EncodedError in SnapshotResponse #75248

Merged

mwang1026 assigned tbg and unassigned aayushshah15 Sep 12, 2022

exalate-issue-sync bot added T-kv-replication and removed T-kv KV Team labels Sep 12, 2022

craig bot closed this as completed in d2171af Nov 3, 2022
