kv: TestSnapshotsToDrainingNodes spins on "retryable" snapshot error for 45s #87337
Labels:
A-kv-distribution: Relating to rebalancing and leasing.
C-test-failure: Broken test (automatically or manually discovered).

In #77951, we saw that TestSnapshotsToDrainingNodes was one of the slowest tests in pkg/kv/kvserver. I saw the same thing in a recent CI run, where the test was so slow that it timed out and flaked.

Digging in, I noticed that the "store is draining" errors are marked with the errMarkSnapshotError marker. This causes them to be classified as retryable errors by IsRetriableReplicationChangeError. As a result, when TestCluster.changeReplicas sees this error, it spins in SucceedsSoonError for 45s (DefaultSucceedsSoonDuration):

cockroach/pkg/testutils/testcluster/testcluster.go, lines 736 to 759 in 49b6501

I don't quite know what the right solution is here. We should look into this and find out. Is it ok for the "store is draining" error to be considered retryable? Do we need a form of permanent snapshot errors instead of considering all to be transient? Could this cause real consequences elsewhere (e.g. in the replicateQueue, where we consult isSnapshotError)? If not, does the test need to be adjusted to avoid spinning?

Jira issue: CRDB-19280