kvserver: de-flake TestReplicateQueueUpAndDownReplicateNonVoters #75212
Conversation
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @arulajmani, and @irfansharif)
pkg/kv/kvserver/replicate_queue_test.go, line 322 at r1 (raw file):
scratchRange.GetRangeID())
if err := row.Scan(&numNonVoters); err != nil {
	if testutils.IsError(err, "converting NULL to int is unsupported") {
nit: SELECT coalesce(max(array_length(non_voting_replicas, 1)),0) FROM crdb_internal.ranges_no_leases WHERE range_id=$1
then you don't have to handle the NULL result.
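For reference, the suggested form would look roughly like the following. This is only a sketch: the tc, scratchRange, and numNonVoters names are taken from the surrounding test context, and the exact wiring is an assumption. With coalesce(..., 0) the scan always receives an int, so the NULL-handling branch goes away.

```go
// Sketch, assuming the test's *testcluster.TestCluster (tc) and the scratch
// range descriptor are in scope. coalesce() folds the NULL produced by an
// empty non_voting_replicas array into 0, so Scan never sees a NULL.
var numNonVoters int
row := tc.ServerConn(0).QueryRow(
	`SELECT coalesce(max(array_length(non_voting_replicas, 1)), 0) `+
		`FROM crdb_internal.ranges_no_leases WHERE range_id = $1`,
	scratchRange.GetRangeID())
require.NoError(t, row.Scan(&numNonVoters))
```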
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @irfansharif)
Fixes cockroachdb#75135. This test asserted on span configs applying to a scratch range. When stressing, it appeared that sometimes we were not seeing the scratch range adopt the prescribed number of voters/non-voters. Staring at the test itself, we were only nudging the replication queues for the first node in the three-node test. It's possible for the scratch range to have been housed on a node other than the first; this commit makes it so that the test nudges queues on all nodes. For good measure, let's also ensure that the split queues process everything, ditto for the snapshot queues.

To repro:

    dev test pkg/kv/kvserver \
      -f TestReplicateQueueUpAndDownReplicateNonVoters \
      -v --show-logs --timeout 2m --stress

Release note: None

Force-pushed from f85d165 to ee1e9cd.
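The shape of the fix is roughly the following. This is a sketch modeled on helpers this test file already uses; the helper name and the specific queue method shown are assumptions, not the exact diff.

```go
// Sketch: visit every server's stores, not just node 1's, so the queue
// nudges reach whichever node actually houses the scratch range's replica.
func forceScanOnAllReplicationQueues(tc *testcluster.TestCluster) (err error) {
	for _, s := range tc.Servers {
		err = s.Stores().VisitStores(func(store *kvserver.Store) error {
			return store.ForceReplicationScanAndProcess()
		})
	}
	return err
}
```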
Thanks!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @arulajmani, and @tbg)
pkg/kv/kvserver/replicate_queue_test.go, line 322 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit:
SELECT coalesce(max(array_length(non_voting_replicas, 1)),0) FROM crdb_internal.ranges_no_leases WHERE range_id=$1
then you don't have to handle the NULL result.
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @irfansharif, and @tbg)
pkg/kv/kvserver/replicate_queue_test.go, line 319 at r2 (raw file):
`SELECT coalesce(max(array_length(non_voting_replicas, 1)),0) FROM crdb_internal.ranges_no_leases WHERE range_id=$1`,
	scratchRange.GetRangeID())
require.NoError(t, row.Scan(&numNonVoters))
The log.Warning in the previous version was intentional because this function is being called within a SucceedsSoon.
Maybe this should've been codified in a comment somewhere, but since there's only one voting replica in this test, the leaseholder can only be on that one voter. If this test was flakey before, I'd suspect it to be flakey now.
Do you mind addressing some of these comments before letting this merge? Halting the merge for now. bors r-
Canceled.
It's possible for the scratch range to have been housed on a node other than the first;
since there's only one voting replica in this test, the leaseholder can only be on that one voter
I'm again not following; these two things don't appear to contradict. It was possible for the leaseholder replica (also the voter) to be on a node other than the first; there are three nodes in the test. The test was previously forcing queue processing on just the first node, but now it does so on all three. That's at least what I was seeing when repro-ing/looking at the logs.
If this test was flakey before, I'd suspect it to be flakey now.
Possible, but I'm not sure if it'd be because of the flake that this patch was squashing. FWIW I did run a bunch of test runs before and after this commit to sanity check my claim.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @arulajmani, and @tbg)
pkg/kv/kvserver/replicate_queue_test.go, line 319 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
The log.Warning in the previous version was intentional because this function is being called within a SucceedsSoon.
I'm not following; this current form works just fine with the SucceedsSoon block. Previously, if there were no non-voters, this would pollute the test output with "converting NULL to int is unsupported" errors because of the query, while still returning numNonVoters == 0. Now it doesn't, and just returns zero to the caller. It's functionally no different: at the caller you'll still get the SucceedsSoon warnings that the voter/non-voter counts are not what the test expects. Were you perhaps getting at something else I missed?
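For reference, the retry shape under discussion looks roughly like this. It is only a sketch: getNonVoterCount and expectedNonVoterCount are hypothetical stand-ins for the test's own query helper and expectation. A zero count simply makes the closure return an error, which SucceedsSoon retries and eventually surfaces, so no extra log.Warning is needed.

```go
// Sketch only: getNonVoterCount and expectedNonVoterCount are hypothetical
// stand-ins for the test's query helper and its expected replica count.
testutils.SucceedsSoon(t, func() error {
	numNonVoters, err := getNonVoterCount()
	if err != nil {
		return err
	}
	if numNonVoters != expectedNonVoterCount {
		return errors.Errorf("expected %d non-voters, found %d",
			expectedNonVoterCount, numNonVoters)
	}
	return nil
})
```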
Just ran under stress before and after the commit a few more times on my GCE worker; reliably fails within a minute before (tried 3 more times), still running with success after this commit 15m later.
Discussed offline, my bad for the false alarm. This all LGTM.
Thanks! bors r+
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |