streamingccl: TestStreamingRegionalConstraint times out under stress due to allocator cpu starvation #112541
Comments
Let's reskip this test until we can write a stable version of it or investigate the slowness.
Informs cockroachdb#112541 Release note: None
Hrm, actually, I wonder if some of these will be resolved once most branches are ahead of #112597.
Yeah, it looks like the branch (#112613) rafi's referring to needs to be rebased.
I see a timeout even after the new 3m45s - https://teamcity.cockroachdb.com/viewLog.html?buildId=12260852&buildTypeId=Cockroach_BazelExtendedCi in testrace. Is it worth skipping under test race too and only leaving regular execution unskipped?
The test uses RF=1, which if unlucky will require multiple queue scanner cycles to remove replicas (3->1), then another couple scanner cycles to rebalance the replica to satisfy the constraint. This generally occurs when the new leaseholder doesn't have tracking for the follower replicas, and then excludes them as removal candidates (in addition to the comments mentioned on the issue): cockroach/pkg/kv/kvserver/allocator/plan/replicate.go Lines 583 to 612 in 34191e4
Because a benign error is returned, and the remove voter retry loop here doesn't retry (<=2), the result is a full scanner cycle until the next time the range can be retried. Relevant logs from a repro with Details
For higher RFs, the |
112735: streamingccl: reduce scan interval for testing r=msbutler a=kvoli

Reduce the replica scanner min interval from 1s, to 10ms for test clusters. This speeds up tests which rely on replica changes either on the source, or host cluster.

```
dev test pkg/ccl/streamingccl/streamingest \
  -f TestStreamingRegionalConstraint -v --stress
...
Stats over 1000 runs: max = 51.9s, min = 21.6s, avg = 38.3s, dev = 4.6s
```

Resolves: #112541
Release note: None
Co-authored-by: Austen McClernon <[email protected]>
Per discussion with @kvoli and @stevendanna, we're quite confident that #111541 fails due to allocator CPU starvation in our unit test environment. I'm opening this issue in case any of us would like to investigate this further, though we have much bigger fish to fry right now.
The allocator may be particularly slow in this test because TestStreamingRegionalConstraint adds a 1 replica constraint on a table, which could hit:
cockroach/pkg/kv/kvserver/allocator/plan/util.go
Lines 41 to 62 in cddcddd
(Note that the 1 replica constraint avoids the need to set up more test servers, which would further exacerbate CPU starvation problems.)
It's also worth noting that other end-to-end unit tests that exercise allocator code are skipped under stress:
cockroach/pkg/kv/kvserver/replicate_queue_test.go
Line 2212 in cddcddd
Jira issue: CRDB-32483