roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #52206
Comments
I think this is the same as the first failures in #48255.
@nvanbenschoten does that sound ok? |
Yes, it looks like that's part of the issue. The other part is the 1h40m delay we add in to wait for rebalancing (see #44999). It's unclear whether even that was enough, because we then proceed to badly fail each iteration of the test. I wonder if we should introduce a minimum warehouse count for these kinds of tests. Right now, they'll just search all the way down to 0 if there are issues. This will likely lead to a test timeout instead of a more descriptive "test failed due to perf issues" error.
I think the other piece of this is that we don't take the expected test time into account when scheduling roachtests. We can see from TC (https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Nightlies&testNameId=-7667002519850730298&tab=testDetails) that when this test passes, it usually takes a little over 6h to run. If it starts running too late into the overall roachtest run, then there's a good chance that it will not complete in time. I think this might actually be fairly common because of the way we prioritize cluster re-use.
I wonder if we should add a new heuristic to run tests in reverse order of their expected duration (longest first). Or just throw a bit more parallelism and CPU quota at the nightly roachtests. Is there any reason not to crank the parallelism way up? Just that doing so would undermine the cluster re-use policy if not combined with a "diversity" heuristic for choosing what to run concurrently? |
Is there an overall timeout for all the tests, separate from the per-test timeout? |
Yes, 20 hours: https://github.com/cockroachdb/cockroach/blob/master/build/teamcity-nightly-roachtest.sh#L114. But we don't seem to be hitting that here. We're actually hitting the 10h timeout. That doesn't add up, even with the 4 hour import. Looking at the logs, we see:
Notice the 3 hour pause for a ~20m operation. Looking at the
|
Now check out the workload logs (
Why did this command get stuck on all three load gen nodes? |
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on release-20.1@607d3fb91a15fbd4613f22396d5828c4d27d390b:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash |
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on release-20.1@0df968b1e237ccd88cef89e851bcd52e90932280:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
See this test on roachdash |
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on release-20.1@20ea783887c1f33ab925cc8233041c54b58da1c5:
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
Related:
#51160 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure, O-roachtest, O-robot, branch-provisional_202007071743_v20.2.0-alpha.2, release-blocker]
#50698 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure, O-roachtest, O-robot, branch-release-19.2, release-blocker]
#50518 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure, O-roachtest, O-robot, branch-provisional_202006220937_v19.2.8, release-blocker]
#50142 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure, O-roachtest, O-robot, branch-provisional_202006091546_v20.1.2, release-blocker]
#48255 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure, O-roachtest, O-robot, branch-master, release-blocker]
#46343 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure, O-roachtest, O-robot, branch-release-19.1, release-blocker]
See this test on roachdash
powered by pkg/cmd/internal/issues