roachtest: restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16 failed #106486
n8 oomkilled. Unfortunately the last heap profile in the artifacts doesn't show anything unusual. It passed in the last three runs, on 6/27, 6/24, and 7/1, so this might be new and maybe related to the removal of the req limiter? /cc @irfansharif
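For anyone retracing this, here is a minimal sketch of scanning a heap profile like the ones in the artifacts for outsized allocators, using the google/pprof profile library. The file name below is a placeholder, not the actual artifact name:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"sort"

	"github.com/google/pprof/profile"
)

func main() {
	// "heap.pprof" is a hypothetical name; substitute the actual heap
	// profile pulled from the test artifacts.
	f, err := os.Open("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	p, err := profile.Parse(f)
	if err != nil {
		log.Fatal(err)
	}

	// Find the inuse_space value index; Go heap profiles record
	// alloc_objects, alloc_space, inuse_objects, and inuse_space.
	idx := -1
	for i, st := range p.SampleType {
		if st.Type == "inuse_space" {
			idx = i
		}
	}
	if idx == -1 {
		log.Fatal("no inuse_space sample type in profile")
	}

	// Aggregate in-use bytes by the leaf function of each sample.
	bytesByFunc := map[string]int64{}
	for _, s := range p.Sample {
		if len(s.Location) == 0 || len(s.Location[0].Line) == 0 {
			continue
		}
		name := s.Location[0].Line[0].Function.Name
		bytesByFunc[name] += s.Value[idx]
	}

	// Print the top 10 allocators by retained bytes.
	type kv struct {
		name  string
		bytes int64
	}
	var top []kv
	for n, b := range bytesByFunc {
		top = append(top, kv{n, b})
	}
	sort.Slice(top, func(i, j int) bool { return top[i].bytes > top[j].bytes })
	for i := 0; i < len(top) && i < 10; i++ {
		fmt.Printf("%12d bytes  %s\n", top[i].bytes, top[i].name)
	}
}
```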
Given that the other restore failures last night did have usable profiles, and those pointed to raft, I'm going to guess this is the same.
cc @cockroachdb/replication
+cc #73376.
Why are these all popping up now, on 23.1 even? Did we change machine types or something, or backport anything related? |
This test is new and weekly. It has passed exactly once, on July 15. The first failure was a panic during a split, which was resolved here. So we have very little signal. It's also a massive test that is expected to show problems such as #102840 more prominently. Either way, it's hard to justify that this should be a release blocker.
I should also add that some of the existing restore tests oomed in the July 6 - July 10 period and then stopped doing that, across the master/release branches. This hints at something in the infrastructure that was temporarily different. x-ref #106248 (comment). Handing this back to @pavelkalinnikov to track the eventual resolution through cockroachlabs.atlassian.net/browse/CRDB-25503.
I took a look and I don't see evidence of n8 being oomkilled. In fact, the post-test checks all confirm it's running, and all nodes return 200 on the ready endpoint.
I spot checked a few logs and they give off a good vibe (i.e. no obvious breakage, slow readies, etc.). This looks more like a network infra flake to me.
Without more debug info (like at least #97788, and the debug zip didn't work here either), I'm not sure what else to do, so I'll close. Note that the test has passed a few times since.
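For reference, a minimal sketch of the kind of ready-endpoint check described above, assuming CockroachDB's `/health?ready=1` HTTP endpoint on the default admin port; the node address below is a placeholder:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical node addresses; in the real test these would be the
	// cluster nodes' admin UI endpoints (port 8080 by default).
	nodes := []string{"n8.example:8080"}

	client := &http.Client{Timeout: 5 * time.Second}
	for _, addr := range nodes {
		// /health?ready=1 returns 200 only if the node is live and
		// accepting client connections.
		resp, err := client.Get(fmt.Sprintf("http://%s/health?ready=1", addr))
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", addr, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s: status %d\n", addr, resp.StatusCode)
	}
}
```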
roachtest.restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16 failed with artifacts on master @ 43c26aec0072f76e02e6d5ffc1b7079026b24630:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-29582