roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed [CRDB-25503 replication send oom] #106248
roachtest.restore/tpce/400GB/aws/nodes=4/cpus=8 failed with artifacts on release-23.1 @ 15480b8496e7928845330055d191e331b228eb66:
Parameters:
n3 OOMed. 96% of memory usage is in raft, just over 11 GB.
cc @irfansharif
+cc @pavelkalinnikov. #106486 is the other AWS roachtest failure.
cc @cockroachdb/replication
+cc #73376.
Looking at the test history, this test started running in earnest[^1] at the end of March '23, and promptly [failed](https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestFipsNightlyAwsBazel/9365035?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true) (edit: that's actually another, similar, test) with an OOM on March 31. So from that alone we can argue that this is not a release blocker: we introduced a new test, it fails, and it likely highlights an existing problem. Mechanistically, we understand why this class of test is likely to fail: we know memory bounding in the replication layer is insufficient and causes issues when disks are overloaded, which is likely on AWS due to the relatively slow default gp3 provisioning (combined r+w throughput of 125 MB/s).

[^1]: I need to spend some more time here to validate that, though. Over on #106496 I'm seeing that the test already exists on 27fcfd3, which is from Feb 28. Perhaps CI is not retaining history all the way back, or we changed the CI config only late in March to actually run these tests. Will look a bit more later.
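To make that failure mode concrete, here is a minimal Go sketch. This is not CockroachDB's actual code; the type names, the per-peer message limit, and the payload size are all made up. It only illustrates how a send queue bounded by message count rather than by bytes can pin a large live heap when the consumer (e.g. a peer whose disk is saturated) drains more slowly than the producer appends.

```go
package main

import "fmt"

// raftMsg stands in for an outgoing replication message carrying log entries.
type raftMsg struct {
	payload []byte
}

// sendQueue is bounded only by message *count*; nothing tracks the total
// number of bytes sitting in the backlog.
type sendQueue struct {
	ch chan raftMsg
}

func newSendQueue(maxMsgs int) *sendQueue {
	return &sendQueue{ch: make(chan raftMsg, maxMsgs)}
}

// tryEnqueue accepts a message unless the count-based bound is hit. Every
// accepted payload stays live on the heap until a (slow) consumer drains it.
func (q *sendQueue) tryEnqueue(m raftMsg) bool {
	select {
	case q.ch <- m:
		return true
	default:
		return false
	}
}

func main() {
	q := newSendQueue(4096)       // hypothetical per-peer message limit
	const payloadSize = 256 << 10 // 256 KiB per message, also hypothetical
	queued := 0
	// With no consumer at all (simulating a stalled peer), the queue fills to
	// its count limit while the bytes it pins grow unchecked.
	for q.tryEnqueue(raftMsg{payload: make([]byte, payloadSize)}) {
		queued += payloadSize
	}
	fmt.Printf("backlog pins ~%d MiB of live memory\n", queued>>20)
}
```

Scaled across many peers and ranges, a count-only bound like this is consistent with the ~11 GB of raft memory reported in the heap profile, though the real queues and limits in CockroachDB differ from this sketch.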
Looking across branches, the only failures we have for this particular test are three recent ones; in contrast there are 188 successful runs (dating back to late March, before which I think CI has deleted history). After the first failure, we see failures on July 6 and July 9, followed by (to date) 20 successful runs. If there is a regression, it is not a common one and it will be extremely difficult to pinpoint. On top of this, since this is AWS, we don't have Grafana, which would help a lot with comparing some metrics side by side over time.

Another PR whose timing lines up is #104861. This was merged on July 5, but not backported. It thus cannot affect the release-23.1 branch, and besides, the same argument as above about the many successful runs applies.

Realistically, we will have to live with these test failures and address them one by one through CRDB-25503. Handing this back to @pavelkalinnikov to track the eventual resolution through https://cockroachlabs.atlassian.net/browse/CRDB-25503.
https://pprof.me/4ded4ab/ (don't forget to switch to inuse bytes) shows mem in raft transport send queue. |
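For anyone retracing this: a Go heap profile records both cumulative allocation and live ("inuse") sample types, and only the inuse view reflects memory that was actually resident at the time of the OOM. A minimal sketch of dumping such a profile from any Go program follows; this is generic Go tooling, not CockroachDB-specific code.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("heap.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The "heap" profile contains alloc_space/alloc_objects (cumulative) and
	// inuse_space/inuse_objects (live). When diagnosing an OOM, view inuse
	// bytes; cumulative allocations also count memory that was long since freed.
	if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}
```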
The graphs above look unusual. Most of the time it should operate close to the disks' throughput capacity. So the timeout is reasonable: the writes were too slow for some reason. To compare, I ran this test manually and see something like this (very similar to the graphs in the 8TB test).
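As a rough sanity check on "operating close to disk throughput capacity", here is a back-of-the-envelope estimate. It assumes replication factor 3, an even spread across the 4 nodes, and ignores read traffic and write amplification, all of which push the real number higher; only the 400 GB figure comes from the test name.

```go
package main

import "fmt"

func main() {
	const (
		logicalBytes      = 400e9 // 400 GB data set, from the test name
		replicationFactor = 3.0   // assumed default replication
		nodes             = 4.0
		diskBytesPerSec   = 125e6 // default gp3 combined r+w throughput
	)
	// Bytes each node must write, then the time floor imposed by the disk cap.
	perNodeBytes := logicalBytes * replicationFactor / nodes
	seconds := perNodeBytes / diskBytesPerSec
	fmt.Printf("~%.0f GB written per node => at least %.0f s (~%.0f min) at 125 MB/s\n",
		perNodeBytes/1e9, seconds, seconds/60)
}
```

Any ingest write amplification raises this ~40-minute floor further, which is consistent with the restore sitting at the throughput cap for most of its runtime.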
The `restore/tpce/*` family of tests on AWS max out the default 125 MB/s EBS throughput. In contrast, similar tests in GCE provision for more throughput and don't max it out.

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all `restore` tests on AWS, so that the tests don't work at the edge of overload. This both brings some parity between testing on GCE and AWS, and reduces the likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes cockroachdb#107609
Touches cockroachdb#106248
Epic: none
Release note: none
Closing, as this is tracked by #107609.
109221: roachtest: provision 250 MB/s for restore tests on AWS r=pavelkalinnikov a=pavelkalinnikov

The `restore/tpce/*` family of tests on AWS max out the default 125 MB/s EBS throughput. In contrast, similar tests in GCE provision for more throughput and [don't max it out](#107609 (comment)).

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all `restore` tests on AWS, so that the tests don't work at the edge of overload. This both brings some parity between testing on GCE and AWS, and reduces the likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes #107609
Touches #106248
Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <[email protected]>
The `restore/tpce/*` family of tests on AWS max out the default 125 MB/s EBS throughput. In contrast, similar tests in GCE provision for more throughput and don't max it out.

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all `restore` tests on AWS, so that the tests don't work at the edge of overload. This both brings some parity between testing on GCE and AWS, and reduces the likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes #106248
Epic: none
Release note: none
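For reference, a sketch of what provisioning 250 MB/s means at the EBS level, using the AWS SDK for Go v2. This is an illustration only, not the roachtest/roachprod code (those configure volumes through their own cluster specs); the availability zone and volume size below are placeholders.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// gp3 volumes default to 125 MB/s; Throughput raises that independently
	// of volume size.
	out, err := client.CreateVolume(ctx, &ec2.CreateVolumeInput{
		AvailabilityZone: aws.String("us-east-2a"), // placeholder
		VolumeType:       types.VolumeTypeGp3,
		Size:             aws.Int32(500),  // GiB, placeholder
		Throughput:       aws.Int32(250),  // MB/s, up from the 125 MB/s default
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created volume %s with 250 MB/s provisioned throughput",
		aws.ToString(out.VolumeId))
}
```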
roachtest.restore/tpce/400GB/aws/nodes=4/cpus=8 failed with artifacts on release-23.1 @ fa2d7f7c9894d701ac4a393f058aa84552957087:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-29460