roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed [CRDB-25503 replication send oom] #106496
n8 OOM'ed, with the last profile showing unbounded growth. #73376 looks like the tracking issue for this class of OOMs.
cc @cockroachdb/replication
This test fails regularly. In the earlier investigation we saw that we were maxing out the EBS volumes (or got throttled in some way) and ultimately found that the test was running on a striped RAID0 setup that gives the cluster inhomogeneous disk performance profiles, which is not something we support well. That issue is still open, but @msbutler merged a targeted fix on March 30. Still, this predates all of the test failures we're looking at above.

@erikgrinaker looked into a failure in April and concluded that we were essentially overloading the disks and entering raft overload territory that way. In his testing #101437 seemed to address that well enough, so the issue was closed (the backport landed on April 18) and we didn't see any failures until July 6 and July 10 (with 16 successes since).

My read on this is similar to #106248 (comment): we know replication memory protection is insufficient, and we're running these clusters with maxed-out disks, so any additional imbalance can tip them into unhealthy territory. The temporal correlation between this issue and #106496 is interesting, since both failed twice at around the same time and then not again. So there is a good chance that whatever the underlying issue was (perhaps a PR that got merged and then reverted, or AWS infra problems, who knows) has been resolved. While it would be nice to have more confidence, not blocking the release still seems to be the right call.

Handing this back to @pavelkalinnikov to track the eventual resolution through https://cockroachlabs.atlassian.net/browse/CRDB-25503.
108121: sql: rewind txn sequence number in internal executor r=rafiss a=rafiss

If the internal executor is used by a user-initiated query, then it shares the same transaction as the user's query. In that case, it's important to step back the sequence number so that a single statement does not read its own writes.

Fixes #86162

Release note: None

108201: roachtest: suppress grafana link in issue help for non GCE r=smg260 a=smg260

We don't want to show a grafana link when the issue originates from non-GCE tests/clusters, since we don't scrape from those yet.

Epic: none
Release note: None

108427: roachtest: provision 250 MB/s for 8TB restore test r=msbutler a=pavelkalinnikov

The `restore/tpce/8TB/aws/nodes=10/cpus=8` test maxes out the default 125 MB/s EBS throughput. This commit provisions 250 MB/s of throughput so that the test doesn't run at the edge of overload. See #107609 (comment) for the before/after comparison.

Touches #106496
Fixes #107609

Epic: none
Release note: none

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Miral Gadani <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
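As a rough illustration of why the default 125 MB/s is a problem for this test (this is not from the PR; the replication factor and write-amplification multiplier below are assumptions for the sake of the arithmetic), here is a back-of-envelope sketch:

```python
# Hedged back-of-envelope: per-node disk write load implied by the 8TB restore.
# Replication factor (3) and the write-amplification multiplier for LSM
# compactions (~2x) are illustrative assumptions, not figures from the issue.

RESTORE_SIZE_TB = 8
NODES = 10
REPLICATION_FACTOR = 3      # assumption: default CRDB replication factor
WRITE_AMPLIFICATION = 2.0   # assumption: extra writes from compactions
TB = 1e12                   # decimal bytes, to match EBS MB/s units
MB = 1e6

# Logical bytes each node must persist for its share of the restore.
bytes_per_node = RESTORE_SIZE_TB * TB * REPLICATION_FACTOR / NODES

def restore_hours(throughput_mb_s: float) -> float:
    """Hours of sustained writes per node at a given EBS throughput."""
    effective_bytes = bytes_per_node * WRITE_AMPLIFICATION
    return effective_bytes / (throughput_mb_s * MB) / 3600

for throughput in (125, 250):  # default gp3 vs. provisioned throughput
    print(f"{throughput} MB/s -> ~{restore_hours(throughput):.1f} h of pure writes per node")
```

Under these assumptions the restore alone consumes the entire 125 MB/s budget for many hours, leaving effectively no headroom for foreground raft and compaction traffic; doubling the provisioned throughput is what buys that headroom.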
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ f62527e83e458f5f4521497e7bfd5c60b70f1e31:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Same failure on other branches
This test on roachdash | Improve this report!
Jira issue: CRDB-29590