
roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed [CRDB-25503 replication send oom] #106496

Closed
cockroach-teamcity opened this issue Jul 10, 2023 · 5 comments


cockroach-teamcity commented Jul 10, 2023

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ f62527e83e458f5f4521497e7bfd5c60b70f1e31:

(monitor.go:137).Wait: monitor failure: monitor command failure: unexpected node event: 8: dead (exit status 137)
(test_runner.go:1120).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=aws, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash

Jira issue: CRDB-29590

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Jul 10, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jul 10, 2023
adityamaru commented Jul 10, 2023

n8 OOM'ed with the last profile showing unbounded growth in maybeInlineSideloadedRaftCommand:

[screenshot, 2023-07-10: heap profile showing growth in maybeInlineSideloadedRaftCommand]

#73376 looks like the tracking issue for this class of OOMs.
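
For intuition, here is a minimal, self-contained sketch (not the CockroachDB implementation; the `entry` type, `inlineForSend`, and all sizes are made up for illustration) of why re-inlining sideloaded payloads while batching raft entries for a send can grow memory without bound, and how a byte budget caps the batch:

```go
// Illustrative sketch only: sideloaded raft entries keep their large
// payloads (e.g. SSTables ingested during RESTORE) on disk; sending them to
// a follower pulls the payloads back into memory ("inlining").
package main

import "fmt"

// entry stands in for a raft entry whose payload is sideloaded on disk.
type entry struct {
	index       uint64
	payloadSize int // bytes that inlining would bring back into memory
}

// inlineForSend models building one outgoing message. Without the maxBytes
// guard, a follower with a large backlog forces every payload to be resident
// in memory at once.
func inlineForSend(entries []entry, maxBytes int) (batch []entry, inlinedBytes int) {
	for _, e := range entries {
		// Hypothetical budget: stop batching once the in-memory total
		// would exceed maxBytes, and send what we have so far.
		if inlinedBytes+e.payloadSize > maxBytes && len(batch) > 0 {
			break
		}
		inlinedBytes += e.payloadSize
		batch = append(batch, e)
	}
	return batch, inlinedBytes
}

func main() {
	// 1,000 pending entries of ~32 MiB each: ~31 GiB if inlined in one go.
	backlog := make([]entry, 1000)
	for i := range backlog {
		backlog[i] = entry{index: uint64(i), payloadSize: 32 << 20}
	}
	batch, bytes := inlineForSend(backlog, 256<<20) // 256 MiB budget
	fmt.Printf("sent %d entries, %d MiB inlined\n", len(batch), bytes>>20)
}
```

With the 256 MiB budget the batch stops after eight entries; without it, the entire backlog would be resident at once, which is the shape of the growth in the profile above.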

@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 10, 2023
@adityamaru adityamaru added A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication and removed T-disaster-recovery labels Jul 10, 2023

blathers-crl bot commented Jul 10, 2023

cc @cockroachdb/replication


tbg commented Jul 19, 2023

This test fails regularly [1]:


(CI does not have history before April)

[image: test failure history chart since April]

Looking into the investigation here [2], we saw that we were maxing out the EBS volumes (or got throttled in some way) and ultimately found that this test was running on a weird striped RAID0 [3] that gives the cluster inhomogeneous disk performance profiles, which is not something we support well. That issue is still open [4], but @msbutler merged a targeted fix [5] on March 30. Still, this predates all of the test failures we're looking at above.

@erikgrinaker looked into a failure in April here [6], saying that we were basically overloading the disks and entering raft overload territory that way, but in his testing #101437 seemed to address it well enough that the issue was closed (backport landed on April 18 [7]), and we didn't see any failures until July 6 and July 10 (with 16 successes since).

My read on this is similar to #106248 (comment). We know replication memory protection is insufficient, and we're running these clusters with maxed-out disks, so any additional imbalance can tip them into unhealthy territory. The temporal correlation between this issue and #106248 is interesting, since both failed twice at around the same time and then not again. So there is a good chance that whatever the issue was (perhaps some PR that got merged and then reverted, or AWS infra problems, who knows) has been resolved. While it would be nice to have more confidence, not blocking the release still seems to be the right call.

Handing this back to @pavelkalinnikov to track the eventual resolution through https://cockroachlabs.atlassian.net/browse/CRDB-25503.

Footnotes

[1] https://teamcity.cockroachdb.com/test/-8181195248832451825?currentProjectId=Cockroach_Nightlies&expandTestHistoryChartSection=true&orderBy=status&order=desc
[2] https://github.com/cockroachdb/cockroach/issues/97019#issuecomment-1441842625
[3] https://github.com/cockroachdb/cockroach/issues/97019#issuecomment-1479401456
[4] https://github.com/cockroachdb/cockroach/issues/98783
[5] https://github.com/cockroachdb/cockroach/pull/100136/files
[6] https://github.com/cockroachdb/cockroach/issues/100341#issuecomment-1505496453
[7] https://github.com/cockroachdb/cockroach/pull/101508

@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 19, 2023
@tbg tbg assigned pav-kv and unassigned tbg Jul 19, 2023
@tbg tbg added the A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. label Jul 24, 2023
@tbg tbg changed the title roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed [CRDB-25503 replication send oom] Jul 25, 2023
@irfansharif irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 28, 2023

pav-kv commented Aug 7, 2023

At least half of the nodes spend an extended period pinned at the 125 MB/s write throughput limit (same behaviour as described in #99206):

[screenshot, 2023-08-07: per-node write throughput flat at 125 MB/s]

We should bump the threshold as #107609 suggests.
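
For a rough sense of scale (back-of-envelope, assuming the default 3x replication and an even spread of work across the cluster): restoring 8 TB writes about 8 TB × 3 ≈ 24 TB in total, or ~2.4 TB per node; at 125 MB/s that is ~19,200 s, i.e. more than five hours of fully saturated disk per node even before compaction write amplification, so the nodes sit at the provisioned limit for most of the run.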

craig bot pushed a commit that referenced this issue Aug 10, 2023
108121: sql: rewind txn sequence number in internal executor r=rafiss a=rafiss

If the internal executor is used by a user-initiated query, then it shares the same transaction as the user's query. In that case, it's important to step back the sequence number so that a single statement does not read its own writes.

fixes #86162
Release note: None

108201: roachtest: suppress grafana link in issue help for non GCE r=smg260 a=smg260

We don't want to show a grafana link when the issue originates from non-GCE tests/clusters, since we don't scrape from those yet.

Epic: none

Release note: None

108427: roachtest: provision 250 MB/s for 8TB restore test r=msbutler a=pavelkalinnikov

The `restore/tpce/8TB/aws/nodes=10/cpus=8` test maxes out the default 125 MB/s EBS throughput. This commit provisions 250 MB/s of throughput so that the test doesn't run at the edge of overload.

See #107609 (comment) for the before/after comparison.

Touches #106496
Fixes #107609
Epic: none
Release note: none

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Miral Gadani <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
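
For reference, raising gp3 throughput on an already-created EBS volume can be sketched with the AWS SDK for Go v2 (the volume ID below is a placeholder and this is only an illustration; the roachtest change in #108427 instead provisions the higher throughput at cluster-creation time):

```go
// Illustrative only: bump a gp3 volume from the 125 MiB/s default to 250 MiB/s.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Request the new provisioned throughput for the volume (placeholder ID).
	out, err := client.ModifyVolume(ctx, &ec2.ModifyVolumeInput{
		VolumeId:   aws.String("vol-0123456789abcdef0"),
		Throughput: aws.Int32(250),
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("modification state: %v\n", out.VolumeModification.ModificationState)
}
```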

pav-kv commented Aug 10, 2023

Closing via #108513. The OOM may pop up again (though it is less likely now) until we properly fix #73376.
