
roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed [CRDB-25503 replication send oom] #106496

Closed
cockroach-teamcity opened this issue Jul 10, 2023 · 5 comments


cockroach-teamcity commented Jul 10, 2023

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on release-23.1 @ f62527e83e458f5f4521497e7bfd5c60b70f1e31:

(monitor.go:137).Wait: monitor failure: monitor command failure: unexpected node event: 8: dead (exit status 137)
(test_runner.go:1120).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=aws, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash

Jira issue: CRDB-29590

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Jul 10, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jul 10, 2023
adityamaru commented Jul 10, 2023

n8 OOM'ed with the last profile showing unbounded growth in maybeInlineSideloadedRaftCommand:

[screenshot, 2023-07-10: heap profile showing growth in maybeInlineSideloadedRaftCommand]

#73376 looks like the tracking issue for this class of OOMs.
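
For intuition, here is a minimal, self-contained sketch (not the CockroachDB implementation; the `entry` type, `inlineForSend`, and all sizes are made up for illustration) of why re-inlining sideloaded payloads while batching raft entries for a send can grow memory without bound, and how a byte budget caps the batch:

```go
// Illustrative sketch only: sideloaded raft entries keep their large
// payloads (e.g. SSTables ingested during RESTORE) on disk; sending them to
// a follower pulls the payloads back into memory ("inlining").
package main

import "fmt"

// entry stands in for a raft entry whose payload is sideloaded on disk.
type entry struct {
	index       uint64
	payloadSize int // bytes that inlining would bring back into memory
}

// inlineForSend models building one outgoing message. Without the maxBytes
// guard, a follower with a large backlog forces every payload to be resident
// in memory at once.
func inlineForSend(entries []entry, maxBytes int) (batch []entry, inlinedBytes int) {
	for _, e := range entries {
		// Hypothetical budget: stop batching once the in-memory total
		// would exceed maxBytes, and send what we have so far.
		if inlinedBytes+e.payloadSize > maxBytes && len(batch) > 0 {
			break
		}
		inlinedBytes += e.payloadSize
		batch = append(batch, e)
	}
	return batch, inlinedBytes
}

func main() {
	// 1,000 pending entries of ~32 MiB each: ~31 GiB if inlined in one go.
	backlog := make([]entry, 1000)
	for i := range backlog {
		backlog[i] = entry{index: uint64(i), payloadSize: 32 << 20}
	}
	batch, bytes := inlineForSend(backlog, 256<<20) // 256 MiB budget
	fmt.Printf("sent %d entries, %d MiB inlined\n", len(batch), bytes>>20)
}
```

With the 256 MiB budget the batch stops after eight entries; without it, the entire backlog would be resident at once, which is the shape of the growth in the profile above.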

@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 10, 2023
@adityamaru adityamaru added A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication and removed T-disaster-recovery labels Jul 10, 2023

blathers-crl bot commented Jul 10, 2023

cc @cockroachdb/replication


tbg commented Jul 19, 2023

This test fails regularly [1]:


(CI does not have history before April)

[image: test failure history chart since April]

Looking into the investigation here [2], we saw that we were maxing out the EBS volumes (or got throttled in some way) and ultimately found that this test was running on a weird striped RAID0 [3] that gives the cluster inhomogeneous disk performance profiles, which is not something we support well. That issue is still open [4], but @msbutler merged a targeted fix [5] on March 30. Still, this predates all of the test failures we're looking at above.

@erikgrinaker looked into a failure in April here [6], saying that we were basically overloading the disks and entering raft overload territory that way, but in his testing #101437 seemed to address it well enough that the issue was closed (backport landed on April 18 [7]), and we didn't see any failures until July 6 and July 10 (with 16 successes since).

My read on this is similar to #106248 (comment). We know replication memory protection is insufficient, and we're running these clusters with maxed-out disks, so any additional imbalance can tip them into unhealthy territory. The temporal correlation between this issue and #106248 is interesting, since both failed twice at around the same time and then not again. So there is a good chance that whatever the issue was (perhaps some PR that got merged and then reverted, or AWS infra problems, who knows) has been resolved. While it would be nice to have more confidence, not blocking the release still seems to be the right call.

Handing this back to @pavelkalinnikov to track the eventual resolution through https://cockroachlabs.atlassian.net/browse/CRDB-25503.

Footnotes

[1] https://teamcity.cockroachdb.com/test/-8181195248832451825?currentProjectId=Cockroach_Nightlies&expandTestHistoryChartSection=true&orderBy=status&order=desc
[2] https://github.com/cockroachdb/cockroach/issues/97019#issuecomment-1441842625
[3] https://github.com/cockroachdb/cockroach/issues/97019#issuecomment-1479401456
[4] https://github.com/cockroachdb/cockroach/issues/98783
[5] https://github.com/cockroachdb/cockroach/pull/100136/files
[6] https://github.com/cockroachdb/cockroach/issues/100341#issuecomment-1505496453
[7] https://github.com/cockroachdb/cockroach/pull/101508

@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 19, 2023
@tbg tbg assigned pav-kv and unassigned tbg Jul 19, 2023
@tbg tbg added the A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. label Jul 24, 2023
@tbg tbg changed the title roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed [CRDB-25503 replication send oom] Jul 25, 2023
@irfansharif irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 28, 2023

pav-kv commented Aug 7, 2023

At least half of the nodes spend an extended period pinned at the 125 MB/s write throughput limit (same behaviour as described in #99206):

[screenshot, 2023-08-07: per-node write throughput flat at 125 MB/s]

We should bump the threshold as #107609 suggests.
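
For a rough sense of scale (back-of-envelope, assuming the default 3x replication and an even spread of work across the cluster): restoring 8 TB writes about 8 TB × 3 ≈ 24 TB in total, or ~2.4 TB per node; at 125 MB/s that is ~19,200 s, i.e. more than five hours of fully saturated disk per node even before compaction write amplification, so the nodes sit at the provisioned limit for most of the run.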

craig bot pushed a commit that referenced this issue Aug 10, 2023
108121: sql: rewind txn sequence number in internal executor r=rafiss a=rafiss

If the internal executor is used by a user-initiated query, then it shares the same transaction as the user's query. In that case, it's important to step back the sequence number so that a single statement does not read its own writes.

fixes #86162
Release note: None

108201: roachtest: suppress grafana link in issue help for non GCE r=smg260 a=smg260

We don't want to show a grafana link when the issue originates from non-GCE tests/clusters, since we don't scrape from those yet.

Epic: none

Release note: None

108427: roachtest: provision 250 MB/s for 8TB restore test r=msbutler a=pavelkalinnikov

The `restore/tpce/8TB/aws/nodes=10/cpus=8` test maxes out the default 125 MB/s EBS throughput. This commit provisions 250 MB/s of throughput so that the test doesn't run at the edge of overload.

See #107609 (comment) for the before/after comparison.

Touches #106496
Fixes #107609
Epic: none
Release note: none

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Miral Gadani <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
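
For reference, raising gp3 throughput on an already-created EBS volume can be sketched with the AWS SDK for Go v2 (the volume ID below is a placeholder and this is only an illustration; the roachtest change in #108427 instead provisions the higher throughput at cluster-creation time):

```go
// Illustrative only: bump a gp3 volume from the 125 MiB/s default to 250 MiB/s.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Request the new provisioned throughput for the volume (placeholder ID).
	out, err := client.ModifyVolume(ctx, &ec2.ModifyVolumeInput{
		VolumeId:   aws.String("vol-0123456789abcdef0"),
		Throughput: aws.Int32(250),
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("modification state: %v\n", out.VolumeModification.ModificationState)
}
```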

pav-kv commented Aug 10, 2023

Closing via #108513. The OOM may pop up again (though it is less likely now) until we properly fix #73376.
