
roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed [CRDB-25503 replication send oom] #106248

Closed
cockroach-teamcity opened this issue Jul 6, 2023 · 13 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.


cockroach-teamcity commented Jul 6, 2023

roachtest.restore/tpce/400GB/aws/nodes=4/cpus=8 failed with artifacts on release-23.1 @ fa2d7f7c9894d701ac4a393f058aa84552957087:

(test_runner.go:1073).runTest: test timed out (1h0m0s)
(monitor.go:137).Wait: monitor failure: monitor task failed: output in run_080048.232311134_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z' " returned: COMMAND_PROBLEM: exit status 137
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=4/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-29460

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Jul 6, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jul 6, 2023
@cockroach-teamcity

roachtest.restore/tpce/400GB/aws/nodes=4/cpus=8 failed with artifacts on release-23.1 @ 15480b8496e7928845330055d191e331b228eb66:

(monitor.go:137).Wait: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)
(test_runner.go:1120).func1: 1 dead node(s) detected
(test_runner.go:1120).func1: operation "invalid descriptors check" timed out after 1m0.027s (given timeout 1m0s): pq: query execution canceled
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=4/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!


dt commented Jul 9, 2023

n3 OOMed; 96% of its memory usage is in raft, just over 11 GB.


dt commented Jul 9, 2023

cc @irfansharif

@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 9, 2023

irfansharif commented Jul 10, 2023

+cc @pavelkalinnikov. #106486 is the other AWS roachtest failure.

@pav-kv pav-kv added the A-kv-replication Relating to Raft, consensus, and coordination. label Jul 10, 2023

blathers-crl bot commented Jul 10, 2023

cc @cockroachdb/replication

@pav-kv pav-kv added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 10, 2023
@irfansharif

+cc #73376.

@shralex shralex removed the T-kv KV Team label Jul 18, 2023

tbg commented Jul 19, 2023

Looking at the test history, this test started running in earnest1 at the end of March '23, and promptly failed with an OOM on March 31 (edit: that was actually another, similar test): https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestFipsNightlyAwsBazel/9365035?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true


So from that alone we can argue that this is not a release blocker: we introduced a new test, it fails, and it likely highlights a pre-existing problem. Mechanistically, we understand why this class of test is likely to fail: memory bounding in the replication layer is insufficient and causes issues when the disks are overloaded, which is likely on AWS due to the relatively slow default gp3 provisioning (combined r+w throughput of 125 MB/s).
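
To make the failure mode concrete, here is a deliberately simplified Go sketch of an outgoing per-peer send queue with an optional byte budget. The names (`raftMsg`, `sendQueue`, `enqueue`) are hypothetical and this is not CockroachDB's actual RaftTransport code; it only illustrates how a queue bounded by message count but not by bytes lets a slow receiver (or a sender stalled behind an overloaded disk) grow the backlog until the node OOMs.

```go
// Simplified, hypothetical sketch; not the actual RaftTransport implementation.
package transportsketch

import "sync/atomic"

// raftMsg stands in for an outgoing raft message (log entries, snapshot
// chunks, etc.).
type raftMsg struct {
	payload []byte
}

// sendQueue models the outgoing queue to a single peer. It is bounded by
// message count (the channel capacity) and, optionally, by total payload
// bytes. maxBytes == 0 disables the byte budget, which is the unbounded case:
// a slow consumer lets the queued payloads grow to many gigabytes of live heap.
type sendQueue struct {
	ch       chan raftMsg
	maxBytes int64
	curBytes atomic.Int64
}

func newSendQueue(msgCap int, maxBytes int64) *sendQueue {
	return &sendQueue{ch: make(chan raftMsg, msgCap), maxBytes: maxBytes}
}

// enqueue drops the message (returns false) once the byte budget is exceeded.
// Dropping is safe for raft, which retransmits; buffering without a byte
// bound is what turns a slow disk into an OOM. The check is racy under
// concurrent enqueues, so the budget is a soft limit.
func (q *sendQueue) enqueue(m raftMsg) bool {
	sz := int64(len(m.payload))
	if q.maxBytes > 0 && q.curBytes.Load()+sz > q.maxBytes {
		return false
	}
	select {
	case q.ch <- m:
		q.curBytes.Add(sz)
		return true
	default:
		return false // full by message count
	}
}

// dequeue is called by the goroutine that writes messages out to the peer.
func (q *sendQueue) dequeue() (raftMsg, bool) {
	select {
	case m := <-q.ch:
		q.curBytes.Add(-int64(len(m.payload)))
		return m, true
	default:
		return raftMsg{}, false
	}
}
```

The byte-budgeting itself is the work tracked under CRDB-25503; the sketch only shows why an unbounded send queue plus an overloaded disk adds up to the ~11 GB of raft memory reported above.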

Footnotes

  1. probably introduced a little earlier: https://github.com/cockroachdb/cockroach/commit/b588f1f02c4e7bd4a10a08d9b20a914e03751e40


tbg commented Jul 19, 2023

I need to spend some more time here to validate that, though. Over on #106496 I'm seeing that the test already exists on 27fcfd3, which is from Feb 28. Perhaps CI is not retaining history all the way back, or we changed the CI config only late in March to actually run these tests. Will look a bit more later.

@tbg tbg assigned tbg and unassigned pav-kv Jul 19, 2023

tbg commented Jul 19, 2023

Looking across branches, the only failures we have for this particular test are three recent ones; in contrast there are 188 successful runs (dating back to late March, before which I think CI has deleted history).


The first failure is `s3 object does not exist: external_storage: file doesn't exist`, and it was triggered by #105969, which was backported on the same day (July 3) as well1.

We then see failures on July 6 and July 9, followed by (to date) 20 successful runs.
So perhaps changing the bucket altered the speed with which the fixtures are ingested, but the 20 subsequent successes (and a few interspersed ones) suggest that this is not the case.

If there is a regression, it is not a common one, and it will be extremely difficult to pinpoint. On top of that, since this is AWS, we don't have Grafana2, which would help a lot with comparing metrics side by side over time.

Another PR whose timing lines up is #104861. It was merged on July 5 but not backported, so it cannot affect the release-23.1 branch; besides, the same argument about the many successful runs applies here too.

Realistically, we will have to live with these test failures and address them one by one through CRDB-25503.

Handing this back to @pavelkalinnikov to track the eventual resolution through https://cockroachlabs.atlassian.net/browse/CRDB-25503.

Footnotes

  1. https://github.com/cockroachdb/cockroach/pull/106073

  2. https://github.com/cockroachdb/cockroach/issues/97788

@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 19, 2023
@tbg tbg assigned pav-kv and unassigned tbg Jul 19, 2023
@tbg tbg added the A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. label Jul 24, 2023
@tbg tbg changed the title roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed [CRDB-25503 replication mem bounds] Jul 25, 2023

tbg commented Jul 25, 2023

https://pprof.me/4ded4ab/ (don't forget to switch to inuse bytes) shows the memory sitting in the raft transport send queue.
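
(For anyone re-tracing this from the artifacts: a raw Go heap profile can also be inspected locally with `go tool pprof -inuse_space <heap profile>`; some viewers default to allocated rather than in-use bytes, which is why the switch matters.)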

@tbg tbg changed the title roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed [CRDB-25503 replication mem bounds] roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed [CRDB-25503 replication send oom] Jul 25, 2023

pav-kv commented Aug 7, 2023

The first of the two failures is not an OOM; it just times out and gets killed after 1h. Here are some metrics:

[three metrics screenshots]

At the peak it uses about half of the memory, and it stays under the 125 MB/s disk throughput limit.


pav-kv commented Aug 10, 2023

The graphs above look unusual. Most of the time the node should be operating close to disk throughput capacity, so the timeout makes sense: the writes were too slow for some reason.
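
As a rough back-of-envelope (assuming the default 3x replication and counting only the ingest writes, both of which are assumptions on my part): restoring 400 GB at 3x replication across 4 nodes is roughly 300 GB of writes per node, and at the 125 MB/s gp3 limit that alone takes about 2,400 s, i.e. ~40 minutes, before any read traffic or compaction write amplification. A node running even moderately below the disk limit therefore runs into the 1h test timeout.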

To compare, I ran this test manually and see something like this (very similar to the graphs in the 8TB test):

[three metrics screenshots]

pav-kv added a commit to pav-kv/cockroach that referenced this issue Aug 22, 2023
The restore/tpce/* family of tests on AWS max out the default 125 MB/s EBS
throughput. In contrast, similar tests in GCE provision for more throughput and
don't max it out.

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all
restore tests on AWS, so that the tests don't work at the edge of overload.

This both brings some parity between testing on GCE and AWS, and reduces the
likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes cockroachdb#107609
Touches cockroachdb#106248
Epic: none
Release note: none

pav-kv commented Aug 22, 2023

Closing, as this is tracked by #107609.

@pav-kv pav-kv closed this as completed Aug 22, 2023
craig bot pushed a commit that referenced this issue Aug 22, 2023
109221: roachtest: provision 250 MB/s for restore tests on AWS r=pavelkalinnikov a=pavelkalinnikov

The `restore/tpce/*` family of tests on AWS max out the default 125 MB/s EBS throughput. In contrast, similar tests in GCE provision for more throughput and [don't max it out](#107609 (comment)).

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all `restore` tests on AWS, so that the tests don't work at the edge of overload.

This both brings some parity between testing on GCE and AWS, and reduces the likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes #107609
Touches #106248
Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <[email protected]>
pav-kv added a commit that referenced this issue Aug 23, 2023
The restore/tpce/* family of tests on AWS max out the default 125 MB/s EBS
throughput. In contrast, similar tests in GCE provision for more throughput and
don't max it out.

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all
restore tests on AWS, so that the tests don't work at the edge of overload.

This both brings some parity between testing on GCE and AWS, and reduces the
likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes #106248
Epic: none
Release note: none