
roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed [CRDB-25503 replication send oom] #106248

Closed
cockroach-teamcity opened this issue Jul 6, 2023 · 13 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.


cockroach-teamcity commented Jul 6, 2023

roachtest.restore/tpce/400GB/aws/nodes=4/cpus=8 failed with artifacts on release-23.1 @ fa2d7f7c9894d701ac4a393f058aa84552957087:

(test_runner.go:1073).runTest: test timed out (1h0m0s)
(monitor.go:137).Wait: monitor failure: monitor task failed: output in run_080048.232311134_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z' " returned: COMMAND_PROBLEM: exit status 137
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=4/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-29460

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Jul 6, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jul 6, 2023
@cockroach-teamcity

roachtest.restore/tpce/400GB/aws/nodes=4/cpus=8 failed with artifacts on release-23.1 @ 15480b8496e7928845330055d191e331b228eb66:

(monitor.go:137).Wait: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)
(test_runner.go:1120).func1: 1 dead node(s) detected
(test_runner.go:1120).func1: operation "invalid descriptors check" timed out after 1m0.027s (given timeout 1m0s): pq: query execution canceled
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=4/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!


dt commented Jul 9, 2023

n3 OOMed; 96% of its memory usage is in raft, just over 11 GB.


dt commented Jul 9, 2023

cc @irfansharif

@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 9, 2023

irfansharif commented Jul 10, 2023

+cc @pavelkalinnikov. #106486 is the other AWS roachtest failure.

@pav-kv pav-kv added the A-kv-replication Relating to Raft, consensus, and coordination. label Jul 10, 2023

blathers-crl bot commented Jul 10, 2023

cc @cockroachdb/replication

@pav-kv pav-kv added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 10, 2023
@irfansharif

+cc #73376.

@shralex shralex removed the T-kv KV Team label Jul 18, 2023

tbg commented Jul 19, 2023

Looking at the test history, this test started running in earnest1 at the end of March '23, and promptly failed with an OOM on March 31 (edit: that was actually another, similar test): https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestFipsNightlyAwsBazel/9365035?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true


So from that alone we can argue that this is not a release blocker: we introduced a new test, it fails, and it likely highlights a pre-existing problem. Mechanistically, we understand why this class of test is likely to fail: memory bounding in the replication layer is insufficient and causes issues when the disks are overloaded, which is likely on AWS due to the relatively slow default gp3 provisioning (combined r+w throughput of 125 MB/s).
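
To make the failure mode concrete, here is a deliberately simplified Go sketch of an outgoing per-peer send queue with an optional byte budget. The names (`raftMsg`, `sendQueue`, `enqueue`) are hypothetical and this is not CockroachDB's actual RaftTransport code; it only illustrates how a queue bounded by message count but not by bytes lets a slow receiver (or a sender stalled behind an overloaded disk) grow the backlog until the node OOMs.

```go
// Simplified, hypothetical sketch; not the actual RaftTransport implementation.
package transportsketch

import "sync/atomic"

// raftMsg stands in for an outgoing raft message (log entries, snapshot
// chunks, etc.).
type raftMsg struct {
	payload []byte
}

// sendQueue models the outgoing queue to a single peer. It is bounded by
// message count (the channel capacity) and, optionally, by total payload
// bytes. maxBytes == 0 disables the byte budget, which is the unbounded case:
// a slow consumer lets the queued payloads grow to many gigabytes of live heap.
type sendQueue struct {
	ch       chan raftMsg
	maxBytes int64
	curBytes atomic.Int64
}

func newSendQueue(msgCap int, maxBytes int64) *sendQueue {
	return &sendQueue{ch: make(chan raftMsg, msgCap), maxBytes: maxBytes}
}

// enqueue drops the message (returns false) once the byte budget is exceeded.
// Dropping is safe for raft, which retransmits; buffering without a byte
// bound is what turns a slow disk into an OOM. The check is racy under
// concurrent enqueues, so the budget is a soft limit.
func (q *sendQueue) enqueue(m raftMsg) bool {
	sz := int64(len(m.payload))
	if q.maxBytes > 0 && q.curBytes.Load()+sz > q.maxBytes {
		return false
	}
	select {
	case q.ch <- m:
		q.curBytes.Add(sz)
		return true
	default:
		return false // full by message count
	}
}

// dequeue is called by the goroutine that writes messages out to the peer.
func (q *sendQueue) dequeue() (raftMsg, bool) {
	select {
	case m := <-q.ch:
		q.curBytes.Add(-int64(len(m.payload)))
		return m, true
	default:
		return raftMsg{}, false
	}
}
```

The byte-budgeting itself is the work tracked under CRDB-25503; the sketch only shows why an unbounded send queue plus an overloaded disk adds up to the ~11 GB of raft memory reported above.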

Footnotes

  1. probably introduced a little earlier: https://github.com/cockroachdb/cockroach/commit/b588f1f02c4e7bd4a10a08d9b20a914e03751e40


tbg commented Jul 19, 2023

I need to spend some more time here to validate that, though. Over on #106496 I'm seeing that the test already exists on 27fcfd3, which is from Feb 28. Perhaps CI is not retaining history all the way back, or we changed the CI config only late in March to actually run these tests. Will look a bit more later.

@tbg tbg assigned tbg and unassigned pav-kv Jul 19, 2023

tbg commented Jul 19, 2023

Looking across branches, the only failures we have for this particular test are three recent ones; in contrast there are 188 successful runs (dating back to late March, before which I think CI has deleted history).


The first failure is `s3 object does not exist: external_storage: file doesn't exist`, and it was triggered by #105969, which was backported on the same day (July 3) as well1.

We then see failures on July 6 and July 9, followed by (to date) 20 successful runs.
So perhaps changing the bucket altered the speed with which the fixtures are ingested, but the 20 subsequent successes (and a few interspersed ones) suggest that this is not the case.

If there is a regression, it is not a common one, and it will be extremely difficult to pinpoint. On top of that, since this is AWS, we don't have Grafana2, which would help a lot with comparing metrics side by side over time.

Another PR whose timing lines up is #104861. It was merged on July 5 but not backported, so it cannot affect the release-23.1 branch; besides, the same argument about the many successful runs applies here too.

Realistically, we will have to live with these test failures and address them one by one through CRDB-25503.

Handing this back to @pavelkalinnikov to track the eventual resolution through https://cockroachlabs.atlassian.net/browse/CRDB-25503.

Footnotes

  1. https://github.com/cockroachdb/cockroach/pull/106073

  2. https://github.com/cockroachdb/cockroach/issues/97788

@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 19, 2023
@tbg tbg assigned pav-kv and unassigned tbg Jul 19, 2023
@tbg tbg added the A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. label Jul 24, 2023
@tbg tbg changed the title roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed [CRDB-25503 replication mem bounds] Jul 25, 2023

tbg commented Jul 25, 2023

https://pprof.me/4ded4ab/ (don't forget to switch to inuse bytes) shows the memory sitting in the raft transport send queue.
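
(For anyone re-tracing this from the artifacts: a raw Go heap profile can also be inspected locally with `go tool pprof -inuse_space <heap profile>`; some viewers default to allocated rather than in-use bytes, which is why the switch matters.)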

@tbg tbg changed the title roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed [CRDB-25503 replication mem bounds] roachtest: restore/tpce/400GB/aws/nodes=4/cpus=8 failed [CRDB-25503 replication send oom] Jul 25, 2023

pav-kv commented Aug 7, 2023

The first of the two failures is not an OOM; it just times out and gets killed after 1h. Here are some metrics:

[three metrics screenshots]

At the peak it uses about half of the memory, and it stays under the 125 MB/s disk throughput limit.


pav-kv commented Aug 10, 2023

The graphs above look unusual. Most of the time the node should be operating close to disk throughput capacity, so the timeout makes sense: the writes were too slow for some reason.
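
As a rough back-of-envelope (assuming the default 3x replication and counting only the ingest writes, both of which are assumptions on my part): restoring 400 GB at 3x replication across 4 nodes is roughly 300 GB of writes per node, and at the 125 MB/s gp3 limit that alone takes about 2,400 s, i.e. ~40 minutes, before any read traffic or compaction write amplification. A node running even moderately below the disk limit therefore runs into the 1h test timeout.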

To compare, I ran this test manually and see something like this (very similar to the graphs in the 8TB test):

[three metrics screenshots]

pav-kv added a commit to pav-kv/cockroach that referenced this issue Aug 22, 2023
The restore/tpce/* family of tests on AWS max out the default 125 MB/s EBS
throughput. In contrast, similar tests in GCE provision for more throughput and
don't max it out.

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all
restore tests on AWS, so that the tests don't work at the edge of overload.

This both brings some parity between testing on GCE and AWS, and reduces the
likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes cockroachdb#107609
Touches cockroachdb#106248
Epic: none
Release note: none

pav-kv commented Aug 22, 2023

Closing, as this is tracked by #107609.

@pav-kv pav-kv closed this as completed Aug 22, 2023
craig bot pushed a commit that referenced this issue Aug 22, 2023
109221: roachtest: provision 250 MB/s for restore tests on AWS r=pavelkalinnikov a=pavelkalinnikov

The `restore/tpce/*` family of tests on AWS max out the default 125 MB/s EBS throughput. In contrast, similar tests in GCE provision for more throughput and [don't max it out](#107609 (comment)).

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all `restore` tests on AWS, so that the tests don't work at the edge of overload.

This both brings some parity between testing on GCE and AWS, and reduces the likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes #107609
Touches #106248
Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <[email protected]>
pav-kv added a commit that referenced this issue Aug 23, 2023
The restore/tpce/* family of tests on AWS max out the default 125 MB/s EBS
throughput. In contrast, similar tests in GCE provision for more throughput and
don't max it out.

This commit bumps the provisioned throughput from 125 MB/s to 250 MB/s in all
restore tests on AWS, so that the tests don't work at the edge of overload.

This both brings some parity between testing on GCE and AWS, and reduces the
likelihood of raft OOMs (which manifest more often when the disk is overloaded).

Fixes #106248
Epic: none
Release note: none