roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed #97019

Closed
cockroach-teamcity opened this issue Feb 12, 2023 · 33 comments · Fixed by #99723

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 12, 2023

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 2a7edbeb0737b1309064c25c641a309c2980d9ba:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: full test specs: restore/tpce/8TB/inc-count=48/aws/nodes=10/cpus=8/volSize=2000GB: output in run_163914.880098375_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE FROM LATEST IN 's3://cockroach-fixtures/backups/tpc-e/customers=500000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-01-05 20:45:00'" returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_163914.888338965_n1_cockroach-sql-insecu.log: exit status 1

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-24470

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 12, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 12, 2023
@rhu713 rhu713 self-assigned this Feb 13, 2023
@rhu713
Contributor

rhu713 commented Feb 14, 2023

The error is:

E230212 19:04:19.947035 2000 cloud/cloud_io.go:223 ⋮ [T1,n10,f‹7235878f›,job=839350871328948225,distsql.gateway=‹1›] 9061  Read err: read tcp ‹10.12.21.166:37394› -> ‹3.5.28.175:443›: read: connection timed out

I believe this error is already covered by the default retryable errors. We don't set custom timeouts or HTTP clients anywhere in s3_storage.go, so the timeout presumably comes from one of the timeouts in http.DefaultTransport.

We can look into using a custom HTTP client in s3_storage with longer timeouts, but I'm going to remove the release blocker for now.
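
For illustration, a minimal sketch of such a client with timeouts bumped relative to http.DefaultTransport; the concrete values, the newS3HTTPClient name, and the omission of the AWS SDK wiring that s3_storage.go would need are all assumptions on my part:

```go
// Sketch only: an HTTP client with more generous timeouts than
// http.DefaultTransport. Values are illustrative, not a vetted
// recommendation, and wiring this into the AWS SDK session built by
// s3_storage.go is not shown.
package s3util

import (
	"net"
	"net/http"
	"time"
)

func newS3HTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   60 * time.Second, // http.DefaultTransport uses 30s
				KeepAlive: 60 * time.Second,
			}).DialContext,
			TLSHandshakeTimeout:   30 * time.Second, // http.DefaultTransport uses 10s
			ResponseHeaderTimeout: 2 * time.Minute,  // unset (no limit) in http.DefaultTransport
			IdleConnTimeout:       90 * time.Second,
			MaxIdleConnsPerHost:   64,
		},
	}
}
```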

@rhu713 rhu713 removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 14, 2023
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 64a867acd25c0a214209957eefb6483d1158b4f0:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:989).runTest: test timed out (5h0m0s)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@msbutler
Collaborator

msbutler commented Feb 15, 2023

Note that in last night's run I accidentally changed the test to restore from 48 incremental backups instead of 11. The timeout may still be worth investigating. I'm fixing that here:
#97217 (review)

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 3d054f37c7c87f53cb56fac4e5500f0d1130d09a:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 137)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@msbutler msbutler added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 19, 2023
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ dd2749ae4ab61eed2f99238acb74e8d3c6b4cb1d:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: read tcp 172.17.0.3:49252 -> 3.21.245.171:26257: read: connection reset by peer

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 286b3e235171a39b8f9910555affcc7ce310741a:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 137)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@pav-kv pav-kv self-assigned this Feb 23, 2023
@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

Seeing a bunch of range splits to accommodate the import, up to 16:41:

[screenshot: range splits during the import, ramping up until ~16:41]

At 16:40, a stream of SST ingestions started: 239 messages like this between 16:40 and 16:48 (when the crash happened):

I230221 16:40:21.116097 267 kv/kvserver/replica_proposal.go:546 ⋮ [T1,n2,s2,r919/2:‹/Table/134/1/20000012{05…-51…}›,raft] 3557  ingesting SST of size 8.6 MiB at index 56 took 0.55s (0.14s on which in PreIngestDelay)

Shortly after, these errors started appearing in the logs:

W230221 16:42:49.364379 251 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1005/2:‹/Table/134/1/2000001{75…-80…}›] 5351  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:44:31.921049 233 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1008/2:‹/Table/134/1/20000018{09…-61…}›] 6456  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:44:39.108249 246 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r2114/1:‹/Table/134/1/2000040{77…-83…}›] 6561  1 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:45:09.598503 226 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1049/2:‹/Table/134/1/2000003{85…-90…}›] 7076  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:47:26.415719 264 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1025/2:‹/Table/134/1/200000{195…-200…}›] 8268  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:47:29.875566 259 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1047/4:‹/Table/134/1/20000152{31…-77…}›] 8298  4 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:47:46.333238 255 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1228/2:‹/Table/134/1/2000015{27…-32…}›] 8488  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal

The default uncommitted entry size limit is 16 MiB. The limit is soft, so a single SST always fits, but here we see that an attempt to append another one exceeds the limit and the proposal gets dropped.

This indicates that there is probably a build-up of AddSST commands in memory.
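
For context, here is a minimal sketch of the knob in go.etcd.io/raft/v3 behind that warning; the surrounding Config fields are illustrative, and the assumption that the 16 MiB value comes from CockroachDB's own configuration rather than a library default is mine:

```go
// Sketch only: the uncommitted-entry size limit in go.etcd.io/raft/v3,
// which produces the "dropping proposal" warnings quoted above.
package raftcfg

import raft "go.etcd.io/raft/v3"

func exampleConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:            id,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       storage,
		// Soft cap on the aggregate size of uncommitted proposals held in
		// memory. The first proposal is admitted even if oversized (hence
		// "a single SST always fits"), but once the cap is exceeded,
		// further proposals are dropped with the warning seen above.
		MaxUncommittedEntriesSize: 16 << 20, // 16 MiB, the value discussed above
	}
}
```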

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

Given that there are ~4400 replicas ingesting, and each SST is ~8.3 MiB (also 3.2 MiB appears in logs less frequently), the build-up can easily reach tens of GiB.
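
Back-of-the-envelope on those numbers (assuming roughly one buffered SST per ingesting replica, which is an assumption rather than something the metrics show directly): 4,400 × 8.3 MiB ≈ 36 GiB.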

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

Looks like #73376.

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

In this test n2 stands out with a slow Raft log commit latency:

[screenshot: Raft log commit latency on n2, p50 and p99]

(at the 50th percentile too, not just the 99th), which is probably why this node ended up with an in-memory build-up.

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

n2 is also getting "recovery" snapshots.

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

It's also reporting slow heartbeats (the corresponding messages also show up in the logs):

[screenshot: slow liveness heartbeats on n2]

Liveness heartbeats do a local store write, which is probably slow here (given that the log commit latency on the same node is O(seconds)).

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

There is engine backpressure for AddSST on n2:

[screenshot: engine backpressure for AddSST on n2]

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

Draft notes while investigating with @tbg:

  • disk seems overloaded: writes flatline at 250 MB/s, and when there are reads on this node the write rate dips, so the disk is simply overloaded overall. Conjecture: n2 ran out of EBS burst credit and got throttled; maybe other nodes didn't due to imbalance. We're not sure, and we don't know the instance type, etc.
  • the raft OOM is just a symptom of us not handling disk overload well here --> ping the admission control folks (irfan): the disk is "obviously" under stress, but the LSM is "fine", and AddSSTs get throttled above raft. The conjecture is that follower SSTs come in, land in the receive queue, get flushed into the RawNode's unstable log, and then there is unbounded in-memory buffering there. We would treat this as an admission control scenario and should discuss it. Follower pausing wouldn't have helped here. n2 was probably no longer subject to proposal quota enforcement on the leaseholders, but why? Maybe a snapshot
  • the raft receive queue stayed below 1 MB throughout
  • understanding the proposal quota here would be helpful. We'd like to know on how many leaseholders n2 was being ignored for quota enforcement (or how many MB of entries ignored n2 for quota). If it was ignored, the faster quorum could make progress at its own pace; n2 would be drinking from a firehose with more flow than its disk can handle, so unbounded memory usage would occur in the unstable log. This is likely what we're seeing here, but it is not easy to prove from the metrics we have.
  • if proposal quota was enforced for n2, we should see at most num_leaseholders * 16 MB piling up at n2 (in practice less, since not all ranges have active writes); see the back-of-the-envelope after this list
  • a metric for the size of the unstable log would be nice too, since unbounded memory growth can occur there
  • is this an old test, and why is it failing now? Maybe it has to do with async log appends, which allow more entries to enter the unstable log quickly. Go through the history of the test, see when it started failing, and see whether that lines up with any notable PRs.
  • it's a new test: backupccl: introduce new restore roachtest framework #94143
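
To put a number on the quota bound above (using the ~4,400 ingesting replicas from earlier as a loose, assumed upper bound on the number of leaseholders involved): at most ~4,400 × 16 MB ≈ 70 GB could pile up, though in practice far less, since only a fraction of ranges have active writes at any time.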

@msbutler
Collaborator

@pavelkalinnikov this is a new test that adds much better restore workload coverage: it uses a much better workload (tpce >>>> bank) and restores more data (8 TB vs. 2 TB).

@rhu713 landed a few PRs that significantly changed restore, summarized in this backport: #97210, the latest of which merged last week.

My only guess at a correlation between his PRs and this failure is that they may make restore more prone to distributing work unevenly. @rhu713, any thoughts?

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

@msbutler @rhu713 One of the key observations was this:

[screenshot: per-node disk read/write throughput]

You can barely see it here (the yellow line), but:

  • Disk writes on n2 are a flat line capped at 250 MiB/s,
  • except during a couple of "Disk Read" spikes, during which "Disk Write" dips down.

You can also see that the nodes have relatively similar loads, yet for some reason only this node is capped at 250 MiB/s. Can this be related to the AWS node/disk types provisioned for this test? What instance type is this?

@tbg
Member

tbg commented Mar 15, 2023

Not a failing run, but there's a good correlation between unstable-log build-up and memory use, though it doesn't explain all of the memory usage. Looking at the last peak, for example, allocbytes (middle) peaked at 3.47 GiB, while the corresponding peaks in the raft receive queue (top) and the unstable log (bottom) account for only 170 MB and 485 MB of that, respectively. Unfortunately this still leaves a large chunk unaccounted for, and there could be artifacts, such as the top and bottom metrics not showing their true peaks because of the 10s poll interval.

[screenshot: raft receive queue (top), allocbytes (middle), unstable log size (bottom)]
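
Doing the arithmetic on those numbers: 170 MB + 485 MB = 655 MB accounted for, out of a ~3,550 MB (3.47 GiB) peak, which leaves roughly 2.8 GiB unexplained by these two metrics.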

@erikgrinaker erikgrinaker added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 16, 2023
@erikgrinaker
Contributor

Relabeling as a GA-blocker for now. This is a known issue, and it's unclear whether there's anything we can or should do here for 23.1.

tbg added a commit to tbg/cockroach that referenced this issue Mar 16, 2023
This is a WIP because the behavior when machine types with local SSD are
used is unclear. For example, on AWS, roachtest prefers the c5d family,
which all come with local SSD storage. But looking into
`awsStartupScriptTemplate`, it seems unclear how to make sure that the
EBS disk(s) get mounted as /mnt/data1 (which is probably what the
default should be).

We could also entertain straight-up preventing combinations that would
lead to an inhomogeneous RAID0. I imagine we'd have to take a round of
failures to find all of the places in which it happens, but perhaps
a "snitch" can be inserted instead so that we can detect all such
callers and fix them up before arming the check.

By the way, EBS disks on AWS come with a default of 125 MB/s, which is
less than this RAID0 gets "most of the time", so we can expect some
tests to behave differently after this change. I still believe this
is worth it: debugging is so much harder when you're on top of
storage that's hard to predict and doesn't resemble any production
deployment.

----

I wasted weeks of my life on this before, and it almost happened again!
When you run a roachtest that asks for an AWS cXd machine (i.e. compute
optimized with NVMe local disk), and you specify a VolumeSize, you also
get an EBS volume. Prior to these commit, these would be RAID0'ed
together.

This isn't sane: the resulting gp3 EBS volume is very different from
the local NVMe volume in every way, and it led to hard-to-understand
write-throughput behavior.

This commit defaults to *not* using RAID0.

Touches cockroachdb#98767.
Touches cockroachdb#98576.
Touches cockroachdb#97019.

Epic: none
Release note: None
@tbg
Member

tbg commented Mar 22, 2023

Wrapping up my investigation of this.

Main findings:

  • the roachtest was deployed on a weird RAID0 that led to inhomogeneous throughput being available across the disks in the cluster. This is not something we tolerate well, and it's a known issue that you might then see memory build-up in the replication layer [1]. The investigation here helped us better understand where memory is held alive.
  • without the weird RAID0, i.e. on vanilla (125 MB/s throughput) EBS gp3 volumes, we don't see OOMs [2], but I saw some runs fail with a 5h timeout, and in those cases the restore speed undergoes a phase change a couple of hours in [3]
  • @srosenberg is working on avoiding this weird RAID0 [4]

I'm removing the GA-blocker label since there isn't a "new" issue here (but feel free to re-add it should the non-OOM problems here warrant it). OOMs in this test are expected until #98783 is addressed or the hacky version specific to this test in #98767 is merged (it's the DR team's call to make, please let me know). But, after that, the test is going to continue to occasionally fail with the two problems pointed out in #99206, so DR should look into those as well.

Footnotes

  1. see https://github.com/cockroachdb/cockroach/issues/73376 and the epic this belongs to

  2. https://github.com/cockroachdb/cockroach/pull/98767

  3. https://github.com/cockroachdb/cockroach/issues/99206

  4. https://github.com/cockroachdb/cockroach/issues/98783

@tbg tbg added T-disaster-recovery and removed A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Mar 22, 2023
@tbg tbg removed their assignment Mar 22, 2023
@msbutler
Collaborator

@tbg thank you for a great investigation and for the analysis of a potentially new restore perf bug. We're going to merge this hack into the roachtest while @srosenberg works on the long-term fix.

msbutler added a commit to msbutler/cockroach that referenced this issue Mar 27, 2023
…tests

A long restore roachtest perf investigation revealed that roachprod can RAID0
local storage and AWS GP3 storage, a configuration that does not mix well with
CRDB and does not reflect a reasonable customer environment. This patch avoids
this RAID0ing in the restore roachtests, stabilizing test performance.

Informs cockroachdb#98783

Fixes cockroachdb#97019

Release note: none
craig bot pushed a commit that referenced this issue Mar 30, 2023
98509: sql: unskip TestExecBuild_sql_activity_stats_compaction r=ericharmeling a=ericharmeling

This commit unskips TestExecBuild_sql_activity_stats_compaction in local configuration.

0 failures after 15000+ runs

Fixes #91600.

Epic: none

Release note: None

99723: backupccl: avoid RAID0ing local NVMe and GP3 storage in restore roachtests r=srosenberg a=msbutler

A long restore roachtest perf investigation revealed that roachprod can RAID0 local storage and AWS GP3 storage, a configuration that does not mix well with CRDB and does not reflect a reasonable customer environment. This patch avoids this RAID0ing in the restore roachtests, stabilizing test performance.

Informs #98783

Fixes #97019

Release note: none

99843: kvserver: Add a metric for in-progress snapshots r=kvoli a=andrewbaptist

Fixes: #98242

Knowing how many delegate snapshot requests are currently in-progress will be useful for detecting problems. This change adds a metric for this. It also updates the names of the previous stats to have the prefix `range.snapshots` vs `range.snapshot` to be consistent with other stats.

Epic: none

Release note: None

99867: backupccl: lower the buffer size of doneScatterCh in gen split and scatter r=rhu713 a=rhu713

Previously, doneScatterCh in GenerativeSplitAndScatterProcessor had a buffer large enough to never block: its size was equal to the number of import spans in the restore job. This could cause restore to buffer all restore span entries in memory at the same time. Lower the buffer size to numNodes * maxConcurrentRestoreWorkers, which is the maximum number of entries that can be processed in parallel downstream.

Release note: None
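
A minimal sketch of the sizing change described in #99867 above; doneScatterCh, numNodes, and maxConcurrentRestoreWorkers come from the PR description, while the entry type and helper function are illustrative assumptions, not the actual processor code:

```go
// Sketch only: bound the scatter-results channel by downstream parallelism
// rather than by the total number of import spans in the restore job.
package restoresketch

// restoreSpanEntry stands in for the real restore span entry type.
type restoreSpanEntry struct{}

func makeDoneScatterCh(numImportSpans, numNodes, maxConcurrentRestoreWorkers int) chan restoreSpanEntry {
	// Before: a buffer big enough to hold every import span at once, so the
	// producer never blocks and the whole restore plan can sit in memory.
	_ = make(chan restoreSpanEntry, numImportSpans)

	// After: only as many buffered entries as can be consumed in parallel
	// downstream; the producer blocks once this limit is reached.
	return make(chan restoreSpanEntry, numNodes*maxConcurrentRestoreWorkers)
}
```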

100099: leaktest: ignore the opencensus worker r=pavelkalinnikov,herkolategan a=knz

Fixes #100098.

Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in 64edfd1 Mar 30, 2023
blathers-crl bot pushed a commit that referenced this issue Mar 30, 2023
…tests

A long restore roachtest perf investigation revealed that roachprod can RAID0
local storage and AWS GP3 storage, a configuration that does not mix well with
CRDB and does not reflect a reasonable customer environment. This patch avoids
this RAID0ing in the restore roachtests, stabilizing test performance.

Informs #98783

Fixes #97019

Release note: none