roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed #97019

Closed
cockroach-teamcity opened this issue Feb 12, 2023 · 33 comments · Fixed by #99723

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 12, 2023

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 2a7edbeb0737b1309064c25c641a309c2980d9ba:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: full test specs: restore/tpce/8TB/inc-count=48/aws/nodes=10/cpus=8/volSize=2000GB: output in run_163914.880098375_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE FROM LATEST IN 's3://cockroach-fixtures/backups/tpc-e/customers=500000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-01-05 20:45:00'" returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_163914.888338965_n1_cockroach-sql-insecu.log: exit status 1

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-24470

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 12, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 12, 2023
@rhu713 rhu713 self-assigned this Feb 13, 2023
@rhu713
Contributor

rhu713 commented Feb 14, 2023

The error is:

E230212 19:04:19.947035 2000 cloud/cloud_io.go:223 ⋮ [T1,n10,f‹7235878f›,job=839350871328948225,distsql.gateway=‹1›] 9061  Read err: read tcp ‹10.12.21.166:37394› -> ‹3.5.28.175:443›: read: connection timed out

I believe this error is already covered by the default retryable errors. We don't set custom timeouts or HTTP clients anywhere in s3_storage.go, so the timeout presumably comes from one of the timeouts in http.DefaultTransport.

We can look into using a custom HTTP client in s3_storage with longer timeouts, but I'm going to remove the release blocker for now.
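
For illustration, a minimal sketch of such a client with timeouts bumped relative to http.DefaultTransport; the concrete values, the newS3HTTPClient name, and the omission of the AWS SDK wiring that s3_storage.go would need are all assumptions on my part:

```go
// Sketch only: an HTTP client with more generous timeouts than
// http.DefaultTransport. Values are illustrative, not a vetted
// recommendation, and wiring this into the AWS SDK session built by
// s3_storage.go is not shown.
package s3util

import (
	"net"
	"net/http"
	"time"
)

func newS3HTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   60 * time.Second, // http.DefaultTransport uses 30s
				KeepAlive: 60 * time.Second,
			}).DialContext,
			TLSHandshakeTimeout:   30 * time.Second, // http.DefaultTransport uses 10s
			ResponseHeaderTimeout: 2 * time.Minute,  // unset (no limit) in http.DefaultTransport
			IdleConnTimeout:       90 * time.Second,
			MaxIdleConnsPerHost:   64,
		},
	}
}
```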

@rhu713 rhu713 removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 14, 2023
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 64a867acd25c0a214209957eefb6483d1158b4f0:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:989).runTest: test timed out (5h0m0s)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@msbutler
Collaborator

msbutler commented Feb 15, 2023

Note that in last night's run I accidentally changed the test to restore from 48 incremental backups instead of 11. The timeout may still be worth investigating. I'm fixing that here:
#97217 (review)

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 3d054f37c7c87f53cb56fac4e5500f0d1130d09a:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 137)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@msbutler msbutler added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 19, 2023
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ dd2749ae4ab61eed2f99238acb74e8d3c6b4cb1d:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: read tcp 172.17.0.3:49252 -> 3.21.245.171:26257: read: connection reset by peer

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 286b3e235171a39b8f9910555affcc7ce310741a:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 137)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@pav-kv pav-kv self-assigned this Feb 23, 2023
@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

Seeing a bunch of range splits to accommodate the import, up to 16:41:

[screenshot: range splits during the import, ramping up until ~16:41]

At 16:40, a stream of SST ingestions started: 239 messages like this between 16:40 and 16:48 (when the crash happened):

I230221 16:40:21.116097 267 kv/kvserver/replica_proposal.go:546 ⋮ [T1,n2,s2,r919/2:‹/Table/134/1/20000012{05…-51…}›,raft] 3557  ingesting SST of size 8.6 MiB at index 56 took 0.55s (0.14s on which in PreIngestDelay)

Shortly after, these errors started appearing in the logs:

W230221 16:42:49.364379 251 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1005/2:‹/Table/134/1/2000001{75…-80…}›] 5351  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:44:31.921049 233 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1008/2:‹/Table/134/1/20000018{09…-61…}›] 6456  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:44:39.108249 246 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r2114/1:‹/Table/134/1/2000040{77…-83…}›] 6561  1 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:45:09.598503 226 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1049/2:‹/Table/134/1/2000003{85…-90…}›] 7076  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:47:26.415719 264 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1025/2:‹/Table/134/1/200000{195…-200…}›] 8268  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:47:29.875566 259 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1047/4:‹/Table/134/1/20000152{31…-77…}›] 8298  4 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
W230221 16:47:46.333238 255 go.etcd.io/raft/v3/raft.go:765 ⋮ [T1,n2,s2,r1228/2:‹/Table/134/1/2000015{27…-32…}›] 8488  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal

The default uncommitted entry size limit is 16 MiB. The limit is soft, so a single SST always fits, but here we see that an attempt to append another one exceeds the limit and the proposal gets dropped.

This indicates that there is probably a build-up of AddSST commands in memory.
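
For context, here is a minimal sketch of the knob in go.etcd.io/raft/v3 behind that warning; the surrounding Config fields are illustrative, and the assumption that the 16 MiB value comes from CockroachDB's own configuration rather than a library default is mine:

```go
// Sketch only: the uncommitted-entry size limit in go.etcd.io/raft/v3,
// which produces the "dropping proposal" warnings quoted above.
package raftcfg

import raft "go.etcd.io/raft/v3"

func exampleConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:            id,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       storage,
		// Soft cap on the aggregate size of uncommitted proposals held in
		// memory. The first proposal is admitted even if oversized (hence
		// "a single SST always fits"), but once the cap is exceeded,
		// further proposals are dropped with the warning seen above.
		MaxUncommittedEntriesSize: 16 << 20, // 16 MiB, the value discussed above
	}
}
```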

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

Given that there are ~4400 replicas ingesting, and each SST is ~8.3 MiB (also 3.2 MiB appears in logs less frequently), the build-up can easily reach tens of GiB.
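
Back-of-the-envelope on those numbers (assuming roughly one buffered SST per ingesting replica, which is an assumption rather than something the metrics show directly): 4,400 × 8.3 MiB ≈ 36 GiB.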

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

Looks like #73376.

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

In this test n2 stands out with a slow Raft log commit latency:

[screenshot: Raft log commit latency on n2, p50 and p99]

(at the 50th percentile too, not just the 99th), which is probably why this node ended up with an in-memory build-up.

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

n2 is also getting "recovery" snapshots.

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

It's also reporting slow heartbeats (the corresponding messages also show up in the logs):

[screenshot: slow liveness heartbeats on n2]

Liveness heartbeats do a local store write, which is probably slow here (given that the log commit latency on the same node is O(seconds)).

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

There is engine backpressure for AddSST on n2:

[screenshot: engine backpressure for AddSST on n2]

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

Draft notes while investigating with @tbg:

  • disk seems overloaded: writes flatline at 250 MB/s, and when there are reads on this node the write rate dips, so the disk is simply overloaded overall. Conjecture: n2 ran out of EBS burst credit and got throttled; maybe other nodes didn't due to imbalance. We're not sure, and we don't know the instance type, etc.
  • the raft OOM is just a symptom of us not handling disk overload well here --> ping the admission control folks (irfan): the disk is "obviously" under stress, but the LSM is "fine", and AddSSTs get throttled above raft. The conjecture is that follower SSTs come in, land in the receive queue, get flushed into the RawNode's unstable log, and then there is unbounded in-memory buffering there. We would treat this as an admission control scenario and should discuss it. Follower pausing wouldn't have helped here. n2 was probably no longer subject to proposal quota enforcement on the leaseholders, but why? Maybe a snapshot
  • the raft receive queue stayed below 1 MB throughout
  • understanding the proposal quota here would be helpful. We'd like to know on how many leaseholders n2 was being ignored for quota enforcement (or how many MB of entries ignored n2 for quota). If it was ignored, the faster quorum could make progress at its own pace; n2 would be drinking from a firehose with more flow than its disk can handle, so unbounded memory usage would occur in the unstable log. This is likely what we're seeing here, but it is not easy to prove from the metrics we have.
  • if proposal quota was enforced for n2, we should see at most num_leaseholders * 16 MB piling up at n2 (in practice less, since not all ranges have active writes); see the back-of-the-envelope after this list
  • a metric for the size of the unstable log would be nice too, since unbounded memory growth can occur there
  • is this an old test, and why is it failing now? Maybe it has to do with async log appends, which allow more entries to enter the unstable log quickly. Go through the history of the test, see when it started failing, and see whether that lines up with any notable PRs.
  • it's a new test: backupccl: introduce new restore roachtest framework #94143
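
To put a number on the quota bound above (using the ~4,400 ingesting replicas from earlier as a loose, assumed upper bound on the number of leaseholders involved): at most ~4,400 × 16 MB ≈ 70 GB could pile up, though in practice far less, since only a fraction of ranges have active writes at any time.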

@msbutler
Collaborator

@pavelkalinnikov this is a new test that adds much better restore workload coverage: it uses a much better workload (tpce >>>> bank) and restores more data (8 TB vs. 2 TB).

@rhu713 landed a few PRs that significantly changed restore, summarized in this backport: #97210, the latest of which merged last week.

My only guess at a correlation between his PRs and this failure is that they may make restore more prone to distributing work unevenly. @rhu713, any thoughts?

@pav-kv
Collaborator

pav-kv commented Feb 23, 2023

@msbutler @rhu713 One of the key observations was this:

[screenshot: per-node disk read/write throughput]

You can barely see it here (the yellow line), but:

  • Disk writes on n2 are a flat line capped at 250 MiB/s,
  • except during a couple of "Disk Read" spikes, during which "Disk Write" dips down.

You can also see that the nodes have relatively similar loads, yet for some reason only this node is capped at 250 MiB/s. Can this be related to the AWS node/disk types provisioned for this test? What instance type is this?

@tbg
Member

tbg commented Mar 15, 2023

Not a failing run, but there's a good correlation between unstable-log build-up and memory use, though it doesn't explain all of the memory usage. Looking at the last peak, for example, allocbytes (middle) peaked at 3.47 GiB, while the corresponding peaks in the raft receive queue (top) and the unstable log (bottom) account for only 170 MB and 485 MB of that, respectively. Unfortunately this still leaves a large chunk unaccounted for, and there could be artifacts, such as the top and bottom metrics not showing their true peaks because of the 10s poll interval.

[screenshot: raft receive queue (top), allocbytes (middle), unstable log size (bottom)]
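
Doing the arithmetic on those numbers: 170 MB + 485 MB = 655 MB accounted for, out of a ~3,550 MB (3.47 GiB) peak, which leaves roughly 2.8 GiB unexplained by these two metrics.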

@erikgrinaker erikgrinaker added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 16, 2023
@erikgrinaker
Contributor

Relabeling as a GA-blocker for now. This is a known issue, and it's unclear whether there's anything we can or should do here for 23.1.

tbg added a commit to tbg/cockroach that referenced this issue Mar 16, 2023
This is a WIP because the behavior when machine types with local SSD are
used is unclear. For example, on AWS, roachtest prefers the c5d family,
which all come with local SSD storage. But looking into
`awsStartupScriptTemplate`, it seems unclear how to make sure that the
EBS disk(s) get mounted as /mnt/data1 (which is probably what the
default should be).

We could also entertain straight-up preventing combinations that would
lead to an inhomogeneous RAID0. I imagine we'd have to take a round of
failures to find all of the places in which it happens, but perhaps
a "snitch" can be inserted instead so that we can detect all such
callers and fix them up before arming the check.

By the way, EBS disks on AWS come with a default of 125 MB/s, which is
less than this RAID0 gets "most of the time", so we can expect some
tests to behave differently after this change. I still believe this
is worth it: debugging is so much harder when you're on top of
storage that's hard to predict and doesn't resemble any production
deployment.

----

I wasted weeks of my life on this before, and it almost happened again!
When you run a roachtest that asks for an AWS cXd machine (i.e. compute
optimized with NVMe local disk), and you specify a VolumeSize, you also
get an EBS volume. Prior to these commit, these would be RAID0'ed
together.

This isn't sane: the resulting gp3 EBS volume is very different from
the local NVMe volume in every way, and it led to hard-to-understand
write-throughput behavior.

This commit defaults to *not* using RAID0.

Touches cockroachdb#98767.
Touches cockroachdb#98576.
Touches cockroachdb#97019.

Epic: none
Release note: None
@tbg
Member

tbg commented Mar 22, 2023

Wrapping up my investigation of this.

Main findings:

  • the roachtest was deployed on a weird RAID0 that led to inhomogeneous throughput being available across the disks in the cluster. This is not something we tolerate well, and it's a known issue that you might then see memory build-up in the replication layer [1]. The investigation here helped us better understand where memory is held alive.
  • without the weird RAID0, i.e. on vanilla (125 MB/s throughput) EBS gp3 volumes, we don't see OOMs [2], but I saw some runs fail with a 5h timeout, and in those cases the restore speed undergoes a phase change a couple of hours in [3]
  • @srosenberg is working on avoiding this weird RAID0 [4]

I'm removing the GA-blocker label since there isn't a "new" issue here (but feel free to re-add it should the non-OOM problems here warrant it). OOMs in this test are expected until #98783 is addressed or the hacky version specific to this test in #98767 is merged (it's the DR team's call to make, please let me know). But, after that, the test is going to continue to occasionally fail with the two problems pointed out in #99206, so DR should look into those as well.

Footnotes

  1. see https://github.com/cockroachdb/cockroach/issues/73376 and the epic this belongs to

  2. https://github.com/cockroachdb/cockroach/pull/98767

  3. https://github.com/cockroachdb/cockroach/issues/99206

  4. https://github.com/cockroachdb/cockroach/issues/98783

@tbg tbg added T-disaster-recovery and removed A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Mar 22, 2023
@tbg tbg removed their assignment Mar 22, 2023
@msbutler
Collaborator

@tbg thank you for a great investigation and for the analysis of a potentially new restore perf bug. We're going to merge this hack into the roachtest while @srosenberg works on the long-term fix.

msbutler added a commit to msbutler/cockroach that referenced this issue Mar 27, 2023
…tests

A long restore roachtest perf investigation revealed that roachprod can RAID0
local storage and AWS GP3 storage, a configuration that does not mix well with
CRDB and does not reflect a reasonable customer environment. This patch avoids
this RAID0ing in the restore roachtests, stabilizing test performance.

Informs cockroachdb#98783

Fixes cockroachdb#97019

Release note: none
craig bot pushed a commit that referenced this issue Mar 30, 2023
98509: sql: unskip TestExecBuild_sql_activity_stats_compaction r=ericharmeling a=ericharmeling

This commit unskips TestExecBuild_sql_activity_stats_compaction in local configuration.

0 failures after 15000+ runs

Fixes #91600.

Epic: none

Release note: None

99723: backupccl: avoid RAID0ing local NVMe and GP3 storage in restore roachtests r=srosenberg a=msbutler

A long restore roachtest perf investigation revealed that roachprod can RAID0 local storage and AWS GP3 storage, a configuration that does not mix well with CRDB and does not reflect a reasonable customer environment. This patch avoids this RAID0ing in the restore roachtests, stabilizing test performance.

Informs #98783

Fixes #97019

Release note: none

99843: kvserver: Add a metric for in-progress snapshots r=kvoli a=andrewbaptist

Fixes: #98242

Knowing how many delegate snapshot requests are currently in-progress will be useful for detecting problems. This change adds a metric for this. It also updates the names of the previous stats to have the prefix `range.snapshots` vs `range.snapshot` to be consistent with other stats.

Epic: none

Release note: None

99867: backupccl: lower the buffer size of doneScatterCh in gen split and scatter r=rhu713 a=rhu713

Previously, doneScatterCh in GenerativeSplitAndScatterProcessor had a buffer large enough to never block: its size was equal to the number of import spans in the restore job. This could cause restore to buffer all restore span entries in memory at the same time. Lower the buffer size to numNodes * maxConcurrentRestoreWorkers, which is the maximum number of entries that can be processed in parallel downstream.

Release note: None
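
A minimal sketch of the sizing change described in #99867 above; doneScatterCh, numNodes, and maxConcurrentRestoreWorkers come from the PR description, while the entry type and helper function are illustrative assumptions, not the actual processor code:

```go
// Sketch only: bound the scatter-results channel by downstream parallelism
// rather than by the total number of import spans in the restore job.
package restoresketch

// restoreSpanEntry stands in for the real restore span entry type.
type restoreSpanEntry struct{}

func makeDoneScatterCh(numImportSpans, numNodes, maxConcurrentRestoreWorkers int) chan restoreSpanEntry {
	// Before: a buffer big enough to hold every import span at once, so the
	// producer never blocks and the whole restore plan can sit in memory.
	_ = make(chan restoreSpanEntry, numImportSpans)

	// After: only as many buffered entries as can be consumed in parallel
	// downstream; the producer blocks once this limit is reached.
	return make(chan restoreSpanEntry, numNodes*maxConcurrentRestoreWorkers)
}
```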

100099: leaktest: ignore the opencensus worker r=pavelkalinnikov,herkolategan a=knz

Fixes #100098.

Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in 64edfd1 Mar 30, 2023
blathers-crl bot pushed a commit that referenced this issue Mar 30, 2023
…tests

A long restore roachtest perf investigation revealed that roachprod can RAID0
local storage and AWS GP3 storage, a configuration that does not mix well with
CRDB and does not reflect a reasonable customer environment. This patch avoids
this RAID0ing in the restore roachtests, stabilizing test performance.

Informs #98783

Fixes #97019

Release note: none