roachprod,roachtest: don't RAID0 local SSD and PD/EBS #98783

tbg · 2023-03-16T16:10:29Z

Describe the problem

To Reproduce

Run the test in #97019.

Expected

I think RAID'ing wildly different storage media together should be a very explicit opt-in. Also, nobody will take this opt-in, so it maybe shouldn't even be possible.

When multiple EBS/PD-SSD volumes are created, it makes some sense to RAID them together, but again I am skeptical (read: totally critical) of that being a good default.

Some discussion in #98782 (comment).

Jira issue: CRDB-25519

blathers-crl · 2023-03-16T16:10:46Z

cc @cockroachdb/test-eng

This works around cockroachdb#98783:

```
Instance type c5.2xlarge
```

Now the roachtest runs on standard EBS volumes (provisioned to 125 MB/s, i.e. pretty weak ones):

```
$ df -h /mnt/data1/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    2.0T  4.0G  2.0T   1% /mnt/data1
$ sudo nvme list | grep nvme1n1
/dev/nvme1n1  vol065ed9110066bb362  Amazon Elastic Block Store  1  2.15 TB / 2.15 TB  512 B + 0 B  1.0
```

Let's see how this fares. The theory is that the test previously failed due to RAID0 because some nodes would unpredictably be slower than others (depending on the striping, etc., across the RAIDed inhomogeneous volumes), which we don't handle well. Now there's symmetry, and hopefully things will be slower (since we only have 125 MB/s per volume now) but functional, i.e. no more OOMs.

I verified this via

```
./pkg/cmd/roachtest/roachstress.sh -c 10 restore/tpce/8TB/aws/nodes=10/cpus=8 -- --cloud aws --parallelism 1
```

Epic: CRDB-25503
Release note: None
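To make the cost of mixing media concrete, here is a rough back-of-the-envelope model, not a measurement. It assumes an evenly striped RAID0 array under large sequential I/O, where each stripe completes only when the slowest member finishes its share. The 125 MB/s figure comes from the commit message above; the local NVMe figure and the function name are hypothetical, chosen only for illustration.

```python
def raid0_sequential_throughput(member_mb_per_s):
    """Rough upper bound (MB/s) for large sequential I/O on an evenly
    striped RAID0 array: every member gets an equal share of each stripe,
    so the array is paced by its slowest member."""
    if not member_mb_per_s:
        raise ValueError("need at least one member volume")
    return len(member_mb_per_s) * min(member_mb_per_s)


EBS_MB_PER_S = 125          # from the commit message above
LOCAL_NVME_MB_PER_S = 1000  # hypothetical figure, for illustration only

mixed = raid0_sequential_throughput([LOCAL_NVME_MB_PER_S, EBS_MB_PER_S])
ebs_only = raid0_sequential_throughput([EBS_MB_PER_S])
print(mixed, ebs_only)  # 250 125
```

Under this simplified model the mixed array delivers only a fraction of what the local drive could do on its own, and real behavior also depends on striping, queue depths, and burst credits, which is consistent with the unpredictability the theory above describes.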

srosenberg · 2023-03-21T23:08:47Z

This is essentially a duplicate of [1], if I am reading it correctly. Suffice it to say that this configuration issue has wasted a lot of time. I'll post the findings from the roachtest --dry-run to determine how frequently this misconfiguration occurs. Subsequently, we'll decide on the best fix: either a safe default (e.g., RAID0 only local drives) or fail and let the user specify.

[1] #82423

…tests

A long restore roachtest perf investigation revealed that roachprod can RAID0 local storage and AWS GP3 storage, a configuration that does not mix well with CRDB and does not reflect a reasonable customer environment. This patch avoids this RAID0ing in the restore roachtests, stabilizing test performance.

Informs cockroachdb#98783
Fixes cockroachdb#97019
Release note: none

98509: sql: unskip TestExecBuild_sql_activity_stats_compaction r=ericharmeling a=ericharmeling

This commit unskips TestExecBuild_sql_activity_stats_compaction in local configuration. 0 failures after 15000+ runs

Fixes #91600.
Epic: none
Release note: None

99723: backupccl: avoid RAID0ing local NVMe and GP3 storage in restore roachtests r=srosenberg a=msbutler

A long restore roachtest perf investigation revealed that roachprod can RAID0 local storage and AWS GP3 storage, a configuration that does not mix well with CRDB and does not reflect a reasonable customer environment. This patch avoids this RAID0ing in the restore roachtests, stabilizing test performance.

Informs #98783
Fixes #97019
Release note: none

99843: kvserver: Add a metric for in-progress snapshots r=kvoli a=andrewbaptist

Fixes: #98242

Knowing how many delegate snapshot requests are currently in-progress will be useful for detecting problems. This change adds a metric for this. It also updates the names of the previous stats to have the prefix `range.snapshots` vs `range.snapshot` to be consistent with other stats.

Epic: none
Release note: None

99867: backupccl: lower the buffer size of doneScatterCh in gen split and scatter r=rhu713 a=rhu713

Previously, doneScatterCh in GenerativeSplitAndScatterProcessor had a large enough buffer size to never block, which was equal to the number of import spans in the restore job. This can cause restore to buffer all restore span entries in memory at the same time. Lower the limit to be numNodes * maxConcurrentRestoreWorkers, which is the max number of entries that can be processed in parallel downstream.

Release note: None

100099: leaktest: ignore the opencensus worker r=pavelkalinnikov,herkolategan a=knz

Fixes #100098.
Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>

After cockroachdb#99723 merged as a bandaid for cockroachdb#98783, the aws roachtest nightly began to panic because of a different roachtest papercut, cockroachdb#96655. Specifically, because roachtest filters which tests run on which cloud within the evaluation of the test closure, tests meant to run on gce will still get registered in an AWS run. During the registration of the gce test `restore/tpce/400GB/gce/nodes=4/cpus=8/lowmem` _on aws_, the aws test harness panics because the aws roachprod implementation does not have a low memory cpu configuration. This patch prevents this panic and should be reverted once PR cockroachdb#99402 merges.

Epic: None
Release note: None
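The registration problem described above can be illustrated with a small, language-neutral sketch. This is not roachtest code and not necessarily how the patch works; `machine_type`, `register`, and the tuple layout of `TESTS` are hypothetical names made up for the example. It shows one possible guard: filter on the target cloud before resolving any cloud-specific configuration, so a gce-only spec is never evaluated against aws.

```python
def machine_type(cloud, cpus, low_mem):
    """Hypothetical stand-in for a per-cloud machine type lookup."""
    if cloud == "aws" and low_mem:
        # Mirrors the failure described above: aws has no low-memory option.
        raise RuntimeError("aws: no low-memory machine type")
    suffix = "lowmem" if low_mem else "standard"
    return f"{cloud}-{cpus}cpu-{suffix}"


def register(tests, running_on):
    """Resolve machine types only for tests that target the current cloud."""
    registered = {}
    for name, (target_cloud, cpus, low_mem) in tests.items():
        if target_cloud != running_on:
            continue  # skip before touching cloud-specific configuration
        registered[name] = machine_type(running_on, cpus, low_mem)
    return registered


TESTS = {
    "restore/tpce/400GB/gce/nodes=4/cpus=8/lowmem": ("gce", 8, True),
    "restore/tpce/8TB/aws/nodes=10/cpus=8": ("aws", 8, False),
}

aws_tests = register(TESTS, "aws")
gce_tests = register(TESTS, "gce")
```

Without the early `continue`, the gce low-memory test would reach `machine_type("aws", 8, True)` during an AWS run and raise, which is the shape of the panic described in the commit message.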

99312: sqlsmith: add DEFAULT expressions to newly added columns r=mgartner a=mgartner

Sqlsmith now builds `ALTER TABLE .. ADD COLUMN .. DEFAULT` statements with default expressions that have different types than the column type. This is allowed if the default expression type can be assignment-casted to the column's type.

Fixes #98133
Release note: None

99348: testutils: move default test tenant message r=rharding6373 a=herkolategan

In order to reduce logging noise but still inform test authors of the default test tenant, the message has been moved to where there is a `testing.TB` interface.

Epic: CRDB-18499

99835: opt/execbuilder: add panic catching to buildRoutinePlanGenerator r=mgartner a=mgartner

This commit adds a panic catcher to callback functions created in execbuilder and invoked during evaluation of UDFs and correlated subqueries. It matches the panic catcher logic in `buildApplyJoin`.

Fixes #98786
Release note: None

100267: roachtest: own autoupgrade to TestEng r=renatolabs a=tbg

Discussed in #99479.

Epic: none
Release note: None

100286: roachtest: prevent aws roachtest panic r=rail a=msbutler

After #99723 merged as a bandaid for #98783, the aws roachtest nightly began to panic because of a different roachtest papercut #96655. Specifically, because roachtest filters which tests run on which cloud within the evaluation of the test closure, tests meant to run on gce will still get registered in an AWS run. During the registration of the gce test `restore/tpce/400GB/gce/nodes=4/cpus=8/lowmem` _on aws_, the aws test harness panics because the aws roachprod implementation does not have a low memory cpu configuration. This patch prevents this panic and should be reverted once the pr #99402 merges.

Epic: None
Release note: None

100294: tenantcapabilitiestestutils: add a missing default case r=ajwerner a=ajwerner

The test should fail if we ever add a new type of capability and use it in the data driven test but don't update the test to handle it.

Epic: none
Follow-up from #100217 (review)
Release note: None

100296: rpc: correctly check for nil before cast r=ajwerner a=andrewbaptist

As part of the fix of #99104, a cast without a nil check was introduced. This PR addresses that by only casting if it is known to be not nil.

Epic: none
Fixes: #100275
Release note: None

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Herko Lategan <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: ajwerner <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>

erikgrinaker · 2023-07-03T09:35:34Z

This popped up again in a bunch of AWS roachtest failures on 23.1 (e.g. #106013), where devices are RAIDed by default, breaking the test which does not expect this.

119906: roachprod: RAID0 only network disks if both local and network are present r=srosenberg a=nameisbhaskar

Today, if a machine has both local and network disks, both kinds are selected for RAID'ing. But RAID'ing different types of disks together causes performance differences. To address this, local disks are excluded from RAID'ing when, and only when, network disks are present.

Fixes: #98783
Epic: none

Co-authored-by: Bhaskarjyoti Bora <[email protected]>
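The selection rule in this PR description can be sketched as follows. This is a minimal illustration of the stated policy, not roachprod's implementation; the `Disk` type, the `kind` labels, the device paths, and the function name `disks_to_raid` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    path: str
    kind: str  # "local" or "network" (hypothetical labels)


def disks_to_raid(disks):
    """Pick the disks that go into the RAID0 array.

    If any network disks are present, use only those and leave local disks
    out; otherwise fall back to the local disks.
    """
    network = [d for d in disks if d.kind == "network"]
    if network:
        return network
    return [d for d in disks if d.kind == "local"]


mixed = [Disk("/dev/nvme0n1", "local"), Disk("/dev/nvme1n1", "network")]
local_only = [Disk("/dev/nvme0n1", "local"), Disk("/dev/nvme2n1", "local")]

print([d.path for d in disks_to_raid(mixed)])       # ['/dev/nvme1n1']
print([d.path for d in disks_to_raid(local_only)])  # ['/dev/nvme0n1', '/dev/nvme2n1']
```

The array never mixes the two kinds of media, while machines with only local disks keep their existing behavior.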

…sent

Today, if a machine has both local and network disks, both kinds are selected for RAID'ing. But RAID'ing different types of disks together causes performance differences. To address this, local disks are excluded from RAID'ing when, and only when, network disks are present.

Fixes: #98783
Epic: none

tbg added the C-bug and T-testeng labels Mar 16, 2023

tbg assigned srosenberg Mar 20, 2023

This was referenced Mar 20, 2023

roachtest: use c5, not c5d, for restore 8tb test #98767

Closed

[wip] roachprod: don't use RAID0 by default #98782

Closed

tbg mentioned this issue Mar 22, 2023

roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed #97019

Closed

msbutler mentioned this issue Mar 27, 2023

backupccl: avoid RAID0ing local NVMe and GP3 storage in restore roachtests #99723

Merged

blathers-crl bot mentioned this issue Mar 30, 2023

release-23.1: backupccl: avoid RAID0ing local NVMe and GP3 storage in restore roachtests #100136

Merged

msbutler mentioned this issue Mar 31, 2023

roachtest: prevent aws roachtest panic #100286

Merged

blathers-crl bot mentioned this issue Mar 31, 2023

release-23.1: roachtest: prevent aws roachtest panic #100308

Merged

erikgrinaker mentioned this issue Jul 3, 2023

roachtest: AWS provider sets up local SSD when not requested #106058

Closed

erikgrinaker mentioned this issue Jul 19, 2023

roachtest: failover/non-system/disk-stall failed #107052

Closed

tbg added the quality-friday label Jul 20, 2023

nameisbhaskar mentioned this issue Mar 5, 2024

roachprod: RAID0 only network disks if both local and network are present #119906

Merged

craig bot closed this as completed in 6fd7ff8 Mar 13, 2024

blathers-crl bot mentioned this issue Mar 13, 2024

release-23.2: roachprod: RAID0 only network disks if both local and network are present #120396

Merged

nameisbhaskar mentioned this issue Mar 13, 2024

release-23.1: roachprod: RAID0 only network disks if both local and network are present #120429

Merged

srosenberg mentioned this issue Mar 28, 2024

roachprod / roachtest: reconsider how local SSDs are configured #82423

Closed

github-project-automation bot added this to Test Engineering Aug 28, 2024

github-project-automation bot moved this to Done in Test Engineering Aug 28, 2024
