roachprod,roachtest: don't RAID0 local SSD and PD/EBS #98783

Closed
tbg opened this issue Mar 16, 2023 · 3 comments · Fixed by #119906
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. quality-friday A good issue to work on on Quality Friday T-testeng TestEng Team

Comments

@tbg
Member

tbg commented Mar 16, 2023

Describe the problem

See #98782.

To Reproduce

Run the test in #97019.

Expected

I think RAID'ing wildly different storage media together should be a very explicit opt-in. Also, nobody will take this opt-in, so maybe it shouldn't even be possible.

When multiple EBS/PD-SSD volumes are created, it makes some sense to RAID them together, but again I am skeptical (read: totally critical) of that being a good default.

Some discussion in #98782 (comment).

Jira issue: CRDB-25519

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team labels Mar 16, 2023
@blathers-crl

blathers-crl bot commented Mar 16, 2023

cc @cockroachdb/test-eng

tbg added a commit to tbg/cockroach that referenced this issue Mar 20, 2023
This works around cockroachdb#98783:

```
Instance type
c5.2xlarge
```

Now the roachtest runs on standard EBS volumes (provisioned to 125 MB/s,
i.e. pretty weak ones):

```
$ df -h /mnt/data1/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    2.0T  4.0G  2.0T   1% /mnt/data1
$ sudo nvme list | grep nvme1n1
/dev/nvme1n1     vol065ed9110066bb362 Amazon Elastic Block Store               1           2.15  TB /   2.15  TB    512   B +  0 B   1.0
```

Let's see how this fares.

The theory is that the test previously failed due to RAID0
because some nodes would unpredictably be slower than others (depending
on the striping, etc., across the RAIDed inhomogeneous volumes), which we
don't handle well. Now, there's symmetry and hopefully things will be
slower (since we only have 125 MB/s per volume now) but functional, i.e.
no more OOMs.

I verified this via

```
./pkg/cmd/roachtest/roachstress.sh -c 10 restore/tpce/8TB/aws/nodes=10/cpus=8 -- --cloud aws --parallelism 1
```

Epic: CRDB-25503
Release note: None
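
One simplified way to see why a mixed array hurts, as a rough sketch and not a measurement: if RAID0 stripes data evenly, every member receives the same share of bytes, so a large sequential transfer finishes only when the slowest member finishes. Under that assumption the array delivers roughly `n * min(member throughput)`. The 125 MB/s figure is the EBS provisioning mentioned above; the 1000 MB/s local NVMe figure and the function name are purely illustrative assumptions.

```python
def raid0_sequential_throughput(member_mb_per_s):
    """Estimate aggregate sequential throughput (MB/s) of a RAID0 array.

    Simplified model (an assumption, not a benchmark): data is striped
    evenly, so each member handles the same number of bytes and the
    transfer completes only when the slowest member is done.
    """
    if not member_mb_per_s:
        raise ValueError("a RAID0 array needs at least one member")
    return len(member_mb_per_s) * min(member_mb_per_s)


# 125 MB/s is the EBS figure from the thread; 1000 MB/s is an assumed
# local NVMe figure chosen only for illustration.
mixed = raid0_sequential_throughput([1000, 125])
local_only = raid0_sequential_throughput([1000])

print(f"local NVMe + EBS striped: ~{mixed} MB/s")
print(f"local NVMe alone:         ~{local_only} MB/s")
```

In this model, adding the slow network volume to the stripe makes the array slower than the local disk alone, which is consistent with the concern raised in this issue.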
@srosenberg
Member

This is essentially a duplicate of [1], if I am reading it correctly. Suffice it to say that this configuration issue has wasted time. I'll post the findings from `roachtest --dry-run` to determine how often this misconfiguration occurs. We'll then decide on the best fix: either a safe default (e.g., RAID0 only the local drives) or fail and let the user specify.

[1] #82423
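
The "fail and let the user specify" option could look roughly like the sketch below. This is not roachprod code: the `Disk` type, the `"local"`/`"network"` labels, `MixedRaidError`, and the `allow_mixed` flag are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    device: str  # e.g. "/dev/nvme1n1"
    kind: str    # "local" or "network" (hypothetical labels)


class MixedRaidError(ValueError):
    """Raised when disks of different kinds would be striped together."""


def check_raid0_members(disks, allow_mixed=False):
    """Return the disks to stripe, refusing mixed kinds unless opted in."""
    kinds = sorted({d.kind for d in disks})
    if len(kinds) > 1 and not allow_mixed:
        raise MixedRaidError(
            f"refusing to RAID0 disks of different kinds {kinds}; "
            "pass allow_mixed=True to opt in explicitly"
        )
    return list(disks)


# Hypothetical device names, used only to exercise the check.
mixed_disks = [Disk("/dev/nvme1n1", "network"), Disk("/dev/nvme2n1", "local")]
try:
    check_raid0_members(mixed_disks)
except MixedRaidError as err:
    print(f"error: {err}")
```

The design choice here is that the unsafe combination is never reached silently: the caller either supplies homogeneous disks or states the opt-in at the call site.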

msbutler added a commit to msbutler/cockroach that referenced this issue Mar 27, 2023
…tests

A long restore roachtest perf investigation revealed that roachprod can RAID0
local storage and AWS GP3 storage, a configuration that does not mix well with
CRDB and does not reflect a reasonable customer environment. This patch avoids
this RAID0ing in the restore roachtests, stabilizing test performance.

Informs cockroachdb#98783

Fixes cockroachdb#97019

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Mar 30, 2023
…tests

A long restore roachtest perf investigation revealed that roachprod can RAID0
local storage and AWS GP3 storage, a configuration that does not mix well with
CRDB and does not reflect a reasonable customer environment. This patch avoids
this RAID0ing in the restore roachtests, stabilizing test performance.

Informs cockroachdb#98783

Fixes cockroachdb#97019

Release note: none
craig bot pushed a commit that referenced this issue Mar 30, 2023
98509: sql: unskip TestExecBuild_sql_activity_stats_compaction r=ericharmeling a=ericharmeling

This commit unskips TestExecBuild_sql_activity_stats_compaction in local configuration.

0 failures after 15000+ runs

Fixes #91600.

Epic: none

Release note: None

99723: backupccl: avoid RAID0ing local NVMe and GP3 storage in restore roachtests r=srosenberg a=msbutler

A long restore roachtest perf investigation revealed that roachprod can RAID0 local storage and AWS GP3 storage, a configuration that does not mix well with CRDB and does not reflect a reasonable customer environment. This patch avoids this RAID0ing in the restore roachtests, stabilizing test performance.

Informs #98783

Fixes #97019

Release note: none

99843: kvserver: Add a metric for in-progress snapshots r=kvoli a=andrewbaptist

Fixes: #98242

Knowing how many delegate snapshot requests are currently in-progress will be useful for detecting problems. This change adds a metric for this. It also updates the names of the previous stats to have the prefix `range.snapshots` vs `range.snapshot` to be consistent with other stats.

Epic: none

Release note: None

99867: backupccl: lower the buffer size of doneScatterCh in gen split and scatter r=rhu713 a=rhu713

Previously, doneScatterCh in GenerativeSplitAndScatterProcessor had a large enough buffer size to never block, which was equal to the number of import spans in the restore job. This can cause restore to buffer all restore span entries in memory at the same time. Lower the limit to be numNodes * maxConcurrentRestoreWorkers, which is the max number of entries that can be processed in parallel downstream.

Release note: None

100099: leaktest: ignore the opencensus worker r=pavelkalinnikov,herkolategan a=knz

Fixes #100098.

Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Mar 30, 2023
…tests

A long restore roachtest perf investigation revealed that roachprod can RAID0
local storage and AWS GP3 storage, a configuration that does not mix well with
CRDB and does not reflect a reasonable customer environment. This patch avoids
this RAID0ing in the restore roachtests, stabilizing test performance.

Informs #98783

Fixes #97019

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Mar 31, 2023
After cockroachdb#99723 merged as a bandaid for cockroachdb#98783, the aws roachtest nightly began to
panic because of a different roachtest papercut cockroachdb#96655. Specifically, because
roachtest filters which tests run on which cloud within the evaluation of the
test closure, tests meant to run on gce will still get registered in an AWS
run. During the registration of the gce test
`restore/tpce/400GB/gce/nodes=4/cpus=8/lowmem` _on aws_, the aws test harness
panics because the aws roachprod implementation does not have a low memory cpu
configuration. This patch prevents this panic and should be reverted once
the pr cockroachdb#99402 merges.

Epic: None

Release note: None
craig bot pushed a commit that referenced this issue Mar 31, 2023
99312: sqlsmith: add DEFAULT expressions to newly added columns r=mgartner a=mgartner

Sqlsmith now builds `ALTER TABLE .. ADD COLUMN .. DEFAULT` statements
with default expressions that have different types than the column type.
This is allowed if the default expression type can be assignment-casted
to the column's type.

Fixes #98133

Release note: None


99348: testutils: move default test tenant message r=rharding6373 a=herkolategan

In order to reduce logging noise but still inform test authors of the default test tenant, the message has been moved to where there is a `testing.TB` interface.

Epic: CRDB-18499

99835: opt/execbuilder: add panic catching to buildRoutinePlanGenerator r=mgartner a=mgartner

This commit adds a panic catcher to callback functions created in
execbuilder and invoked during evaluation of UDFs and correlated
subqueries. It matches the panic catcher logic in `buildApplyJoin`.

Fixes #98786

Release note: None


100267: roachtest: own autoupgrade to TestEng r=renatolabs a=tbg

Discussed in #99479.

Epic: none
Release note: None


100286: roachtest: prevent aws roachtest panic r=rail a=msbutler

After #99723 merged as a bandaid for #98783, the aws roachtest nightly began to panic because of a different roachtest papercut #96655. Specifically, because roachtest filters which tests run on which cloud within the evaluation of the test closure, tests meant to run on gce will still get registered in an AWS run. During the registration of the gce test
`restore/tpce/400GB/gce/nodes=4/cpus=8/lowmem` _on aws_, the aws test harness panics because the aws roachprod implementation does not have a low memory cpu configuration. This patch prevents this panic and should be reverted once the pr #99402 merges.

Epic: None

Release note: None

100294: tenantcapabilitiestestutils: add a missing default case r=ajwerner a=ajwerner

The test should fail if we ever add a new type of capability and use it in the data driven test but don't update the test to handle it.

Epic: none

Follow-up from #100217 (review)

Release note: None

100296: rpc: correctly check for nil before cast r=ajwerner a=andrewbaptist

As part of the fix of #99104, a cast without a nil check was introduced. This PR addresses that by only casting if it is known to be not nil.

Epic: none
Fixes: #100275
Release note: None

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Herko Lategan <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: ajwerner <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Mar 31, 2023
After #99723 merged as a bandaid for #98783, the aws roachtest nightly began to
panic because of a different roachtest papercut #96655. Specifically, because
roachtest filters which tests run on which cloud within the evaluation of the
test closure, tests meant to run on gce will still get registered in an AWS
run. During the registration of the gce test
`restore/tpce/400GB/gce/nodes=4/cpus=8/lowmem` _on aws_, the aws test harness
panics because the aws roachprod implementation does not have a low memory cpu
configuration. This patch prevents this panic and should be reverted once
the pr #99402 merges.

Epic: None

Release note: None
@erikgrinaker
Contributor

This popped up again in a bunch of AWS roachtest failures on 23.1 (e.g. #106013), where devices are RAIDed by default, breaking tests that do not expect this.

@tbg tbg added the quality-friday A good issue to work on on Quality Friday label Jul 20, 2023
craig bot pushed a commit that referenced this issue Mar 13, 2024
119906: roachprod: RAID0 only network disks if both local and network are present r=srosenberg a=nameisbhaskar

Today, if a machine has both local and network disks, both the disks are selected for RAID'ing.

But, RAID'ing different types of disks causes performance differences.

To address this, local disks are excluded from RAID'ing whenever network disks are present.

Fixes: #98783
Epic: none

Co-authored-by: Bhaskarjyoti Bora <[email protected]>
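
The selection rule described in this commit message can be sketched as follows. This is an illustrative reading of the description, not the code from #119906; the function name and device paths are assumptions.

```python
def select_raid0_members(local_disks, network_disks):
    """Pick the devices to stripe together.

    Rule as described in the commit message: when network disks are
    present, local disks are ignored; otherwise the local disks are used.
    """
    if network_disks:
        return list(network_disks)
    return list(local_disks)


# Hypothetical device paths for illustration.
both = select_raid0_members(["/dev/nvme2n1"], ["/dev/nvme1n1"])
local_only = select_raid0_members(["/dev/nvme2n1", "/dev/nvme3n1"], [])

print(both)        # only the network disk is selected
print(local_only)  # no network disks, so the local disks are striped
```

Either way, the resulting array contains only one kind of disk, so the mixed configuration reported in this issue cannot arise from the default path.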
@craig craig bot closed this as completed in 6fd7ff8 Mar 13, 2024
blathers-crl bot pushed a commit that referenced this issue Mar 13, 2024
…sent

Today, if a machine has both local and network disks, both the disks are selected for RAID'ing.

But, RAID'ing different types of disks causes performance differences.

To address this, local disks are excluded from RAID'ing whenever network disks are present.

Fixes: #98783
Epic: none
jasminejsun pushed a commit to jasminejsun/cockroach that referenced this issue Mar 18, 2024
…sent

Today, if a machine has both local and network disks, both the disks are selected for RAID'ing.

But, RAID'ing different types of disks causes performance differences.

To address this, local disks are excluded from RAID'ing whenever network disks are present.

Fixes: cockroachdb#98783
Epic: none
Status: Done