roachprod,roachtest: don't RAID0 local SSD and PD/EBS #98783
cc @cockroachdb/test-eng
This works around cockroachdb#98783:

```
Instance type c5.2xlarge
```

Now the roachtest runs on standard EBS volumes (provisioned to 125 MB/s, i.e. pretty weak ones):

```
$ df -h /mnt/data1/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    2.0T  4.0G  2.0T   1% /mnt/data1

$ sudo nvme list | grep nvme1n1
/dev/nvme1n1 vol065ed9110066bb362 Amazon Elastic Block Store 1 2.15 TB / 2.15 TB 512 B + 0 B 1.0
```

Let's see how this fares. The theory is that the test previously failed due to RAID0 because some nodes would unpredictably be slower than others (depending on the striping, etc., across the RAIDed inhomogeneous volumes), which we don't handle well. Now there's symmetry, and hopefully things will be slower (since we only have 125 MB/s per volume now) but functional, i.e. no more OOMs. I verified this via

```
./pkg/cmd/roachtest/roachstress.sh -c 10 restore/tpce/8TB/aws/nodes=10/cpus=8 -- --cloud aws --parallelism 1
```

Epic: CRDB-25503

Release note: None
This is essentially a duplicate of [1], if I am reading it correctly. Suffice it to say that this configuration issue has caused time wastage. I'll post the findings from the roachtest.

[1] #82423
backupccl: avoid RAID0ing local NVMe and GP3 storage in restore roachtests

A long restore roachtest perf investigation revealed that roachprod can RAID0 local storage and AWS GP3 storage, a configuration that does not mix well with CRDB and does not reflect a reasonable customer environment. This patch avoids this RAID0ing in the restore roachtests, stabilizing test performance.

Informs cockroachdb#98783
Fixes cockroachdb#97019

Release note: none
98509: sql: unskip TestExecBuild_sql_activity_stats_compaction r=ericharmeling a=ericharmeling

This commit unskips TestExecBuild_sql_activity_stats_compaction in local configuration. 0 failures after 15000+ runs.

Fixes #91600.
Epic: none
Release note: None

99723: backupccl: avoid RAID0ing local NVMe and GP3 storage in restore roachtests r=srosenberg a=msbutler

A long restore roachtest perf investigation revealed that roachprod can RAID0 local storage and AWS GP3 storage, a configuration that does not mix well with CRDB and does not reflect a reasonable customer environment. This patch avoids this RAID0ing in the restore roachtests, stabilizing test performance.

Informs #98783
Fixes #97019
Release note: none

99843: kvserver: Add a metric for in-progress snapshots r=kvoli a=andrewbaptist

Fixes: #98242

Knowing how many delegate snapshot requests are currently in-progress will be useful for detecting problems. This change adds a metric for this. It also updates the names of the previous stats to have the prefix `range.snapshots` vs `range.snapshot` to be consistent with other stats.

Epic: none
Release note: None

99867: backupccl: lower the buffer size of doneScatterCh in gen split and scatter r=rhu713 a=rhu713

Previously, doneScatterCh in GenerativeSplitAndScatterProcessor had a large enough buffer size to never block, which was equal to the number of import spans in the restore job. This can cause restore to buffer all restore span entries in memory at the same time. Lower the limit to numNodes * maxConcurrentRestoreWorkers, which is the max number of entries that can be processed in parallel downstream.

Release note: None

100099: leaktest: ignore the opencensus worker r=pavelkalinnikov,herkolategan a=knz

Fixes #100098.
Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
roachtest: prevent aws roachtest panic

After cockroachdb#99723 merged as a bandaid for cockroachdb#98783, the aws roachtest nightly began to panic because of a different roachtest papercut, cockroachdb#96655. Specifically, because roachtest filters which tests run on which cloud within the evaluation of the test closure, tests meant to run on gce will still get registered in an AWS run. During the registration of the gce test `restore/tpce/400GB/gce/nodes=4/cpus=8/lowmem` _on aws_, the aws test harness panics because the aws roachprod implementation does not have a low-memory cpu configuration. This patch prevents this panic and should be reverted once PR cockroachdb#99402 merges.

Epic: None
Release note: None
99312: sqlsmith: add DEFAULT expressions to newly added columns r=mgartner a=mgartner

Sqlsmith now builds `ALTER TABLE .. ADD COLUMN .. DEFAULT` statements with default expressions that have different types than the column type. This is allowed if the default expression type can be assignment-casted to the column's type.

Fixes #98133
Release note: None

99348: testutils: move default test tenant message r=rharding6373 a=herkolategan

In order to reduce logging noise but still inform test authors of the default test tenant, the message has been moved to where there is a `testing.TB` interface.

Epic: CRDB-18499

99835: opt/execbuilder: add panic catching to buildRoutinePlanGenerator r=mgartner a=mgartner

This commit adds a panic catcher to callback functions created in execbuilder and invoked during evaluation of UDFs and correlated subqueries. It matches the panic catcher logic in `buildApplyJoin`.

Fixes #98786
Release note: None

100267: roachtest: own autoupgrade to TestEng r=renatolabs a=tbg

Discussed in #99479.

Epic: none
Release note: None

100286: roachtest: prevent aws roachtest panic r=rail a=msbutler

After #99723 merged as a bandaid for #98783, the aws roachtest nightly began to panic because of a different roachtest papercut #96655. Specifically, because roachtest filters which tests run on which cloud within the evaluation of the test closure, tests meant to run on gce will still get registered in an AWS run. During the registration of the gce test `restore/tpce/400GB/gce/nodes=4/cpus=8/lowmem` _on aws_, the aws test harness panics because the aws roachprod implementation does not have a low memory cpu configuration. This patch prevents this panic and should be reverted once the pr #99402 merges.

Epic: None
Release note: None

100294: tenantcapabilitiestestutils: add a missing default case r=ajwerner a=ajwerner

The test should fail if we ever add a new type of capability and use it in the data driven test but don't update the test to handle it.

Epic: none
Follow-up from #100217 (review)
Release note: None

100296: rpc: correctly check for nil before cast r=ajwerner a=andrewbaptist

As part of the fix of #99104, a cast without a nil check was introduced. This PR addresses that by only casting if it is known to be not nil.

Epic: none
Fixes: #100275
Release note: None

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Herko Lategan <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: ajwerner <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
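For readers unfamiliar with why mere registration can panic: the cluster spec (including the provider-specific machine type) is built eagerly, before the cloud filter decides whether the test will run. The following is a simplified, self-contained Go sketch of that pitfall; the names (`registerTest`, `machineTypeForCPUs`, `testSpec`) are hypothetical and do not reflect the actual roachtest code.

```go
package main

import "fmt"

// machineTypeForCPUs mimics a provider-specific lookup that panics on an
// unsupported configuration (a stand-in for the AWS machine-type selection
// that has no low-memory variant).
func machineTypeForCPUs(cloud string, cpus int, lowMem bool) string {
	if cloud == "aws" && lowMem {
		panic(fmt.Sprintf("aws: no low-memory machine type for %d cpus", cpus))
	}
	return fmt.Sprintf("%s-standard-%d", cloud, cpus)
}

type testSpec struct {
	name        string
	onlyCloud   string
	machineType string
}

// registerTest builds the full spec eagerly. Because the machine type is
// resolved here, a GCE-only test still exercises AWS-specific logic when the
// harness registers tests for an AWS run; the cloud filter only applies later.
func registerTest(currentCloud, name, onlyCloud string, cpus int, lowMem bool) *testSpec {
	spec := &testSpec{
		name:        name,
		onlyCloud:   onlyCloud,
		machineType: machineTypeForCPUs(currentCloud, cpus, lowMem), // panics on AWS
	}
	if onlyCloud != "" && onlyCloud != currentCloud {
		return nil // filtered out -- but too late, the spec was already built
	}
	return spec
}

func main() {
	// Registering a gce-only low-memory test during an AWS run panics before
	// the filter ever gets a chance to skip it.
	registerTest("aws", "restore/tpce/400GB/gce/nodes=4/cpus=8/lowmem", "gce", 8, true)
}
```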
This popped up again in a bunch of AWS roachtest failures on 23.1 (e.g. #106013), where devices are RAIDed by default, breaking the test, which does not expect this.
119906: roachprod: RAID0 only network disks if both local and network are present r=srosenberg a=nameisbhaskar

Today, if a machine has both local and network disks, both are selected for RAID'ing. But RAID'ing different types of disks causes performance differences. To address this, local disks are ignored for RAID'ing when network disks are also present.

Fixes: #98783
Epic: none

Co-authored-by: Bhaskarjyoti Bora <[email protected]>
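The gist of the fix above, stated as code: when enumerating a VM's disks for the data mount, drop local SSDs from the RAID0 candidate set whenever network-attached volumes are present. This is a minimal Go sketch with made-up types (`disk`, `selectDisksForRAID`) that only illustrates the selection rule; it is not roachprod's actual implementation.

```go
package main

import "fmt"

type disk struct {
	device  string
	isLocal bool // true for instance-local SSD, false for network-attached (EBS/PD)
}

// selectDisksForRAID returns the disks that should back the data mount.
// If both local and network disks are present, only the network disks are
// kept, so a RAID0 is never striped across inhomogeneous media.
func selectDisksForRAID(disks []disk) []disk {
	var local, network []disk
	for _, d := range disks {
		if d.isLocal {
			local = append(local, d)
		} else {
			network = append(network, d)
		}
	}
	if len(network) > 0 {
		return network // ignore local SSDs when network volumes exist
	}
	return local // local-only machines keep their previous behavior
}

func main() {
	disks := []disk{
		{device: "/dev/nvme0n1", isLocal: true},  // local NVMe SSD
		{device: "/dev/nvme1n1", isLocal: false}, // EBS gp3 volume
	}
	fmt.Println(selectDisksForRAID(disks)) // -> only the EBS volume
}
```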
Describe the problem
See #98782.
To Reproduce
Run the test in #97019.
Expected
I think RAID'ing wildly different storage media together should be a very explicit opt-in. Also, nobody will take this opt-in, so perhaps it shouldn't even be possible.
When multiple EBS/PD-SSD volumes are created, however, it does make some sense to RAID them together; even so, I am skeptical (read: totally critical) of that being a good default.
Some discussion in #98782 (comment).
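To make the "explicit opt-in" idea above concrete, a provisioning layer could refuse to stripe mixed media unless the caller asks for it by name. This is a hypothetical Go sketch; the `allowMixedRAID0` parameter and the helper names are invented for illustration and are not an existing roachprod option.

```go
package main

import (
	"errors"
	"fmt"
)

type volume struct {
	device  string
	isLocal bool
}

var errMixedMedia = errors.New("refusing to RAID0 local SSD together with network storage; " +
	"pass allowMixedRAID0=true to opt in explicitly")

// planRAID0 decides which devices to stripe. Mixing local and network media
// is an error unless the caller explicitly opts in.
func planRAID0(vols []volume, allowMixedRAID0 bool) ([]string, error) {
	var hasLocal, hasNetwork bool
	devices := make([]string, 0, len(vols))
	for _, v := range vols {
		if v.isLocal {
			hasLocal = true
		} else {
			hasNetwork = true
		}
		devices = append(devices, v.device)
	}
	if hasLocal && hasNetwork && !allowMixedRAID0 {
		return nil, errMixedMedia
	}
	return devices, nil
}

func main() {
	vols := []volume{
		{device: "/dev/nvme0n1", isLocal: true},
		{device: "/dev/nvme1n1", isLocal: false},
	}
	if _, err := planRAID0(vols, false); err != nil {
		fmt.Println("error:", err) // the default path rejects mixed media
	}
}
```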
Jira issue: CRDB-25519