[wip] roachprod: don't use RAID0 by default #98782
This is a WIP because the behavior when machine types with local SSD are used is unclear. For example, on AWS, roachtest prefers the c5d family, which all come with local SSD storage. But looking into `awsStartupScriptTemplate`, it is unclear how to make sure that the EBS disk(s) get mounted as /mnt/data1 (which is probably what the default should be).

We could also entertain straight-up preventing combinations that would lead to an inhomogeneous RAID0. I imagine we'd have to take a round of failures to find all of the places in which it happens, but perhaps a "snitch" can be inserted instead so that we can detect all such callers and fix them up before arming the check.

By the way, EBS disks on AWS come with a default throughput of 125 MB/s, which is less than this RAID0 gets "most of the time" - so we can expect some tests to behave differently after this change. I still believe this is worth it - debugging is so much harder when you're on top of storage that's hard to predict and doesn't resemble any production deployment.

----

I wasted weeks of my life on this before, and it almost happened again!

When you run a roachtest that asks for an AWS cXd machine (i.e. compute optimized with NVMe local disk), and you specify a VolumeSize, you also get an EBS volume. Prior to this commit, these would be RAID0'ed together.

This isn't something sane - the resulting gp3 EBS volume is very different from the local NVMe volume in every way, and it led to hard-to-understand write throughput behavior.

This commit defaults to *not* using RAID0.

Touches cockroachdb#98767.
Touches cockroachdb#98576.
Touches cockroachdb#97019.

Epic: none
Release note: None
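The "snitch" idea mentioned in the description could look roughly like the sketch below. This is not roachprod code: `diskKind`, `homogeneous`, `checkRAID0`, and the `armed` switch are hypothetical names invented for illustration. The assumption is that a check first only logs callers that would build a mixed RAID0, and starts returning an error once it is armed.

```go
package main

import (
	"fmt"
	"log"
)

// diskKind is a hypothetical classification of a volume attached to a VM.
type diskKind int

const (
	localNVMe  diskKind = iota // instance-local NVMe SSD
	networkEBS                 // network-attached volume such as EBS
)

// homogeneous reports whether all disks are of the same kind.
// An empty or single-element slice counts as homogeneous.
func homogeneous(disks []diskKind) bool {
	for _, d := range disks {
		if d != disks[0] {
			return false
		}
	}
	return true
}

// checkRAID0 validates a RAID0 request. While armed is false it acts as a
// snitch: it logs the offending caller and lets the request through. Once
// armed is true, the same condition is returned as an error.
func checkRAID0(caller string, useRAID0 bool, disks []diskKind, armed bool) error {
	if !useRAID0 || homogeneous(disks) {
		return nil
	}
	err := fmt.Errorf("%s: RAID0 requested across mixed disk kinds", caller)
	if !armed {
		log.Printf("snitch: %v", err)
		return nil
	}
	return err
}

func main() {
	disks := []diskKind{localNVMe, networkEBS}
	// Snitch mode: the problem is logged, the caller keeps working.
	fmt.Println(checkRAID0("example-test", true, disks, false))
	// Armed mode: the same request is rejected.
	fmt.Println(checkRAID0("example-test", true, disks, true))
}
```

Running in snitch mode for a while would surface the affected callers in logs without a round of test failures; flipping `armed` afterwards turns the log line into a hard error.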
Another thing and maybe easier, we can make sure we don't use the aws
Using
I think it's reasonable to default to RAID0 for local NVMes only. In the above example, we would auto-RAID0 the two local NVMes and warn the user that the remaining remote disk remains unused [2], [3]. E.g.,
prints two warning messages,
before the VM is created. (See [1] https://github.com/linux-nvme/nvme-cli
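A minimal sketch of the policy proposed in this comment, assuming the set of volumes is known before the VM is created. The `volume` type, the `planStorage` function, and the warning text are hypothetical and do not reproduce the messages referred to above. The fallback for machines without local disks is also an assumption, not something stated in the thread.

```go
package main

import "fmt"

// volume is a hypothetical description of a disk that would be attached.
type volume struct {
	name  string // illustrative device name
	local bool   // true for instance-local NVMe, false for remote (e.g. EBS)
}

// planStorage returns the volumes to use for the data directory plus one
// warning per volume that would be left unused. Only local NVMe volumes are
// striped together; remote volumes are never mixed into the RAID0.
func planStorage(vols []volume) (use []volume, warnings []string) {
	var remote []volume
	for _, v := range vols {
		if v.local {
			use = append(use, v)
		} else {
			remote = append(remote, v)
		}
	}
	if len(use) == 0 {
		// Assumption: with no local disks, fall back to the remote volumes.
		return remote, nil
	}
	for _, v := range remote {
		warnings = append(warnings,
			fmt.Sprintf("remote volume %s is left unused", v.name))
	}
	return use, warnings
}

func main() {
	use, warnings := planStorage([]volume{
		{name: "nvme1", local: true},
		{name: "nvme2", local: true},
		{name: "ebs1", local: false},
	})
	fmt.Println("RAID0 members:", len(use))
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```

Computing the plan up front is what makes it possible to warn before the VM exists, rather than discovering the unused disk from inside the startup script.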