roachprod / roachtest: reconsider how local SSDs are configured #82423
Currently, roachprod has flags for enabling multiple stores on nodes in a cluster, defaulting to `false`. If more than one storage device is present on a node, the devices are combined into a RAID0 (striped) volume. The terms "using multiple disks" and "using multiple stores" are also used interchangeably. Roachtest has its own cluster spec option, `RAID0(enabled)` for enabling RAID0, passing through the `UseMultipleDisks` parameter to roachprod. The former is the negation of the latter (i.e. RAID0 implies _not_ using multiple disks, using multiple disks implies _not_ using RAID0, etc.) The combination of "using multiple disks", "using multiple stores" and "using RAID0" can result in some cognitive overhead. Simplify things by adopting the "using multiple stores" parlance. Replace the `RAID0` roachtest cluster spec option with the `MultipleStores` option, updating existing call-sites with the negation (i.e. `RAID0(true)` becomes `MultipleStores(false)`, etc.). Allow AWS roachtests to enable / disable multiple stores. Previously, this functionality was limited to GCP clusters. Touches cockroachdb#82423. Release note: None.
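The rename described above can be sketched as a cluster-spec option that simply negates into roachprod's setting. This is an illustrative sketch, not the actual roachtest API; the `ClusterSpec`, `Option`, and field names are assumed for the example:

```go
package main

import "fmt"

// ClusterSpec is a stand-in for roachtest's cluster spec.
type ClusterSpec struct {
	// UseMultipleDisks is the roachprod-side setting: true means one
	// store per device, false means devices are striped into RAID0.
	UseMultipleDisks bool
}

// Option mutates a ClusterSpec.
type Option func(*ClusterSpec)

// MultipleStores replaces the old RAID0 option; the two are negations
// of each other: RAID0(true) is equivalent to MultipleStores(false).
func MultipleStores(enabled bool) Option {
	return func(s *ClusterSpec) { s.UseMultipleDisks = enabled }
}

func main() {
	var s ClusterSpec
	MultipleStores(false)(&s) // the old RAID0(true)
	fmt.Println(s.UseMultipleDisks)
}
```

The negation mapping is the whole point of the change: call sites state the intent ("multiple stores: yes/no") rather than the mechanism (RAID0).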
The footgun has been disabled owing to [1]. We still use machine types with local (scratch) disks, both in AWS and Azure. These are unavoidable in some cases, since some newer machine types are not offered without scratch disks. However, those will never be RAIDed with persistent disks. I am closing the issue on the basis that the original problem has been resolved. Feel free to reopen if there is still an issue with the UX.

[1] #98783
Is your feature request related to a problem? Please describe.
Currently, roachprod allows configuring multiple stores on a node when multiple storage devices are present. On GCP and AWS, whether or not to use multiple stores when multiple disks are present is controlled via the `--gce-enable-multiple-stores` and `--aws-enable-multiple-stores` flags, respectively. Both default to `false`. In the case that multiple disks are present but multiple stores are disabled (the default), roachprod will add all devices to a software RAID0 (striped) volume.
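That default can be summarized as a small decision function. This is a minimal sketch of the documented behavior, not roachprod's actual code; `storeLayout` is a hypothetical helper:

```go
package main

import "fmt"

// storeLayout describes what roachprod ends up with on a node: either
// one store per device, or (the default) a single RAID0 striped volume
// whenever more than one device is present.
func storeLayout(numDevices int, enableMultipleStores bool) string {
	if numDevices > 1 && !enableMultipleStores {
		return fmt.Sprintf("1 RAID0 volume striped across %d devices", numDevices)
	}
	return fmt.Sprintf("%d separate store(s)", numDevices)
}

func main() {
	fmt.Println(storeLayout(2, false)) // default: devices are striped
	fmt.Println(storeLayout(2, true))  // one store per device
}
```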
Roachprod has the `--local-ssd` flag, defaulting to `true`. Roachtest also has the `PreferLocalSSD` option (defaulting to `true`, via the default value for the flag), which will attempt to use local SSDs when the machine type supports them (on AWS, instance types whose family ends in `d`, e.g. `m5d`).

By default, on AWS, the machine type ends in `d` (see here). These machines have local "instance storage": a variable number of disks (typically NVMe devices) dependent on the machine type (more in the AWS docs here).

There exists a footgun where a roachtest prefers no local SSD (directly or indirectly setting `PreferLocalSSD` in the roachtest cluster spec), and the test is run on AWS. In this case an EBS volume is created and attached to the node, in addition to the one or more local SSDs that come with the AWS instance type. Furthermore, as the default is to not use multiple stores, the volumes are combined into a RAID0 volume. As seen in #82109, this eclectic mixture of devices can result in unexpected and surprising performance, given that one or more local SSDs are combined in a striped volume with a network-attached EBS volume (which has its own throughput and IOPS limits).
Describe the solution you'd like
Avoid the AWS instance store / EBS volume footgun, both in roachprod and roachtest.
I don't have a strong opinion on the UX, but here are some thoughts that could inform some improvements:
- Consider updating the machine-type selection logic to avoid the `d` instances in the case that the test is running on AWS and the preference is for no local SSDs, to avoid the risk of including additional instance-storage SSDs.
- Consider the interaction of `PreferLocalSSD` and the `spec.SSD` parameter. The combination of the two can get complicated on AWS: is it valid to set the former to `false` when the latter is non-zero? On AWS, `spec.SSD` basically needs to alter the instance type to alter the number of SSDs.
- Do away with the whole "preference" concept for SSDs entirely. Make the spec explicit, in that you either get SSDs or you do not. I can't think of a case where you'd want a mix of local SSDs and non-local SSDs (in a multi-store configuration, and definitely not in a RAID0 configuration).
Jira issue: CRDB-16424