roachprod / roachtest: reconsider how local SSDs are configured #82423

Closed
nicktrav opened this issue Jun 3, 2022 · 2 comments
Labels
  • A-roachprod
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • T-testeng: TestEng Team

Comments

@nicktrav
Collaborator

nicktrav commented Jun 3, 2022

Is your feature request related to a problem? Please describe.

Currently, roachprod allows configuring multiple stores on a node when multiple storage devices are present. On GCP and AWS, whether or not to use multiple stores when multiple disks are present is controlled via the --gce-enable-multiple-stores and --aws-enable-multiple-stores flags, respectively. Both default to false.

In the case that multiple disks are present, but multiple stores are disabled (the default), roachprod will add all devices to a software RAID0 (striped) volume.
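The behaviour described above can be sketched as a small decision function. This is a minimal illustration only: `planDisks` and `diskLayout` are hypothetical names, not roachprod's actual code, and the single boolean stands in for the per-cloud `--gce-enable-multiple-stores` / `--aws-enable-multiple-stores` flags.

```go
package main

import "fmt"

// diskLayout describes how a node's storage devices end up being used.
// Hypothetical type for illustration; not part of roachprod.
type diskLayout struct {
	Stores int  // number of stores configured on the node
	RAID0  bool // whether the devices are striped into one volume
}

// planDisks mirrors the rules stated in the issue:
//   - zero or one disk: a single store, nothing to stripe
//   - several disks, multiple stores enabled: one store per disk
//   - several disks, multiple stores disabled (default): one RAID0 volume
func planDisks(numDisks int, enableMultipleStores bool) diskLayout {
	switch {
	case numDisks <= 1:
		return diskLayout{Stores: 1}
	case enableMultipleStores:
		return diskLayout{Stores: numDisks}
	default:
		return diskLayout{Stores: 1, RAID0: true}
	}
}

func main() {
	fmt.Printf("%+v\n", planDisks(1, false))
	fmt.Printf("%+v\n", planDisks(4, true))
	fmt.Printf("%+v\n", planDisks(4, false))
}
```

Note that the function never looks at what kind of device each disk is, which is how a mixed RAID0 volume can arise.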

Roachprod has the --local-ssd flag, defaulting to true.

Roachtest also has the PreferLocalSSD option (defaulting to true, via the default value for the flag), which will attempt to use local SSDs in the case that:

  • no particular volume size has been requested
  • if on AWS, the machine type supports it (typically a machine type ending in d, e.g. m5d).
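As a rough sketch of the two conditions above, assuming a simplified spec struct: `clusterSpec`, `awsSupportsLocalSSD` and `useLocalSSD` are hypothetical names, and the suffix check is only the heuristic the issue describes ("typically a machine type ending in d"), so it will misclassify some instance families.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterSpec is a hypothetical, trimmed-down stand-in for a roachtest
// cluster spec; field names are assumptions for this sketch.
type clusterSpec struct {
	Cloud          string // e.g. "aws" or "gce"
	MachineType    string // e.g. "m5d.xlarge"
	VolumeSizeGB   int    // 0 means no particular size was requested
	PreferLocalSSD bool
}

// awsSupportsLocalSSD applies the "family ends in d" heuristic.
func awsSupportsLocalSSD(machineType string) bool {
	family, _, _ := strings.Cut(machineType, ".")
	return strings.HasSuffix(family, "d")
}

// useLocalSSD reports whether local SSDs would be attempted.
func useLocalSSD(s clusterSpec) bool {
	if !s.PreferLocalSSD || s.VolumeSizeGB != 0 {
		return false
	}
	if s.Cloud == "aws" {
		return awsSupportsLocalSSD(s.MachineType)
	}
	return true
}

func main() {
	spec := clusterSpec{Cloud: "aws", MachineType: "m5d.xlarge", PreferLocalSSD: true}
	fmt.Println(useLocalSSD(spec))
	spec.VolumeSizeGB = 500
	fmt.Println(useLocalSSD(spec))
}
```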

By default, on AWS, the machine type ends in d (see here). These machines have local "instance storage": a variable number of disks (typically NVMe devices) that depends on the machine type (more in the AWS docs here).

There exists a footgun where a roachtest prefers no local SSD (directly or indirectly setting PreferLocalSSD to false in the roachtest cluster spec), and the test is run on AWS. In this case an EBS volume is created and attached to the node, in addition to the one or more local SSDs that come with the AWS instance type. Furthermore, as the default is to not use multiple stores, all of these volumes are combined into a single RAID0 volume.

As seen in #82109, this eclectic mixture of devices can result in unexpected and surprising performance, given that one or more local SSDs are combined in a striped volume with a network-attached EBS volume (which has its own throughput and IOPS limits).
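To see why the mix is surprising, consider a deliberately simplified model: RAID0 spreads large I/O evenly across its members, so aggregate throughput is roughly the member count times the slowest member's throughput. The sketch below uses that model with made-up numbers. They are not measurements from #82109 or from any real instance or EBS volume type, and `stripedThroughput` is a hypothetical helper.

```go
package main

import "fmt"

// stripedThroughput estimates the aggregate throughput (MB/s) of a RAID0
// volume under the simplifying assumption that every member receives an
// equal share of the I/O, so the slowest member gates the whole stripe.
func stripedThroughput(membersMBps []float64) float64 {
	if len(membersMBps) == 0 {
		return 0
	}
	slowest := membersMBps[0]
	for _, m := range membersMBps[1:] {
		if m < slowest {
			slowest = m
		}
	}
	return slowest * float64(len(membersMBps))
}

func main() {
	// Illustrative values only: two fast local SSDs and one slower
	// network-attached volume.
	localOnly := []float64{1000, 1000}
	mixed := []float64{1000, 1000, 125}

	fmt.Printf("local SSDs only: ~%.0f MB/s\n", stripedThroughput(localOnly))
	fmt.Printf("local SSDs + network volume: ~%.0f MB/s\n", stripedThroughput(mixed))
}
```

Under this model, adding the slower device lowers the estimate for the whole volume, even though the total number of disks went up.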

Describe the solution you'd like

Avoid the AWS instance store / EBS volume footgun, both in roachprod and roachtest.

I don't have a strong opinion on the UX, but here are some thoughts that could inform some improvements:

Consider updating the logic for the machine type to avoid the d instances in the case that the test is running on AWS and the preference is for no SSDs, to avoid risk of including additional instance storage SSDs.

Consider the interaction of PreferLocalSSD and the spec.SSD parameter. The combination of the two can get complicated on AWS - is it valid to set the former to false when the latter is non-zero? On AWS spec.SSD basically needs to alter the instance type to alter the number of SSDs.

Do away with the whole "preference" concept for SSDs entirely. Make the spec explicit in that you either get SSDs or you do not. I can't think of a case where you'd want a mix of local SSDs and non-local SSDs (in a multi-store configuration, and definitely not in a RAID0 configuration).
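One way the "explicit" option could look is a validation step that rejects any node whose volumes are not all of the same kind. This is only a sketch of the idea: `volumeKind` and `validateVolumes` are hypothetical and do not correspond to existing roachprod or roachtest APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// volumeKind distinguishes instance-local SSDs from network-attached
// disks (for example EBS). Hypothetical type for illustration.
type volumeKind int

const (
	localSSD volumeKind = iota
	networkDisk
)

var errMixedVolumes = errors.New("node mixes local SSDs with network-attached volumes")

// validateVolumes returns an error if the volumes destined for a single
// node (whether as one RAID0 volume or as multiple stores) are not all
// of the same kind.
func validateVolumes(kinds []volumeKind) error {
	for i := 1; i < len(kinds); i++ {
		if kinds[i] != kinds[0] {
			return errMixedVolumes
		}
	}
	return nil
}

func main() {
	fmt.Println(validateVolumes([]volumeKind{localSSD, localSSD}))
	fmt.Println(validateVolumes([]volumeKind{localSSD, networkDisk}))
}
```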

Jira issue: CRDB-16424

@nicktrav nicktrav added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-roachprod T-testeng TestEng Team labels Jun 3, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue Jun 3, 2022
Currently, roachprod has flags for enabling multiple stores on nodes in
a cluster, defaulting to `false`. If more than one storage device is
present on a node, the devices are combined into a RAID0 (striped)
volume. The terms "using multiple disks" and "using multiple stores" are
also used interchangeably.

Roachtest has its own cluster spec option, `RAID0(enabled)` for enabling
RAID0, passing through the `UseMultipleDisks` parameter to roachprod.
The former is the negation of the latter (i.e. RAID0 implies _not_ using
multiple disks, using multiple disks implies _not_ using RAID0, etc.)

The combination of "using multiple disks", "using multiple stores" and
"using RAID0" can result in some cognitive overhead. Simplify things by
adopting the "using multiple stores" parlance.

Replace the `RAID0` roachtest cluster spec option with the
`MultipleStores` option, updating existing call-sites with the negation
(i.e. `RAID0(true)` becomes `MultipleStores(false)`, etc.).

Allow AWS roachtests to enable / disable multiple stores. Previously,
this functionality was limited to GCP clusters.

Touches cockroachdb#82423.

Release note: None.
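The rename described in the commit message amounts to flipping a boolean at each call site. A minimal sketch of that equivalence, assuming a functional-options style spec: the option names `RAID0` and `MultipleStores` come from the commit message, but the types and bodies here are assumptions, and the real roachtest spec package may be structured differently.

```go
package main

import "fmt"

// clusterSpec and option are hypothetical stand-ins for the roachtest
// cluster spec and its option type.
type clusterSpec struct {
	UseMultipleStores bool
}

type option func(*clusterSpec)

// MultipleStores is the new-style option.
func MultipleStores(enabled bool) option {
	return func(s *clusterSpec) { s.UseMultipleStores = enabled }
}

// RAID0 shows the old option expressed in terms of the new one:
// RAID0(true) is MultipleStores(false), and vice versa.
func RAID0(enabled bool) option {
	return MultipleStores(!enabled)
}

// makeSpec applies options on top of the default (multiple stores off).
func makeSpec(opts ...option) clusterSpec {
	var s clusterSpec
	for _, o := range opts {
		o(&s)
	}
	return s
}

func main() {
	fmt.Println(makeSpec(RAID0(true)) == makeSpec(MultipleStores(false)))
	fmt.Println(makeSpec(RAID0(false)) == makeSpec(MultipleStores(true)))
}
```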
tbg pushed a commit to tbg/cockroach that referenced this issue Jun 9, 2022

github-actions bot commented Jan 3, 2024

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@srosenberg
Member

The footgun has been disabled owing to [1]. We still use machine types with local (scratch) disks, both in AWS and Azure. These are unavoidable in some cases since new machine types don't provide the option without scratch disks. However, those will never be RAIDed with persistent disks. I am closing the issue on the basis that the original problem has been resolved. Feel free to reopen if there is still an issue wrt UX.

[1] #98783
