
roachprod,roachtest: clean up references to raid and multiple stores #82427

Closed

Conversation

nicktrav
Collaborator

@nicktrav nicktrav commented Jun 3, 2022

Currently, roachprod has flags for enabling multiple stores on nodes in
a cluster, defaulting to `false`. If more than one storage device is
present on a node, the devices are combined into a RAID0 (striped)
volume. The terms "using multiple disks" and "using multiple stores" are
also used interchangeably.

Roachtest has its own cluster spec option, `RAID0(enabled)`, for enabling
RAID0, passing through the `UseMultipleDisks` parameter to roachprod.
The former is the negation of the latter (i.e. RAID0 implies _not_ using
multiple disks, and using multiple disks implies _not_ using RAID0).

The combination of "using multiple disks", "using multiple stores", and
"using RAID0" can result in some cognitive overhead. Simplify things by
adopting the "using multiple stores" parlance.

Replace the `RAID0` roachtest cluster spec option with the
`MultipleStores` option, updating existing call sites with the negation
(i.e. `RAID0(true)` becomes `MultipleStores(false)`, etc.).

Allow AWS roachtests to enable / disable multiple stores. Previously,
this functionality was limited to GCP clusters.

Touches #82423.

Release note: None.
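As background for the diff discussed later in the thread, roachtest specs are built with functional options. The following is a minimal, self-contained sketch of that pattern with a `MultipleStores` option defaulting to `true`; the types here are simplified stand-ins for illustration, not the actual roachtest API (for instance, the real `MakeClusterSpec` also takes an instance type and more defaults):

```go
package main

import "fmt"

// ClusterSpec is a simplified stand-in for roachtest's spec type.
type ClusterSpec struct {
	Cloud          string
	NodeCount      int
	MultipleStores bool
}

// Option mutates a ClusterSpec, mirroring roachtest's functional-option style.
type Option func(*ClusterSpec)

// MultipleStores toggles one-store-per-disk behavior. When false and a node
// has several disks, they would instead be striped into a single RAID0 volume.
func MultipleStores(enabled bool) Option {
	return func(s *ClusterSpec) { s.MultipleStores = enabled }
}

// MakeClusterSpec applies defaults first, then caller-supplied overrides.
func MakeClusterSpec(cloud string, nodeCount int, opts ...Option) ClusterSpec {
	spec := ClusterSpec{Cloud: cloud, NodeCount: nodeCount}
	defaults := []Option{MultipleStores(true)} // multiple stores on by default
	for _, o := range append(defaults, opts...) {
		o(&spec)
	}
	return spec
}

func main() {
	// A call site opting out, i.e. the equivalent of the old RAID0(true).
	s := MakeClusterSpec("gce", 4, MultipleStores(false))
	fmt.Println(s.MultipleStores) // prints "false"
}
```

Note how `MultipleStores(false)` at a call site overrides the default appended in `MakeClusterSpec`, which is how existing `RAID0(true)` call sites map onto `MultipleStores(false)`.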

@nicktrav nicktrav requested a review from tbg June 3, 2022 21:03
@nicktrav nicktrav requested a review from a team as a code owner June 3, 2022 21:03

```diff
@@ -64,7 +64,7 @@ type ClusterSpec struct {
 // MakeClusterSpec makes a ClusterSpec.
 func MakeClusterSpec(cloud string, instanceType string, nodeCount int, opts ...Option) ClusterSpec {
 	spec := ClusterSpec{Cloud: cloud, InstanceType: instanceType, NodeCount: nodeCount}
-	defaultOpts := []Option{CPU(4), nodeLifetimeOption(12 * time.Hour), ReuseAny()}
+	defaultOpts := []Option{CPU(4), nodeLifetimeOption(12 * time.Hour), ReuseAny(), MultipleStores(true)}
```
Collaborator Author

The existing default (changed in #72803) is to use multiple stores on GCP. This didn't apply for AWS, as one can't use spec.SSD with AWS clusters (yet). This just makes it consistent across clouds.

```diff
@@ -128,13 +130,7 @@ func getGCEOpts(
 		opts.Zones = zones
 	}
 	opts.SSDCount = localSSDCount
 	if localSSD && localSSDCount > 0 {
```
Collaborator Author

I think this change makes things simpler to understand: multiple stores vs. a single store, regardless of whether the disks are SSDs or PDs. The same applies to AWS.

@tbg
Member

tbg commented Jun 7, 2022

Confirmed that with this spec:

```go
Cluster: r.MakeClusterSpec(4, spec.CPU(4), spec.ReuseNone(), spec.AvoidSSD(), spec.SSD(0), spec.MultipleStores(true)),
```

I'm getting

```
nvme2n1     259:0    0 93.1G  0 disk /mnt/data2
nvme1n1     259:1    0  500G  0 disk /mnt/data1
```

i.e. the EBS volume is data1. This is what I wanted. I need to check whether I also need the AvoidSSD and SSD(0) options. The status quo is still a bit odd: I want that test to only have EBS, not the local SSD, but the SSD is there despite my attempts to exclude it. I think this is just how it's going to be (c5.xlarge instances come with a 100 GB NVMe SSD), but it makes the default of striping pretty annoying. I would feel better if we made striping not the default, considering how much time I burned not realizing I was striping over two storage media with very different characteristics.
This would introduce a possible new problem (you have multiple stores but are accidentally really using just one), but I think that is more of a niche problem, and also a lot more obvious. WDYT?

@srosenberg
Member

srosenberg commented Jun 7, 2022

> This status quo is still a bit odd. I want that test to only have EBS. I don't want the local SSD. But, it's there, despite my attempts to not have it. I think this is how it's going to be - c5.xlarge instances just come with 100gb nvme ssd - but it makes the default of striping pretty annoying. I would feel better if we made striping not the default, just taking into account how much time I burned not realizing I was striping over two storage media with very different characteristics. This would lead to a possible new problem - you have multiple stores but are accidentally really using just one - but I think that is more of a niche problem, and also a lot more obvious. WDYT?

I wonder if it's time to extend the spec to support any combination of persistent drives and local SSDs, optionally using RAID0. Roughly, this would look something like the following:

```go
type DiskSpec struct {
	Type     string
	Size     int
	Volume   VolumeSpec
	BootDisk bool
}

type VolumeSpec struct {
	Name       string
	FileSystem fileSystemType
}

type ClusterSpec struct {
	...
	Disks  []DiskSpec
	Stores []VolumeSpec
	...
}
```

Disks with the same VolumeSpec are effectively configured via RAID0. Stores essentially correspond to VolumeSpecs. (For comparison, Terraform has a lot more options [1], many of which we simply don't need at the moment.) Obviously, the common path would rely on helper functions (elided) to construct a reasonable default, possibly matching what we do today. Sorry for the digression; I am just wondering how much longer until our leaky abstraction requires a face lift.

[1] https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_disk#argument-reference
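To make the proposal concrete, here is a hypothetical construction of the cluster tbg wants (a single EBS-backed store, with the local SSD left out of any store). All names, field values, and the collapse of `fileSystemType` to a plain string are illustrative; none of this is an existing API:

```go
package main

import "fmt"

// Simplified versions of the proposed spec types; fileSystemType is
// collapsed to a plain string for this sketch.
type VolumeSpec struct {
	Name       string
	FileSystem string
}

type DiskSpec struct {
	Type   string
	Size   int // GiB
	Volume VolumeSpec
}

type ClusterSpec struct {
	Disks  []DiskSpec
	Stores []VolumeSpec
}

func main() {
	// One persistent (EBS) disk backing the only store. Two disks sharing
	// the same VolumeSpec would instead be striped together as RAID0.
	pd := VolumeSpec{Name: "data1", FileSystem: "ext4"}
	spec := ClusterSpec{
		Disks:  []DiskSpec{{Type: "gp3", Size: 500, Volume: pd}},
		Stores: []VolumeSpec{pd},
	}
	fmt.Println(len(spec.Stores)) // prints "1"
}
```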

@nicktrav
Collaborator Author

nicktrav commented Jun 9, 2022

> Cluster: r.MakeClusterSpec(4, spec.CPU(4), spec.ReuseNone(), spec.AvoidSSD(), spec.SSD(0), spec.MultipleStores(true))
>
> I want that test to only have EBS. I don't want the local SSD. But, it's there, despite my attempts to not have it. I think this is how it's going to be - c5.xlarge instances just come with 100gb nvme ssd - but it makes the default of striping pretty annoying.

@tbg - You're getting the extra disk because, by default, we explicitly use instance types that come with local disks (here), calculated from the number of CPUs one requests - note the d ("disk", perhaps?). If you explicitly use something like a c5 (sans d), you won't have this issue. I confirmed that and posted the results here.

> I would feel better if we made striping not the default

This patch isn't addressing that, nor the default instance type mentioned above, as doing so would probably require an audit of existing tests to see if we'd be pulling the rug on anyone who relies on it. I learned that the hard way in #72803. Note that that PR actually changed some default behavior for roachtests on GCE (defaulting to multiple stores).

That said, I agree with you that we probably shouldn't stripe by default. That's all handled here (AWS) and here (GCP), fwiw.

@srosenberg - I agree we probably need to think about extending the spec. I didn't want to bite off more than I could chew with this change.

My intention with this patch was cosmetic: do away with the RAID0 option in favor of MultipleStores, which takes effect if multiple disks are present (see the linked bash script above). If MultipleStores, no striping; if !MultipleStores, stripe. It also makes that behavior consistent across GCP and AWS.
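The store-vs-stripe decision described above can be summarized as a small decision function. This is an illustrative sketch only, not roachprod's actual implementation (which lives in cloud-specific disk-setup scripts), and the mount paths are examples:

```go
package main

import "fmt"

// mountPlan sketches the disk-setup decision: with multiple stores enabled,
// each disk gets its own mount point; otherwise all disks are striped into
// a single RAID0 volume behind one mount point.
func mountPlan(multipleStores bool, numDisks int) []string {
	if numDisks <= 1 {
		// With a single disk there is nothing to stripe either way.
		return []string{"/mnt/data1"}
	}
	if multipleStores {
		mounts := make([]string, numDisks)
		for i := range mounts {
			mounts[i] = fmt.Sprintf("/mnt/data%d", i+1)
		}
		return mounts
	}
	return []string{"/mnt/data1 (RAID0 over all disks)"}
}

func main() {
	fmt.Println(mountPlan(true, 2))  // prints "[/mnt/data1 /mnt/data2]"
	fmt.Println(mountPlan(false, 2)) // prints "[/mnt/data1 (RAID0 over all disks)]"
}
```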

Perhaps another way of handling this would be a SingleStore option defaulting to false, analogous to the existing RAID0 option: you'd have to explicitly opt into using a single store. However, if you only have a single disk, it's then confusing to have the default be false. /shrug

If we'd prefer to go with the larger rework to extend the spec, which would most likely obsolete this patch, I'm happy to decline this in favor of that.

@tbg
Member

tbg commented Jun 9, 2022

> If you explicitly use something like a c5 (sans d), you won't have this issue. I confirmed that and posted the results #82109 (comment).

Right, but I can't, since this is a roachtest that is cloud-agnostic. Or rather, I can, but it seems ugly.

@tbg tbg removed their request for review June 21, 2022 18:40
@tbg
Member

tbg commented Jun 21, 2022

Un-requesting review since this PR is stalled - happy to re-engage when pinged.

@nicktrav
Collaborator Author

Going to close out for now. The "right fix" likely requires a restructuring of some existing specs. Ideally the conversation here would inform such a fix, in addition to what's mentioned in #82423 already.

@nicktrav nicktrav closed this Jun 21, 2022
@nicktrav
Collaborator Author

TFTRs!

@nicktrav nicktrav deleted the nickt.roachprod-multiple-stores branch June 21, 2022 18:43