[wip] roachprod: don't use RAID0 by default
This is a WIP because the behavior for machine types with local SSDs is
unclear. For example, on AWS, roachtest prefers the c5d family, all of
which come with local SSD storage. But looking into
`awsStartupScriptTemplate`, it is unclear how to make sure that the
EBS disk(s) get mounted as /mnt/data1 (which is probably what the
default should be).

We could also entertain straight-up preventing combinations that would
lead to an inhomogeneous RAID0. I imagine we'd have to take a round of
failures to find all of the places in which it happens, but perhaps
a "snitch" can be inserted instead so that we can detect all such
callers and fix them up before arming the check.
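
One way the "snitch" idea could look is sketched below. This is only an
illustration under stated assumptions: `diskRequest`, `inhomogeneousRAID0`
and `checkDisks` are hypothetical names, not part of roachtest or roachprod,
and the real check would have to live wherever cluster specs are turned into
provider options. While unarmed, the check only logs the offending caller;
once armed, it rejects the combination.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// diskRequest is a hypothetical, simplified stand-in for the parts of a
// cluster spec that matter here; it is not the real roachtest ClusterSpec.
type diskRequest struct {
	Caller      string // who asked for this cluster, e.g. a test name
	UseLocalSSD bool   // machine type comes with local SSD storage
	VolumeSize  int    // non-zero means a network volume is attached as well
	RAID0       bool   // stripe all attached disks into a single store
}

// inhomogeneousRAID0 reports whether the request would stripe a local SSD
// together with a network-attached volume.
func inhomogeneousRAID0(r diskRequest) bool {
	return r.RAID0 && r.UseLocalSSD && r.VolumeSize != 0
}

// checkDisks is the "snitch": while armed is false it only logs offending
// callers; once armed it returns an error for them.
func checkDisks(r diskRequest, armed bool) error {
	if !inhomogeneousRAID0(r) {
		return nil
	}
	msg := fmt.Sprintf("%s: RAID0 over local SSD and a network volume of size %d",
		r.Caller, r.VolumeSize)
	if !armed {
		log.Printf("snitch: %s", msg)
		return nil
	}
	return errors.New(msg)
}

func main() {
	r := diskRequest{Caller: "example/test", UseLocalSSD: true, VolumeSize: 500, RAID0: true}
	fmt.Println(checkDisks(r, false) == nil) // unarmed: logged, not rejected
	fmt.Println(checkDisks(r, true) != nil)  // armed: rejected
}
```

Running with the check unarmed for a while would surface all callers in the
logs before any test starts failing.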

By the way, EBS disks on AWS come with a default throughput of 125 MB/s,
which is less than this RAID0 gets "most of the time" - so we can expect
some tests to behave differently after this change. I still believe this
is worth it - debugging is so much harder when you're on top of storage
that's hard to predict and doesn't resemble any production deployment.

----

I wasted weeks of my life on this before, and it almost happened again!
When you run a roachtest that asks for an AWS cXd machine (i.e. compute
optimized with NVMe local disk), and you specify a VolumeSize, you also
get an EBS volume. Prior to this commit, the two would be RAID0'ed
together.

This isn't sane - the resulting gp3 EBS volume is very different from
the local NVMe volume in every way, and it led to hard-to-understand
write throughput behavior.

This commit defaults to *not* using RAID0.
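
The resulting option handling can be summarized with the minimal sketch
below. It assumes a trimmed-down `providerOpts` type and the helper names
`defaultProviderOpts` and `resolveOpts`, all of which are hypothetical
stand-ins for the AWS/GCE provider options in the diff; only the
`UseMultipleDisks` field name and its meaning are taken from the change
itself.

```go
package main

import "fmt"

// providerOpts is a hypothetical, trimmed-down stand-in for the AWS/GCE
// provider options touched by this commit.
type providerOpts struct {
	// UseMultipleDisks == true: each disk is its own store.
	// UseMultipleDisks == false: all disks are combined via RAID0.
	UseMultipleDisks bool
}

// defaultProviderOpts mirrors the new default: no RAID0.
func defaultProviderOpts() providerOpts {
	return providerOpts{UseMultipleDisks: true}
}

// resolveOpts applies a test's explicit RAID0 request on top of the default.
func resolveOpts(raid0 bool) providerOpts {
	opts := defaultProviderOpts()
	if raid0 {
		opts.UseMultipleDisks = false
	}
	return opts
}

func main() {
	fmt.Println(resolveOpts(false).UseMultipleDisks) // true: no RAID0 unless requested
	fmt.Println(resolveOpts(true).UseMultipleDisks)  // false: RAID0 was requested
}
```

In other words, RAID0 becomes strictly opt-in: a test gets it only by
setting RAID0 in its spec.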

Touches cockroachdb#98767.
Touches cockroachdb#98576.
Touches cockroachdb#97019.

Epic: none
Release note: None
tbg committed Mar 16, 2023
1 parent 4dc10b5 commit 199969f
Showing 4 changed files with 14 additions and 8 deletions.
17 changes: 9 additions & 8 deletions pkg/cmd/roachtest/spec/cluster_spec.go
@@ -104,11 +104,16 @@ func awsMachineSupportsSSD(machineType string) bool {
 	return false
 }
 
-func getAWSOpts(machineType string, zones []string, volumeSize int, localSSD bool) vm.ProviderOpts {
+func getAWSOpts(
+	machineType string, zones []string, volumeSize int, localSSD bool, RAID0 bool,
+) vm.ProviderOpts {
 	opts := aws.DefaultProviderOpts()
 	if volumeSize != 0 {
 		opts.DefaultEBSVolume.Disk.VolumeSize = volumeSize
 	}
+	if RAID0 {
+		opts.UseMultipleDisks = false // NB: the default is true
+	}
 	if localSSD {
 		opts.SSDMachineType = machineType
 	} else {
@@ -137,12 +142,8 @@ func getGCEOpts(
 		opts.Zones = zones
 	}
 	opts.SSDCount = localSSDCount
-	if localSSD && localSSDCount > 0 {
-		// NB: As the default behavior for _roachprod_ (at least in AWS/GCP) is
-		// to mount multiple disks as a single store using a RAID 0 array, we
-		// must explicitly ask for multiple stores to be enabled, _unless_ the
-		// test has explicitly asked for RAID0.
-		opts.UseMultipleDisks = !RAID0
+	if RAID0 {
+		opts.UseMultipleDisks = false // NB: the default is true, i.e. no RAID0
 	}
 	opts.TerminateOnMigration = terminateOnMigration
 
@@ -250,7 +251,7 @@ func (s *ClusterSpec) RoachprodOpts(
 	var providerOpts vm.ProviderOpts
 	switch s.Cloud {
 	case AWS:
-		providerOpts = getAWSOpts(machineType, zones, s.VolumeSize, createVMOpts.SSDOpts.UseLocalSSD)
+		providerOpts = getAWSOpts(machineType, zones, s.VolumeSize, createVMOpts.SSDOpts.UseLocalSSD, s.RAID0)
 	case GCE:
 		providerOpts = getGCEOpts(machineType, zones, s.VolumeSize, ssdCount,
 			createVMOpts.SSDOpts.UseLocalSSD, s.RAID0, s.TerminateOnMigration)
3 changes: 3 additions & 0 deletions pkg/cmd/roachtest/spec/machine_type.go
@@ -42,6 +42,9 @@ func AWSMachineType(cpus int, highmem bool) string {
 	}
 
 	// There is no c5d.24xlarge.
+	//
+	// TODO(tbg): there seems to be, see:
+	// https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/compute-optimized-instances.html
 	if family == "c5d" && size == "24xlarge" {
 		family = "m5d"
 	}
1 change: 1 addition & 0 deletions pkg/roachprod/vm/aws/aws.go
@@ -203,6 +203,7 @@ func DefaultProviderOpts() *ProviderOpts {
 		RemoteUserName:   "ubuntu",
 		DefaultEBSVolume: defaultEBSVolumeValue,
 		CreateRateLimit:  2,
+		UseMultipleDisks: true, // don't default to RAID0
 	}
 }
 
1 change: 1 addition & 0 deletions pkg/roachprod/vm/gce/gcloud.go
@@ -230,6 +230,7 @@ func DefaultProviderOpts() *ProviderOpts {
 		PDVolumeType:         "pd-ssd",
 		PDVolumeSize:         500,
 		TerminateOnMigration: false,
+		UseMultipleDisks:     true, // don't default to RAID0
 		useSharedUser:        true,
 		preemptible:          false,
 	}