
roachprod,roachtest: fix mounting on GCP and disable RAID0 in roachtests #72803

Merged
1 commit merged into cockroachdb:master on Nov 17, 2021

Conversation

@AlexTalks (Contributor) commented Nov 16, 2021

Fixes bugs introduced by #72553, which ported support for multiple
stores from AWS to GCP. This change fixes the bugs that caused
mounted disks not to be written to /etc/fstab on GCP instances,
causing mounts to go missing on instance restart. Additionally, this
change modifies roachtest to default to creating multiple stores on GCP
in roachprod invocations rather than mounting a RAID 0 array, thus
preserving the existing behavior used in roachtests such as
kv/multi-store-with-overload.

Fixes #72635

Release note: None
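
For context, a minimal sketch of the intended per-disk mount logic (illustrative, not the literal diff; variable names follow the startup-script excerpts quoted in the review below):

    # Illustrative sketch of the fixed behavior; not the literal diff.
    disknum=0
    for d in "${disks[@]}"; do
      disknum=$((disknum + 1))
      mountpoint="/mnt/data${disknum}"
      mkdir -p "${mountpoint}"
      mkfs.ext4 -q -F "${d}"
      mount -o "${mount_opts}" "${d}" "${mountpoint}"
      # The bug: the script emitted "{d}" instead of "${d}", leaving a malformed
      # fstab entry, so the mount disappeared on instance restart.
      echo "${d} ${mountpoint} ext4 ${mount_opts} 1 1" | tee -a /etc/fstab
    done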

@AlexTalks requested a review from a team as a code owner November 16, 2021 06:48
@cockroach-teamcity (Member) commented:

This change is Reviewable

@AlexTalks requested review from tbg and nicktrav November 16, 2021 06:48
@AlexTalks (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @tbg)


pkg/roachprod/vm/gce/utils.go, line 114 at r1 (raw file):

{{ else }}
  raiddisk="/dev/md0"
  mdadm -q --create ${raiddisk} --level=0 --raid-devices=${#disks[@]} "${disks[@]}"

Note for @nicktrav: it's not clear to me that this functionality works even now. Prior to my fixes, when I ran roachtest kv/multi-store-with-overload, which (admittedly) creates nodes with 2 SSDs intending them to be separate stores rather than a RAID 0 array, this line resulted in the following error message:

GCEMetadataScripts: startup-script: mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array
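
(For the record, a quick way to check whether the array actually came up, using standard tools; /dev/md0 and /mnt/data1 are the names the script uses:)

    mdadm -q --detail /dev/md0   # array state and member disks
    lsblk                        # md0 should appear under each local SSD
    df -h /mnt/data1             # and should be mounted at the expected mountpoint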

@tbg (Member) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @nicktrav)


pkg/roachprod/vm/gce/utils.go, line 101 at r1 (raw file):

    mkfs.ext4 -q -F ${disk}
    mount -o ${mount_opts} ${disk} ${mountpoint}
    echo "{d} ${mountpoint} ext4 ${mount_opts} 1 1" | tee -a /etc/fstab

Shouldn't this be ${d}?
Also, couldn't we write to fstab and then "just" call mount -a which gives us high confidence that it will work after a restart?
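
Roughly: append each entry inside the per-disk loop, then a single mount -a at the end (untested sketch for the ext4 path, same variables as above):

    # inside the per-disk loop:
    echo "${d} ${mountpoint} ext4 ${mount_opts} 1 1" | tee -a /etc/fstab
    # once, after the loop:
    mount -a   # mounts everything from fstab, so a restart exercises the same path as first boot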

@tbg (Member) left a comment

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @nicktrav)

@AlexTalks (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @tbg)


pkg/roachprod/vm/gce/utils.go, line 101 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Shouldn't this be ${d}?
Also, couldn't we write to fstab and then "just" call mount -a which gives us high confidence that it will work after a restart?

Shouldn’t what be ${d}? That is what I am changing the first param to; do you mean a different one?

@tbg (Member) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @nicktrav)


pkg/roachprod/vm/gce/utils.go, line 101 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Shouldn’t what be ${d}? That is what I am changing the first param to, do you mean a different one?

It reads {d} now, not ${d}.

@nicktrav (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @tbg)


pkg/roachprod/vm/aws/aws.go, line 304 at r1 (raw file):

		o.ImageAMI, "Override image AMI to use.  See https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ec2/describe-images.html")
	flags.BoolVar(&o.UseMultipleDisks, ProviderName+"-enable-multiple-stores",
		true, "Enable the use of multiple stores by creating one store directory per disk. "+

Is it safe to just change the default like this? Do you need to announce this? I realize that for GCP, this patch reverts to old behavior. However, for AWS, there are probably tests that depend on this being false.

I was curious about this when I was writing the original patch, so I did some digging and it looks like it used to be explicitly set to false. That changed in this commit. Setting it explicitly seems clearer, though my vote would be to keep this as false.

I don't really have a horse in this race, so I won't oppose changing this to true (we'll just need to update our Pebble benchmark scripts). However, I just want to point out that we're changing behavior for AWS.
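
To make the flag concrete, an illustrative invocation (the flag name follows the ProviderName+"-enable-multiple-stores" registration above; the cluster name, node count, and other flags are placeholders):

    roachprod create $USER-multistore -n 4 --clouds aws --aws-enable-multiple-stores=false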


pkg/roachprod/vm/gce/utils.go, line 101 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It reads {d} now, not ${d}.

Thanks for catching and fixing.


pkg/roachprod/vm/gce/utils.go, line 114 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Note for @nicktrav : it's not clear to me that this functionality works even now. Prior to my fixes, when I ran roachtest kv/multi-store-with-overload, which (admittedly) creates nodes with 2 SSDs intending them to be separate stores rather than a RAID 0 array, this line resulted in the following error message:

GCEMetadataScripts: startup-script: mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array

I think it's just a poor error message. My understanding is that it just means that module wasn't used to create the array. I looked at the source code, and it returns 0 immediately after printing this.

If you read further down in the script output, you'll see the startup script prints the status of the block devices, and the striped disk has indeed been mounted.

For example:

Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: loop0     7:0    0  55.4M  1 loop  /snap/core18/2066
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: loop1     7:1    0 230.1M  1 loop  /snap/google-cloud-sdk/184
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: loop2     7:2    0  67.6M  1 loop  /snap/lxd/20326
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: loop3     7:3    0  32.1M  1 loop  /snap/snapd/12057
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: sda       8:0    0    10G  0 disk  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: ├─sda1    8:1    0   9.9G  0 part  /
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: ├─sda14   8:14   0     4M  0 part  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: └─sda15   8:15   0   106M  0 part  /boot/efi
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: nvme0n1 259:0    0   375G  0 disk  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: └─md0     9:0    0     3T  0 raid0 /mnt/data1
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: nvme0n2 259:1    0   375G  0 disk  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: └─md0     9:0    0     3T  0 raid0 /mnt/data1
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: nvme0n3 259:2    0   375G  0 disk  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: └─md0     9:0    0     3T  0 raid0 /mnt/data1
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: nvme0n4 259:3    0   375G  0 disk  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: └─md0     9:0    0     3T  0 raid0 /mnt/data1
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: nvme0n5 259:4    0   375G  0 disk  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: └─md0     9:0    0     3T  0 raid0 /mnt/data1
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: nvme0n6 259:5    0   375G  0 disk  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: └─md0     9:0    0     3T  0 raid0 /mnt/data1
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: nvme0n7 259:6    0   375G  0 disk  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: └─md0     9:0    0     3T  0 raid0 /mnt/data1
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: nvme0n8 259:7    0   375G  0 disk  
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: └─md0     9:0    0     3T  0 raid0 /mnt/data1
         Stopping OpenBSD Secure Shell server...
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: Filesystem      Size  Used Avail Use% Mounted on
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: /dev/root       9.6G  1.6G  8.0G  17% /
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: devtmpfs        7.4G     0  7.4G   0% /dev
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: tmpfs           7.4G     0  7.4G   0% /dev/shm
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: tmpfs           1.5G  2.0M  1.5G   1% /run
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: tmpfs           5.0M     0  5.0M   0% /run/lock
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: tmpfs           7.4G     0  7.4G   0% /sys/fs/cgroup
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: /dev/loop1      231M  231M     0 100% /snap/google-cloud-sdk/184
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: /dev/loop0       56M   56M     0 100% /snap/core18/2066
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: /dev/sda15      105M  7.9M   97M   8% /boot/efi
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: /dev/loop2       68M   68M     0 100% /snap/lxd/20326
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: /dev/loop3       33M   33M     0 100% /snap/snapd/12057
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: tmpfs           1.5G     0  1.5G   0% /run/user/1050
Nov 16 15:52:43 travers-test-0001 GCEMetadataScripts[3276]: 2021/11/16 15:52:43 GCEMetadataScripts: startup-script: /dev/md0        2.9T   89M  2.8T   1% /mnt/data1

Funnily enough, they changed the error message to address the confusion we're talking about :)

@AlexTalks (Contributor, Author) commented

(Replying directly on GitHub PR as Reviewable is down).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @nicktrav)

pkg/roachprod/vm/gce/utils.go, line 101 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…
It reads {d} now, not ${d}.

Ah yes, thanks!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @tbg)

pkg/roachprod/vm/aws/aws.go, line 304 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote… (quoted in full above: whether it's safe to change the AWS multiple-stores default)

While looking (perhaps non-exhaustively) at roachtest's invocations of roachprod, it seemed like when a test's ClusterSpec specified multiple disks, it was for multiple stores and not for a single RAID 0 store. I also don't have a horse in the race here, so either default seems fine, but if we default to false maybe we should make roachtests incorporate the --[gce|aws]-enable-multiple-stores flag by default, unless a RAID 0 array is explicitly asked for? I suppose I am asking what the preferred, common, or expected behavior might be here; not sure if you might have some insight, @nicktrav.

@AlexTalks (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @tbg)


pkg/roachprod/vm/gce/utils.go, line 101 at r1 (raw file):


@nicktrav also any preferences on what @tbg proposed with writing to /etc/fstab first and then just calling mount -a? (I suppose we'd just have to skip this for ZFS?)


pkg/roachprod/vm/gce/utils.go, line 114 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote… (quoted in full above: the explanation of the mdadm message, with the lsblk/df output showing md0 mounted at /mnt/data1)

OK - Good to know, thanks! 👍

@nicktrav (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @tbg)


pkg/roachprod/vm/aws/aws.go, line 304 at r1 (raw file):

if you might have some insight @nicktrav.

This has been my first foray into the roachtest infra. Unfortunately, I don't have insight other than what we need for the Pebble / storage benchmarks. I did a quick look too, and the only other multi-SSD configuration (in code) was here.

While looking (perhaps non-exhaustively) at roachtest's invocations of roachprod, it seemed like when a test's ClusterSpec specified multiple disks, it was for multiple stores and not for a single RAID 0 store.

Seems reasonable. Perhaps worth a plug internally (mailing list or Slack) to mention this, just in case?

incorporate the --[gce|aws]-enable-multiple-stores flag by default

I'll put up a patch that explicitly sets this to make it obvious that we (storage) depend on the functionality, and will therefore be immune to whatever the default is set to.

@AlexTalks requested a review from a team as a code owner November 16, 2021 19:30
@AlexTalks (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @tbg)


pkg/roachprod/vm/aws/aws.go, line 304 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote… (quoted in full above)

Discussed on internal Slack and agreed it would be best to avoid side effects here. For now we'll just make sure the default in roachtests that create nodes with multiple SSDs is to mount multiple stores, and if RAID0 behavior is needed in roachtest at some point it can be added as a ClusterSpec option.

@nicktrav (Collaborator) left a comment

:lgtm:

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks)


pkg/roachprod/vm/gce/utils.go, line 101 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

@nicktrav also any preferences on what @tbg proposed with writing to /etc/fstab first and then just calling mount -a? (I suppose we'd just have to skip this for ZFS?)

No preference there. I'll just say that altering these bash scripts is brittle, so I'd suggest testing manually across a few combinations of config, if you do.
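
e.g. a spot-check per combination, something like ($CLUSTER is a placeholder):

    roachprod run $CLUSTER:1 -- cat /etc/fstab       # one entry per data disk?
    roachprod run $CLUSTER:1 -- sudo reboot || true  # ssh drops as the node restarts
    sleep 60
    roachprod run $CLUSTER:1 -- df -h /mnt/data1     # still mounted after the restart?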

@AlexTalks (Contributor, Author) commented

bors r+

@AlexTalks linked an issue Nov 16, 2021 that may be closed by this pull request

craig bot commented Nov 17, 2021

Build failed (retrying...)


craig bot commented Nov 17, 2021

Build succeeded

@craig (bot) merged commit 45d2d36 into cockroachdb:master Nov 17, 2021

tbg commented Nov 17, 2021

blathers backport 21.2
