
roachprod: RAID0 only network disks if both local and network are present #119906

Merged 1 commit into cockroachdb:master on Mar 13, 2024

Conversation

@nameisbhaskar (Contributor) commented Mar 5, 2024

Today, if a machine has both local and network disks, both types are selected for RAID'ing.

However, RAID0'ing disks of different types leads to inconsistent performance.

To address this, local disks are now excluded from RAID'ing whenever network disks are present.

Fixes: #98783
Epic: none

@nameisbhaskar nameisbhaskar requested a review from a team as a code owner March 5, 2024 07:42
@nameisbhaskar nameisbhaskar requested review from srosenberg and renatolabs and removed request for a team March 5, 2024 07:42
@cockroach-teamcity (Member) commented: This change is Reviewable

@nameisbhaskar nameisbhaskar changed the title roachprod: don't RAID0 local SSDs when PDs are present roachprod: RAID0 only network disks if both local and network are present Mar 6, 2024
@srosenberg (Member) commented Mar 10, 2024

We should resolve this TODO [1] and do a full (test) run in both GCE and AWS. Have you thought about how to test this definitively? Note that the startup script is captured by the journalctl log (see FetchJournalctl). However, we don't save these unless a test failed (see collectArtifacts). Maybe we change that behavior, at least for the test runs? This way we can post-analyze all the startup scripts to ensure there was no "funny" RAID'ing. What do you think?

[1]

// Work around an issue that RAID0s local NVMe and GP3 storage together:
// https://github.com/cockroachdb/cockroach/issues/98783.
//
// TODO(srosenberg): Remove this workaround when 98783 is addressed.
// TODO(miral): This now returns an error instead of panicking, so even though
// we haven't panicked here before, we should handle the error. Moot if this is
// removed as per TODO above.
s.AWS.MachineType, _, _ = spec.SelectAWSMachineType(s.CPUs, s.Mem, false /* shouldSupportLocalSSD */, vm.ArchAMD64)
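A hypothetical post-analysis pass over the captured startup scripts could look like the sketch below; `raidedDevices` and the assumed `mdadm --create` line format are illustrations, not actual roachprod code:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// raidedDevices scans captured startup-script output (e.g. from the
// journalctl logs mentioned above) for mdadm RAID0 invocations and
// returns the member devices, so that arrays mixing local NVMe and
// network storage would stand out during post-analysis.
func raidedDevices(log string) []string {
	var devices []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "mdadm --create") {
			continue
		}
		for _, tok := range strings.Fields(line) {
			// Collect member devices; skip the array device itself.
			if strings.HasPrefix(tok, "/dev/") && !strings.HasPrefix(tok, "/dev/md") {
				devices = append(devices, tok)
			}
		}
	}
	return devices
}

func main() {
	log := "mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1\n"
	fmt.Println(raidedDevices(log)) // [/dev/nvme0n1 /dev/nvme1n1]
}
```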

@nameisbhaskar (Contributor, Author)
> We should resolve this TODO [1] and do a full (test) run in both GCE and AWS. Have you thought how to test this definitively? […]

Regarding testing, I have tested roachprod deployment with different disk configurations. Here are the scenarios and the expected behaviour as per my understanding:

  1. (local SSDs = 0, network disks = 1): no RAID'ing
  2. (local SSDs = 1, network disks = 0): no RAID'ing
  3. (local SSDs >= 1, network disks = 1): no RAID'ing
  4. (local SSDs > 1, network disks = 0): local SSDs selected for RAID'ing
  5. (local SSDs >= 0, network disks > 1): network disks selected for RAID'ing

Please let me know if this is the correct assumption. Also, if there is a combination of local SSDs and one network disk, all the disks will be mounted. Is that also the expected behaviour?

@nameisbhaskar (Contributor, Author)
Made some changes to only use network disks when present:

  1. (local SSDs = 0, network disks = 1): no RAID'ing; mount the network disk
  2. (local SSDs = 1, network disks = 0): no RAID'ing; mount the local SSD
  3. (local SSDs >= 1, network disks = 1): no RAID'ing; mount the network disk
  4. (local SSDs > 1, network disks = 0): local SSDs selected for RAID'ing
  5. (local SSDs >= 0, network disks > 1): network disks selected for RAID'ing
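As a minimal sketch, the selection rules above could be expressed like this; `selectDisks` is a hypothetical helper for illustration, not the actual roachprod implementation:

```go
package main

import "fmt"

// selectDisks returns the disks to use and whether to RAID0 them,
// following the rules above: network disks take priority whenever any
// are present, and RAID0 is applied only when more than one disk of
// the chosen type exists.
func selectDisks(localSSDs, networkDisks []string) (disks []string, raid0 bool) {
	if len(networkDisks) > 0 {
		return networkDisks, len(networkDisks) > 1
	}
	return localSSDs, len(localSSDs) > 1
}

func main() {
	// Local SSDs alongside one network disk: mount only the network disk, no RAID.
	disks, raid0 := selectDisks([]string{"/dev/nvme0n1", "/dev/nvme1n1"}, []string{"/dev/sdb"})
	fmt.Println(disks, raid0) // [/dev/sdb] false

	// Only local SSDs, more than one: RAID0 the local SSDs.
	disks, raid0 = selectDisks([]string{"/dev/nvme0n1", "/dev/nvme1n1"}, nil)
	fmt.Println(disks, raid0) // [/dev/nvme0n1 /dev/nvme1n1] true
}
```

This mirrors scenarios 1–5: the network-disk branch covers cases 1, 3, and 5, and the local-SSD branch covers cases 2 and 4.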

Commit: roachprod: RAID0 only network disks if both local and network are present
@srosenberg added the backport-23.2.x (Flags PRs that need to be backported to 23.2) and backport-23.1.x (Flags PRs that need to be backported to 23.1) labels Mar 12, 2024
@srosenberg (Member) left a review:

Looks good, thanks!

@nameisbhaskar (Contributor, Author)

Thanks for reviewing and approving the change @srosenberg!

@nameisbhaskar (Contributor, Author)
bors r=srosenberg

@craig (bot) commented Mar 13, 2024

@craig craig bot merged commit aa755fd into cockroachdb:master Mar 13, 2024
19 checks passed
@blathers-crl (bot) commented Mar 13, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 6fd7ff8 to blathers/backport-release-23.1-119906: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nameisbhaskar nameisbhaskar deleted the user/bhaskar/persistent_disk_issue_gce branch March 13, 2024 06:50
@renatolabs (Contributor)
Nice work!

Sorry if I missed something, but should we do something for Azure as well?

@nameisbhaskar (Contributor, Author)

> Sorry if I missed something, but should we do something for Azure as well?

Azure does not have RAID'ing implemented.

Labels
backport-23.1.x (Flags PRs that need to be backported to 23.1), backport-23.2.x (Flags PRs that need to be backported to 23.2)
Development

Successfully merging this pull request may close these issues.

roachprod,roachtest: don't RAID0 local SSD and PD/EBS
4 participants