Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: double machine size for 8TB restore test #108350

Closed

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Aug 8, 2023

The test occasionally OOMs in the raft stack. Reduce the noise until we fix the
underlying causes of such OOMs.

Similar tests in GCP use n2 machines with 4 GB per CPU, and the AWS test before
this changes used a 2 GB/CPU machine:

$ roachtest list restore | grep TB
restore/tpce/32TB/aws/nodes=15/cpus=16 [disaster-recovery]
restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16 [disaster-recovery]
restore/tpce/32TB/inc-count=400/gce/nodes=15/cpus=16 [disaster-recovery]
restore/tpce/8TB/aws/nodes=10/cpus=8 [disaster-recovery]

After this commit, all O(TB) restore tests use machines with at least 32 GB of
memory.

Touches #106496
Epic: none
Release note: none

@pav-kv pav-kv requested a review from erikgrinaker August 8, 2023 08:12
@pav-kv pav-kv requested a review from a team as a code owner August 8, 2023 08:12
@pav-kv pav-kv requested review from herkolategan and renatolabs and removed request for a team August 8, 2023 08:12
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker requested review from a team and rhu713 and removed request for a team August 8, 2023 09:12
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to change the test name, which will lose the roachperf history:

https://roachperf.crdb.dev/?filter=&view=restore%2Ftpce%2F8TB%2Faws%2Fnodes%3D10%2Fcpus%3D8&tab=aws

Can/should we request more memory without bumping the machine size? See spec.High to request high-memory nodes.

Will defer to DR here, requested a review from them.

@pav-kv pav-kv force-pushed the use-larger-machine-for-8TB-test branch from 481429b to e546122 Compare August 8, 2023 09:59
@pav-kv
Copy link
Collaborator Author

pav-kv commented Aug 8, 2023

@erikgrinaker Good point. Changed the machine type to high-memory.

@erikgrinaker
Copy link
Contributor

May want to add a comment as well saying why we request highmem, and link to the Raft OOM issues.

The test occasionally OOMs in the raft stack. Reduce the noise until we fix the
underlying causes of such OOMs.

Similar tests in GCP use n2 machines with 4 GB per CPU, and the AWS test before
this changes used a 2 GB/CPU machine:

```
$ roachtest list restore | grep TB
restore/tpce/32TB/aws/nodes=15/cpus=16 [disaster-recovery]
restore/tpce/32TB/inc-count=400/aws/nodes=15/cpus=16 [disaster-recovery]
restore/tpce/32TB/inc-count=400/gce/nodes=15/cpus=16 [disaster-recovery]
restore/tpce/8TB/aws/nodes=10/cpus=8 [disaster-recovery]
```

After this commit, the 8TB restore test uses the m6i.2xlarge machine with 32 GB
memory instead of c6i.2xlarge with 16 GB. Now all O(TB) restore tests use
machines with at least 32 GB of memory.

Epic: none
Release note: none
@pav-kv pav-kv force-pushed the use-larger-machine-for-8TB-test branch from e546122 to f93d7e9 Compare August 8, 2023 12:31
@pav-kv
Copy link
Collaborator Author

pav-kv commented Aug 8, 2023

@erikgrinaker Added the comment.

I noticed that the test still got renamed from restore/tpce/8TB/aws/nodes=10/cpus=8 to restore/tpce/8TB/aws/nodes=10/cpus=8/highmem. Do you think we should work around any further to preserve the old name?

@erikgrinaker
Copy link
Contributor

I see, I'll defer to DR on how important this is for them.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Aug 8, 2023

@rhu713 PTAL. As Erik pointed out above, the test got renamed because of this change, so its roachperf history will be reset. Is this ok, or should we make an effort to retain the test name?

@pav-kv pav-kv added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 8, 2023
@msbutler
Copy link
Collaborator

msbutler commented Aug 8, 2023

I've added a discussion topic for a DR meeting tomorrow. Could we wait to merge this until then?

@pav-kv
Copy link
Collaborator Author

pav-kv commented Aug 9, 2023

Going to revert this as per discussion with the team.

@pav-kv pav-kv closed this Aug 9, 2023
@pav-kv pav-kv deleted the use-larger-machine-for-8TB-test branch August 9, 2023 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants