*: skip some tests under race #116080

Merged 1 commit into cockroachdb:master on Dec 11, 2023

Conversation

@rickystewart (Collaborator)

All of these tests have OOM or timeout issues when running under race.

Epic: CRDB-8308
Release note: None
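
For context, a minimal sketch of how a skip-under-race typically looks in CockroachDB test code, assuming the pkg/testutils/skip helper; the test name is illustrative, not one of the tests touched here:

```go
package example_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
)

func TestMemoryHungryOperation(t *testing.T) {
	// Under the race detector the test's memory footprint grows several
	// fold, so skip it rather than risk an OOM or a timeout.
	skip.UnderRace(t, "OOMs or times out under race")

	// ... test body ...
}
```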

@rickystewart requested review from rail and a team December 11, 2023 18:57
@rickystewart requested review from a team as code owners December 11, 2023 18:58
@rickystewart requested a review from a team December 11, 2023 18:58
@rickystewart requested review from a team as code owners December 11, 2023 18:58
@rickystewart requested review from dt and mgartner and removed request for a team December 11, 2023 18:58
@cockroach-teamcity (Member)

This change is Reviewable

@srosenberg (Member) left a comment

Do we know why all of these tests are OOMing under race? It's a bit worrisome that it's not just a single test; quite a few seem to be OOMing all of a sudden.


@rickystewart (Collaborator, Author)

> Do we know why all of these tests are OOMing under race?

The memory overhead for race is huge: 5-10x according to the documentation. It's not surprising some already memory-intensive tests are hitting the 4GB limit with race enabled.

> It's a bit worrisome that it's not just a single test; quite a few seem to be OOMing all of a sudden.

We just started reporting test failures from EngFlow for race tests. It doesn't have anything to do with a code change.
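
As background on how "under race" is detected at compile time (relevant to the 5-10x overhead mentioned above): the Go race detector exposes no runtime API, so skip helpers typically gate on a build-tag-selected constant. A minimal sketch under that assumption, using a hypothetical raceutil package rather than quoting the repo's actual helper:

```go
// raceenabled_race.go: compiled only when the -race flag is set.
//go:build race

package raceutil

// Enabled is true when the binary was built with the race detector.
const Enabled = true
```

```go
// raceenabled_norace.go: compiled for ordinary builds.
//go:build !race

package raceutil

// Enabled is false when the race detector is off.
const Enabled = false
```

A skip helper can then call t.Skip when Enabled is true, which is the same idea as the skip.UnderRace call sketched earlier.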

@srosenberg (Member)

> We just started reporting test failures from EngFlow for race tests. It doesn't have anything to do with a code change.

Whereas before OOMs weren't being reported as test failures?

@rickystewart (Collaborator, Author)

> We just started reporting test failures from EngFlow for race tests. It doesn't have anything to do with a code change.

> Whereas before OOMs weren't being reported as test failures?

Whereas before the OOMs were less likely to occur, since unit test machines have ~192GB of memory and nightly stress machines have ~32GB. It's just a more constrained environment, and unlike old-style nightly stress, we manage memory on a per-test-binary basis rather than as a "normal" test run that lets memory "float" between all concurrently running tests on the same machine.

@srosenberg (Member)

> Whereas before the OOMs were less likely to occur, since unit test machines have ~192GB of memory and nightly stress machines have ~32GB. It's just a more constrained environment, and unlike old-style nightly stress, we manage memory on a per-test-binary basis rather than as a "normal" test run that lets memory "float" between all concurrently running tests on the same machine.

Got it, thanks.

@srosenberg self-requested a review December 11, 2023 20:45
@rickystewart (Collaborator, Author)

TFTRs!

bors r=rail,srosenberg

@craig bot (Contributor) commented Dec 11, 2023

Build failed (retrying...)

@craig bot (Contributor) commented Dec 11, 2023

Build succeeded.

@craig bot merged commit 3f68d43 into cockroachdb:master Dec 11, 2023
9 checks passed
@michae2 (Collaborator) commented Dec 21, 2023

@rickystewart is there an issue tracking all of these tests that are skipped under race due to memory limits? Are we planning to eventually raise the memory limits in order to run these tests?

@rickystewart (Collaborator, Author)

No, there is no tracking issue.

> Are we planning to eventually raise the memory limits in order to run these tests?

No. It's unclear what we would raise the limit to. The memory overhead of running stuff under race can be 10x. Is there e.g. a particular test you're worried about?
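
If someone does want to gauge how close a specific test gets to the limit, one rough way is to log the Go runtime's memory stats at the end of the test and compare a normal run against a -race run. A minimal sketch; the helper is illustrative and not part of the repo:

```go
package example_test

import (
	"runtime"
	"testing"
)

// reportMemUsage logs how much memory the Go runtime is using and how much
// it has obtained from the OS, which approximates the footprint seen by a
// per-test memory limit.
func reportMemUsage(t *testing.T) {
	t.Helper()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	t.Logf("heap in use: %d MiB, obtained from OS: %d MiB",
		ms.HeapInuse>>20, ms.Sys>>20)
}

func TestPossiblyMemoryHungry(t *testing.T) {
	defer reportMemUsage(t)
	// ... test body ...
}
```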

@mgartner (Collaborator)

Can we size up the hardware to match the hardware that was successfully running these tests in the previous TeamCity configuration? It will take time to reduce the overhead of these tests, and that's not work we've planned for in the near future. We'd rather not skip tests and lose coverage.

@rickystewart (Collaborator, Author) commented Dec 21, 2023

> Can we size up the hardware to match the hardware that was successfully running these tests in the previous TeamCity configuration?

For the old iteration of nightly stress, the machines had 32GB of memory. For running a single unit test, 32GB is excessive, IMO. One worry I have is that we'll "hide" memory leaks, especially for tests that are not running under race. Having a lower memory limit is desirable inasmuch as we get notified if some code has janky logic that ends up allocating 31GB or something like that. (I'm not saying this concern is any more or less important than the one you raised, just that it's one element on my mind.)

Currently the Large pool gets 8GB. As an incremental step we can bump it up to 16GB (still plenty of memory for a single test) and then I'll see what can be un-skipped. If it still needs to go higher then we can address it as a follow-up.

I'll make the request with the vendor today with the expectation that it will probably not be implemented until the new year.

@yuzefovich (Member)

One data point is that some of the skips (e.g. TestImportComputed) might not be OOM-related: rather, the failed runs were using tenant randomization, and 1- or 2-CPU EngFlow executors get overloaded by tests that start multi-node clusters when the default test tenant is also started. Tenant randomization is now disabled automatically as of #116910.

I'll audit all of the skips-under-race that we merged in the last two weeks to see which can be explained by this, and un-skip those.
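
For readers unfamiliar with the knob being referenced: tests that start their own servers can opt out of the default test tenant explicitly. A rough sketch, with the caveat that the field and constant names below (DefaultTestTenant, base.TODOTestTenantDisabled) are recalled from crdb test code of this era and may not match the current spelling:

```go
package example_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
)

func TestWithoutDefaultTestTenant(t *testing.T) {
	// Opt out of the randomized default test tenant so that small (1- or
	// 2-CPU) executors aren't also running an extra in-process tenant.
	// NOTE: field and constant names are assumptions; check pkg/base.
	srv, sqlDB, _ := serverutils.StartServer(t, base.TestServerArgs{
		DefaultTestTenant: base.TODOTestTenantDisabled,
	})
	defer srv.Stopper().Stop(context.Background())

	_ = sqlDB // ... test body ...
}
```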

@yuzefovich (Member)

I sent #116986 to unskip a subset of recent skips.

@rickystewart (Collaborator, Author)

These tests (among others) are un-skipped in #117833. #117894 skipped more tests, but those were generally already skipped under stressrace (with a couple of exceptions for tests that don't complete in a reasonable amount of time under race under any circumstances and should have been skipped from the jump).

Successfully merging this pull request may close these issues.

sql/pgwire/pgwirecancel: TestCancelQueryOtherNode failed