*: skip some tests under race #116080

Merged 1 commit into cockroachdb:master on Dec 11, 2023

Conversation

@rickystewart (Collaborator)

All of these tests have OOM or timeout issues when running under race.

Epic: CRDB-8308
Release note: None
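
For context, a minimal sketch of how a skip-under-race typically looks in CockroachDB test code, assuming the pkg/testutils/skip helper; the test name is illustrative, not one of the tests touched here:

```go
package example_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
)

func TestMemoryHungryOperation(t *testing.T) {
	// Under the race detector the test's memory footprint grows several
	// fold, so skip it rather than risk an OOM or a timeout.
	skip.UnderRace(t, "OOMs or times out under race")

	// ... test body ...
}
```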

@rickystewart requested review from rail and a team December 11, 2023 18:57
@rickystewart requested review from a team as code owners December 11, 2023 18:58
@rickystewart requested a review from a team December 11, 2023 18:58
@rickystewart requested review from a team as code owners December 11, 2023 18:58
@rickystewart requested review from dt and mgartner and removed request for a team December 11, 2023 18:58
@cockroach-teamcity (Member)

This change is Reviewable

@srosenberg (Member) left a comment

Do we know why all of these tests are OOMing under race? It's a bit worrisome that it's not just a single test; quite a few seem to be OOMing all of a sudden.


@rickystewart (Collaborator, Author)

> Do we know why all of these tests are OOMing under race?

The memory overhead for race is huge: 5-10x according to the documentation. It's not surprising some already memory-intensive tests are hitting the 4GB limit with race enabled.

> It's a bit worrisome that it's not just a single test; quite a few seem to be OOMing all of a sudden.

We just started reporting test failures from EngFlow for race tests. It doesn't have anything to do with a code change.
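
As background on how "under race" is detected at compile time (relevant to the 5-10x overhead mentioned above): the Go race detector exposes no runtime API, so skip helpers typically gate on a build-tag-selected constant. A minimal sketch under that assumption, using a hypothetical raceutil package rather than quoting the repo's actual helper:

```go
// raceenabled_race.go: compiled only when the -race flag is set.
//go:build race

package raceutil

// Enabled is true when the binary was built with the race detector.
const Enabled = true
```

```go
// raceenabled_norace.go: compiled for ordinary builds.
//go:build !race

package raceutil

// Enabled is false when the race detector is off.
const Enabled = false
```

A skip helper can then call t.Skip when Enabled is true, which is the same idea as the skip.UnderRace call sketched earlier.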

@srosenberg (Member)

> We just started reporting test failures from EngFlow for race tests. It doesn't have anything to do with a code change.

Whereas before OOMs weren't being reported as test failures?

@rickystewart (Collaborator, Author)

> We just started reporting test failures from EngFlow for race tests. It doesn't have anything to do with a code change.

> Whereas before OOMs weren't being reported as test failures?

Whereas before the OOMs were less likely to occur, since unit test machines have ~192GB of memory and nightly stress machines have ~32GB. It's just a more constrained environment, and unlike old-style nightly stress, we manage memory on a per-test-binary basis rather than as a "normal" test run that lets memory "float" between all concurrently running tests on the same machine.

@srosenberg (Member)

> Whereas before the OOMs were less likely to occur, since unit test machines have ~192GB of memory and nightly stress machines have ~32GB. It's just a more constrained environment, and unlike old-style nightly stress, we manage memory on a per-test-binary basis rather than as a "normal" test run that lets memory "float" between all concurrently running tests on the same machine.

Got it, thanks.

@srosenberg self-requested a review December 11, 2023 20:45
@rickystewart (Collaborator, Author)

TFTRs!

bors r=rail,srosenberg

@craig bot (Contributor) commented Dec 11, 2023

Build failed (retrying...)

@craig bot (Contributor) commented Dec 11, 2023

Build succeeded.

@craig bot merged commit 3f68d43 into cockroachdb:master Dec 11, 2023
9 checks passed
@michae2 (Collaborator) commented Dec 21, 2023

@rickystewart is there an issue tracking all of these tests that are skipped under race due to memory limits? Are we planning to eventually raise the memory limits in order to run these tests?

@rickystewart (Collaborator, Author)

No, there is no tracking issue.

> Are we planning to eventually raise the memory limits in order to run these tests?

No. It's unclear what we would raise the limit to. The memory overhead of running stuff under race can be 10x. Is there e.g. a particular test you're worried about?
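
If someone does want to gauge how close a specific test gets to the limit, one rough way is to log the Go runtime's memory stats at the end of the test and compare a normal run against a -race run. A minimal sketch; the helper is illustrative and not part of the repo:

```go
package example_test

import (
	"runtime"
	"testing"
)

// reportMemUsage logs how much memory the Go runtime is using and how much
// it has obtained from the OS, which approximates the footprint seen by a
// per-test memory limit.
func reportMemUsage(t *testing.T) {
	t.Helper()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	t.Logf("heap in use: %d MiB, obtained from OS: %d MiB",
		ms.HeapInuse>>20, ms.Sys>>20)
}

func TestPossiblyMemoryHungry(t *testing.T) {
	defer reportMemUsage(t)
	// ... test body ...
}
```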

@mgartner (Collaborator)

Can we size up the hardware to match the hardware that was successfully running these tests in the previous TeamCity configuration? It will take time to reduce the overhead of these tests, and that's not work we've planned for in the near future. We'd rather not skip tests and lose coverage.

@rickystewart (Collaborator, Author) commented Dec 21, 2023

> Can we size up the hardware to match the hardware that was successfully running these tests in the previous TeamCity configuration?

For the old iteration of nightly stress, the machines had 32GB of memory. For running a single unit test, 32GB is excessive, IMO. One worry I have is that we'll "hide" memory leaks, especially for tests that are not running under race. Having a lower memory limit is desirable inasmuch as we get notified if some code has janky logic that ends up allocating 31GB or something like that. (I'm not saying this concern is any more or less important than the one you raised, just that it's one element on my mind.)

Currently the Large pool gets 8GB. As an incremental step we can bump it up to 16GB (still plenty of memory for a single test) and then I'll see what can be un-skipped. If it still needs to go higher then we can address it as a follow-up.

I'll make the request with the vendor today with the expectation that it will probably not be implemented until the new year.

@yuzefovich (Member)

One data point is that some of the skips (e.g. TestImportComputed) might not be OOM-related: rather, the failed runs were using tenant randomization, and 1- or 2-CPU EngFlow executors get overloaded by tests that start multi-node clusters when the default test tenant is also started. Tenant randomization is now disabled automatically as of #116910.

I'll audit all of the skips-under-race that we merged in the last two weeks to see which can be explained by this, and un-skip those.
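
For readers unfamiliar with the knob being referenced: tests that start their own servers can opt out of the default test tenant explicitly. A rough sketch, with the caveat that the field and constant names below (DefaultTestTenant, base.TODOTestTenantDisabled) are recalled from crdb test code of this era and may not match the current spelling:

```go
package example_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
)

func TestWithoutDefaultTestTenant(t *testing.T) {
	// Opt out of the randomized default test tenant so that small (1- or
	// 2-CPU) executors aren't also running an extra in-process tenant.
	// NOTE: field and constant names are assumptions; check pkg/base.
	srv, sqlDB, _ := serverutils.StartServer(t, base.TestServerArgs{
		DefaultTestTenant: base.TODOTestTenantDisabled,
	})
	defer srv.Stopper().Stop(context.Background())

	_ = sqlDB // ... test body ...
}
```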

@yuzefovich (Member)

I sent #116986 to unskip a subset of recent skips.

@rickystewart (Collaborator, Author)

These tests (among others) are un-skipped in #117833. #117894 skipped more tests, but those were generally already skipped under stressrace (with a couple of exceptions for tests that don't complete in a reasonable amount of time under race under any circumstances and should have been skipped from the jump).

Successfully merging this pull request may close these issues.

sql/pgwire/pgwirecancel: TestCancelQueryOtherNode failed