vfs: Deflake TestDiskHealthChecking_File* #2734

itsbilal · 2023-07-12T15:04:57Z

Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of channels, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.

Removes some tests that tested for false positives on disk stalls,
even though scheduler delays can also cause perceived disk
stalls at the small thresholds that were used.

Also selectively skips two tests on windows that almost exclusively
just flaked there.

Fixes #1718

cockroach-teamcity · 2023-07-12T15:05:10Z

This change is

itsbilal · 2023-07-12T15:15:30Z

TestDiskHealthChecking_File/write#01 (which I did not deflake in this PR) failed in the Windows run. Let me deflake that too

RaduBerinde

Reviewable status: 0 of 1 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @sumeerbhola)

vfs/disk_health_test.go line 518 at r1 (raw file):

			select {
			case <-onStall:
			case <-time.After(10 * stallThreshold):

In rare cases (or stress/race runs and such) we can get arbitrary delays on the order of this value (100ms). I'd add something like 10 seconds to it just for good measure.

vfs/disk_health_test.go line 565 at r1 (raw file):

		case stalledPath := <-stalled:
			require.Equal(t, filename, stalledPath)
		case <-time.After(10 * stallThreshold):

ditto

itsbilal

TFTR!

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)

vfs/disk_health_test.go line 518 at r1 (raw file):

Previously, RaduBerinde wrote…

In rare cases (or stress/race runs and such) we can get arbitrary delays on the order of this value (100ms). I'd add something like 10 seconds to it just for good measure.

Done.

vfs/disk_health_test.go line 565 at r1 (raw file):

Previously, RaduBerinde wrote…

ditto

Done.

RaduBerinde

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)

vfs/disk_health_test.go line 299 at r1 (raw file):

				}
			} else { // no false positives
				select {

This seems robust to me.. if the test is slow, we might get a false pass. But we should never get a false fail

RaduBerinde

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)

vfs/disk_health_test.go line 299 at r1 (raw file):

Previously, RaduBerinde wrote…

This seems robust to me.. if the test is slow, we might get a false pass. But we should never get a false fail

Oh, or I guess we're getting reported stalls because things are slower than the 50ms threshold, that makes sense.

itsbilal

TFTR!

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)

vfs/disk_health_test.go line 299 at r1 (raw file):

Previously, RaduBerinde wrote…

Oh, or I guess we're getting reported stalls because things are slower than the 50ms threshold, that makes sense.

That's what I thought at first. Brought it back for now with far more generous timeouts, let's see what gives 🤞 if it flakes again I swing the hammer at it.

RaduBerinde

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)

vfs/disk_health_test.go line 299 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

That's what I thought at first. Brought it back for now with far more generous timeouts, let's see what gives 🤞 if it flakes again I swing the hammer at it.

A 50ms scheduling delay can happen, so there would still be spurious stalls. We'd need to up the slowThreshold to a few seconds. I'm in favor of axing them though.

itsbilal · 2023-07-12T18:46:18Z

Latest change also selectively skips two diskHealthChecking tests on Windows due to a higher observed flake-probability there. Selective skips like that are allowed under the new guidelines.

Previously we were relying on sleeps and timing-based ways of synchronization between observing a stall in the disk health checking goroutine and confirming for it in the test code itself. This change adds a more direct synchronization between the two events through the use of channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test was previously doing a relatively thread-unsafe use of a map, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Removes some tests that tested for false positives on disk stalls, even though scheduler delays can also cause perceived disk stalls at the small thresholds that were used. Also selectively skips two tests on windows that almost exclusively just flaked there. Fixes cockroachdb#1718.

itsbilal

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)

vfs/disk_health_test.go line 299 at r1 (raw file):

Previously, RaduBerinde wrote…

A 50ms scheduling delay can happen, so there would still be spurious stalls. We'd need to up the slowThreshold to a few seconds. I'm in favor of axing them though.

Axed.

itsbilal requested a review from a team July 12, 2023 15:04

itsbilal self-assigned this Jul 12, 2023

itsbilal requested a review from sumeerbhola July 12, 2023 15:04

RaduBerinde reviewed Jul 12, 2023

View reviewed changes

itsbilal force-pushed the disk-health-deflake branch 2 times, most recently from e78a33d to 4733dd1 Compare July 12, 2023 18:12

itsbilal commented Jul 12, 2023

View reviewed changes

itsbilal changed the title ~~vfs: Deflake TestDiskHealthChecking_Filesystem_{Create,Close}~~ vfs: Deflake TestDiskHealthChecking_File* Jul 12, 2023

RaduBerinde reviewed Jul 12, 2023

View reviewed changes

RaduBerinde approved these changes Jul 12, 2023

View reviewed changes

itsbilal force-pushed the disk-health-deflake branch 2 times, most recently from cfa1e77 to 9e8fb52 Compare July 12, 2023 18:24

itsbilal commented Jul 12, 2023

View reviewed changes

RaduBerinde reviewed Jul 12, 2023

View reviewed changes

itsbilal force-pushed the disk-health-deflake branch from 9e8fb52 to f7c8b53 Compare July 12, 2023 18:45

itsbilal force-pushed the disk-health-deflake branch from f7c8b53 to 6458b11 Compare July 12, 2023 18:47

itsbilal commented Jul 12, 2023

View reviewed changes

itsbilal merged commit 2ad4e66 into cockroachdb:master Jul 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vfs: Deflake TestDiskHealthChecking_File* #2734

vfs: Deflake TestDiskHealthChecking_File* #2734

itsbilal commented Jul 12, 2023 •

edited

Loading

cockroach-teamcity commented Jul 12, 2023

itsbilal commented Jul 12, 2023

RaduBerinde left a comment

itsbilal left a comment

RaduBerinde left a comment

RaduBerinde left a comment

itsbilal left a comment

RaduBerinde left a comment

itsbilal commented Jul 12, 2023

itsbilal left a comment

vfs: Deflake TestDiskHealthChecking_File* #2734

vfs: Deflake TestDiskHealthChecking_File* #2734

Conversation

itsbilal commented Jul 12, 2023 • edited Loading

cockroach-teamcity commented Jul 12, 2023

itsbilal commented Jul 12, 2023

RaduBerinde left a comment

Choose a reason for hiding this comment

itsbilal left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

itsbilal left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

itsbilal commented Jul 12, 2023

itsbilal left a comment

Choose a reason for hiding this comment

itsbilal commented Jul 12, 2023 •

edited

Loading