Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vfs: Deflake TestDiskHealthChecking_File* #2734

Merged
merged 1 commit into from
Jul 12, 2023

Conversation

itsbilal
Copy link
Member

@itsbilal itsbilal commented Jul 12, 2023

Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of channels, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.

Removes some tests that tested for false positives on disk stalls,
even though scheduler delays can also cause perceived disk
stalls at the small thresholds that were used.

Also selectively skips two tests on windows that almost exclusively
just flaked there.

Fixes #1718

@itsbilal itsbilal requested a review from a team July 12, 2023 15:04
@itsbilal itsbilal self-assigned this Jul 12, 2023
@itsbilal itsbilal requested a review from sumeerbhola July 12, 2023 15:04
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@itsbilal
Copy link
Member Author

TestDiskHealthChecking_File/write#01 (which I did not deflake in this PR) failed in the Windows run. Let me deflake that too

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


vfs/disk_health_test.go line 518 at r1 (raw file):

			select {
			case <-onStall:
			case <-time.After(10 * stallThreshold):

In rare cases (or stress/race runs and such) we can get arbitrary delays on the order of this value (100ms). I'd add something like 10 seconds to it just for good measure.


vfs/disk_health_test.go line 565 at r1 (raw file):

		case stalledPath := <-stalled:
			require.Equal(t, filename, stalledPath)
		case <-time.After(10 * stallThreshold):

ditto

@itsbilal itsbilal force-pushed the disk-health-deflake branch 2 times, most recently from e78a33d to 4733dd1 Compare July 12, 2023 18:12
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)


vfs/disk_health_test.go line 518 at r1 (raw file):

Previously, RaduBerinde wrote…

In rare cases (or stress/race runs and such) we can get arbitrary delays on the order of this value (100ms). I'd add something like 10 seconds to it just for good measure.

Done.


vfs/disk_health_test.go line 565 at r1 (raw file):

Previously, RaduBerinde wrote…

ditto

Done.

@itsbilal itsbilal changed the title vfs: Deflake TestDiskHealthChecking_Filesystem_{Create,Close} vfs: Deflake TestDiskHealthChecking_File* Jul 12, 2023
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)


vfs/disk_health_test.go line 299 at r1 (raw file):

				}
			} else { // no false positives
				select {

This seems robust to me.. if the test is slow, we might get a false pass. But we should never get a false fail

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)


vfs/disk_health_test.go line 299 at r1 (raw file):

Previously, RaduBerinde wrote…

This seems robust to me.. if the test is slow, we might get a false pass. But we should never get a false fail

Oh, or I guess we're getting reported stalls because things are slower than the 50ms threshold, that makes sense.

@itsbilal itsbilal force-pushed the disk-health-deflake branch 2 times, most recently from cfa1e77 to 9e8fb52 Compare July 12, 2023 18:24
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)


vfs/disk_health_test.go line 299 at r1 (raw file):

Previously, RaduBerinde wrote…

Oh, or I guess we're getting reported stalls because things are slower than the 50ms threshold, that makes sense.

That's what I thought at first. Brought it back for now with far more generous timeouts, let's see what gives 🤞 if it flakes again I swing the hammer at it.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)


vfs/disk_health_test.go line 299 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

That's what I thought at first. Brought it back for now with far more generous timeouts, let's see what gives 🤞 if it flakes again I swing the hammer at it.

A 50ms scheduling delay can happen, so there would still be spurious stalls. We'd need to up the slowThreshold to a few seconds. I'm in favor of axing them though.

@itsbilal itsbilal force-pushed the disk-health-deflake branch from 9e8fb52 to f7c8b53 Compare July 12, 2023 18:45
@itsbilal
Copy link
Member Author

Latest change also selectively skips two diskHealthChecking tests on Windows due to a higher observed flake-probability there. Selective skips like that are allowed under the new guidelines.

Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of channels, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.

Removes some tests that tested for false positives on disk stalls,
even though scheduler delays can also cause perceived disk
stalls at the small thresholds that were used.

Also selectively skips two tests on windows that almost exclusively
just flaked there.

Fixes cockroachdb#1718.
@itsbilal itsbilal force-pushed the disk-health-deflake branch from f7c8b53 to 6458b11 Compare July 12, 2023 18:47
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)


vfs/disk_health_test.go line 299 at r1 (raw file):

Previously, RaduBerinde wrote…

A 50ms scheduling delay can happen, so there would still be spurious stalls. We'd need to up the slowThreshold to a few seconds. I'm in favor of axing them though.

Axed.

@itsbilal itsbilal merged commit 2ad4e66 into cockroachdb:master Jul 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vfs: flaky test TestDiskHealthChecking_Filesystem
3 participants