vfs: Deflake TestDiskHealthChecking_File* #2734
Conversation
Reviewable status: 0 of 1 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
vfs/disk_health_test.go
line 518 at r1 (raw file):
```go
select {
case <-onStall:
case <-time.After(10 * stallThreshold):
```
In rare cases (or stress/race runs and such) we can get arbitrary delays on the order of this value (100ms). I'd add something like 10 seconds to it just for good measure.
vfs/disk_health_test.go
line 565 at r1 (raw file):
```go
case stalledPath := <-stalled:
	require.Equal(t, filename, stalledPath)
case <-time.After(10 * stallThreshold):
```
ditto
Force-pushed from e78a33d to 4733dd1
TFTR!
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)
vfs/disk_health_test.go
line 518 at r1 (raw file):
Previously, RaduBerinde wrote…
In rare cases (or stress/race runs and such) we can get arbitrary delays on the order of this value (100ms). I'd add something like 10 seconds to it just for good measure.
Done.
vfs/disk_health_test.go
line 565 at r1 (raw file):
Previously, RaduBerinde wrote…
ditto
Done.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)
vfs/disk_health_test.go
line 299 at r1 (raw file):
```go
	}
} else { // no false positives
	select {
```
This seems robust to me. If the test is slow, we might get a false pass, but we should never get a false fail.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)
vfs/disk_health_test.go
line 299 at r1 (raw file):
Previously, RaduBerinde wrote…
This seems robust to me.. if the test is slow, we might get a false pass. But we should never get a false fail
Oh, or I guess we're getting reported stalls because things are slower than the 50ms threshold, that makes sense.
Force-pushed from cfa1e77 to 9e8fb52
TFTR!
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)
vfs/disk_health_test.go
line 299 at r1 (raw file):
Previously, RaduBerinde wrote…
Oh, or I guess we're getting reported stalls because things are slower than the 50ms threshold, that makes sense.
That's what I thought at first. Brought it back for now with far more generous timeouts; let's see what gives 🤞 If it flakes again, I'll swing the hammer at it.
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)
vfs/disk_health_test.go
line 299 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
That's what I thought at first. Brought it back for now with far more generous timeouts, let's see what gives 🤞 if it flakes again I swing the hammer at it.
A 50ms scheduling delay can happen, so there would still be spurious stalls. We'd need to up the slowThreshold to a few seconds. I'm in favor of axing them, though.
Force-pushed from 9e8fb52 to f7c8b53
The latest change also selectively skips two diskHealthChecking tests on Windows due to a higher observed flake probability there. Selective skips like that are allowed under the new guidelines.
Force-pushed from f7c8b53 to 6458b11
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)
vfs/disk_health_test.go
line 299 at r1 (raw file):
Previously, RaduBerinde wrote…
A 50ms scheduling delay can happen, so there would still be spurious stalls. We'd need to up the slowThreshold to a few seconds. I'm in favor of axing them though.
Axed.
Previously we were relying on sleeps and timing-based synchronization
between observing a stall in the disk health checking goroutine and
confirming it in the test code itself. This change adds more direct
synchronization between the two events through the use of channels,
to deflake both tests. Furthermore, the
TestDiskHealthChecking_Filesystem_Close test previously made a
relatively thread-unsafe use of a map, which increased the chances
of a flake.
Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.
Removes some tests that tested for false positives on disk stalls,
even though scheduler delays can also cause perceived disk
stalls at the small thresholds that were used.
Also selectively skips two tests on Windows that almost exclusively
flaked there.
Fixes #1718
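The channel-based synchronization described in the commit message can be sketched as follows. This is an illustrative pattern, not the PR's actual code; names like `stallNotifier` are hypothetical. Instead of sleeping and then polling shared state (such as the mutex-protected map mentioned above), the stall callback sends on a channel and the test goroutine blocks on it directly:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stallNotifier delivers the first observed stall over a channel, so the
// test can block on the event instead of sleeping and re-checking state.
type stallNotifier struct {
	once sync.Once
	ch   chan string
}

func newStallNotifier() *stallNotifier {
	// Buffer of 1 so the health-checking goroutine never blocks on send.
	return &stallNotifier{ch: make(chan string, 1)}
}

// onStall is the callback invoked by the health-checking goroutine when a
// write exceeds the stall threshold. Only the first stall is recorded.
func (n *stallNotifier) onStall(path string) {
	n.once.Do(func() { n.ch <- path })
}

func main() {
	n := newStallNotifier()
	go n.onStall("000001.log") // simulated health-checking goroutine

	select {
	case path := <-n.ch:
		fmt.Println("stalled:", path)
	case <-time.After(10 * time.Second): // generous deadline, per the review
		fmt.Println("timed out waiting for stall")
	}
}
```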