Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vfs: flaky test TestDiskHealthChecking_Filesystem #1718

Closed
nicktrav opened this issue May 23, 2022 · 9 comments · Fixed by #2734
Closed

vfs: flaky test TestDiskHealthChecking_Filesystem #1718

nicktrav opened this issue May 23, 2022 · 9 comments · Fixed by #2734
Assignees

Comments

@nicktrav
Copy link
Contributor

See: https://github.com/cockroachdb/pebble/runs/6563748979?check_suite_focus=true

--- FAIL: TestDiskHealthChecking_Filesystem (0.41s)
    --- FAIL: TestDiskHealthChecking_Filesystem/remove-all (0.05s)
        assertion_compare.go:211: 
            	Error Trace:	disk_health_test.go:276
            	Error:      	"0" is not greater than "0"
            	Test:       	TestDiskHealthChecking_Filesystem/remove-all
            	Messages:   	[]
@nicktrav
Copy link
Contributor Author

Possibly related to #1703.

@jbowens
Copy link
Collaborator

jbowens commented May 27, 2022

Having trouble reproducing locally. Maybe in the shared-cpu environment of CI, the goroutine scheduling is just less predictable and the ticker routine just doesn't get scheduled within the 50ms :/

1594 runs so far, 0 failures, over 10m0s

@nicktrav
Copy link
Contributor Author

Another one on macOS: https://github.com/cockroachdb/pebble/runs/7083257581?check_suite_focus=true

--- FAIL: TestDiskHealthChecking_Filesystem (0.44s)
    --- FAIL: TestDiskHealthChecking_Filesystem/remove (0.05s)
        assertion_compare.go:313: 
            	Error Trace:	disk_health_test.go:276
            	Error:      	"0" is not greater than "0"
            	Test:       	TestDiskHealthChecking_Filesystem/remove
            	Messages:   	[]

@nicktrav
Copy link
Contributor Author

Managed to reproduce this on my macbook. Looks like the same run caught another test failure:

$ go test -mod=vendor -tags 'invariants' -exec 'stress -p 1' -timeout 0 -test.v -run TestDiskHealthChecking_Filesystem ./vfs
...
/var/folders/ps/wznnm1sn4_g1rn8jjlgx57rh0000gq/T/go-stress-20220627T170345-028551772
=== RUN   TestDiskHealthChecking_Filesystem
=== RUN   TestDiskHealthChecking_Filesystem/rename
=== RUN   TestDiskHealthChecking_Filesystem/reuse-for-write
=== RUN   TestDiskHealthChecking_Filesystem/create
=== RUN   TestDiskHealthChecking_Filesystem/link
=== RUN   TestDiskHealthChecking_Filesystem/mkdir-all
    assertion_compare.go:313:
                Error Trace:    disk_health_test.go:276
                Error:          "0" is not greater than "0"
                Test:           TestDiskHealthChecking_Filesystem/mkdir-all
                Messages:       []
=== RUN   TestDiskHealthChecking_Filesystem/remove
    assertion_compare.go:313:
                Error Trace:    disk_health_test.go:276
                Error:          "0" is not greater than "0"
                Test:           TestDiskHealthChecking_Filesystem/remove
                Messages:       []
=== RUN   TestDiskHealthChecking_Filesystem/remove-all
--- FAIL: TestDiskHealthChecking_Filesystem (0.56s)
    --- PASS: TestDiskHealthChecking_Filesystem/rename (0.11s)
    --- PASS: TestDiskHealthChecking_Filesystem/reuse-for-write (0.05s)
    --- PASS: TestDiskHealthChecking_Filesystem/create (0.08s)
    --- PASS: TestDiskHealthChecking_Filesystem/link (0.08s)
    --- FAIL: TestDiskHealthChecking_Filesystem/mkdir-all (0.07s)
    --- FAIL: TestDiskHealthChecking_Filesystem/remove (0.07s)
    --- PASS: TestDiskHealthChecking_Filesystem/remove-all (0.05s)
=== RUN   TestDiskHealthChecking_Filesystem_Close
--- PASS: TestDiskHealthChecking_Filesystem_Close (0.15s)
FAIL


ERROR: exit status 1

34m5s: 3902 runs so far, 2 failures (0.05%)

This doesn't happen consistently / frequently enough to warrant too much investigation right now, but it's good to know that this is somewhat reproducible.

@nicktrav
Copy link
Contributor Author

nicktrav commented Nov 9, 2022

@nicktrav
Copy link
Contributor Author

Haven't seen this one before:

--- FAIL: TestDiskHealthChecking_Filesystem_Close (0.17s)
    disk_health_test.go:310: 
        	Error Trace:	disk_health_test.go:310
        	Error:      	map[string]time.Duration{"bar":40320000, "foo":56489000} does not contain "bax"
        	Test:       	TestDiskHealthChecking_Filesystem_Close

@joshimhoff
Copy link
Contributor

joshimhoff commented Jan 25, 2023

@bananabrick
Copy link
Contributor

Another one:

2023-05-17T19:32:54.5893691Z === RUN   TestDiskHealthChecking_Filesystem_Close
2023-05-17T19:32:54.5894097Z     disk_health_test.go:536: 
2023-05-17T19:32:54.5894560Z         	Error Trace:	disk_health_test.go:536
2023-05-17T19:32:54.5895267Z         	Error:      	map[string]time.Duration{"bar":47563100, "foo":47101600} does not contain "bax"
2023-05-17T19:32:54.5896012Z         	Test:       	TestDiskHealthChecking_Filesystem_Close

@RaduBerinde
Copy link
Member

@itsbilal itsbilal self-assigned this Jul 11, 2023
itsbilal added a commit to itsbilal/pebble that referenced this issue Jul 12, 2023
Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of a channel, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Fixes cockroachdb#1718.
itsbilal added a commit to itsbilal/pebble that referenced this issue Jul 12, 2023
Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of channels, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.

Also deletes TestDiskHealthChecking_File_* variants that tested
for lack of false positives, as given the tiny stall detection
thresholds in the test, spurious stalls are bound to happen on
busy nodes.

Fixes cockroachdb#1718.
itsbilal added a commit to itsbilal/pebble that referenced this issue Jul 12, 2023
Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of channels, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.

Fixes cockroachdb#1718.
itsbilal added a commit to itsbilal/pebble that referenced this issue Jul 12, 2023
Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of channels, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.

Fixes cockroachdb#1718.
itsbilal added a commit to itsbilal/pebble that referenced this issue Jul 12, 2023
Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of channels, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.

Also selectively skips two tests on windows that almost exclusively
just flaked there.

Fixes cockroachdb#1718.
itsbilal added a commit to itsbilal/pebble that referenced this issue Jul 12, 2023
Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of channels, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.

Removes some tests that tested for false positives on disk stalls,
even though scheduler delays can also cause perceived disk
stalls at the small thresholds that were used.

Also selectively skips two tests on windows that almost exclusively
just flaked there.

Fixes cockroachdb#1718.
itsbilal added a commit that referenced this issue Jul 12, 2023
Previously we were relying on sleeps and timing-based ways of
synchronization between observing a stall in the disk health checking
goroutine and confirming for it in the test code itself. This change
adds a more direct synchronization between the two events through
the use of channels, to deflake both tests. Furthermore,
the TestDiskHealthChecking_Filesystem_Close test was previously
doing a relatively thread-unsafe use of a map, which increased
the chances of a flake.

Also closes diskHealthCheckingFiles created in some Create operations
to prevent goroutine leaks.

Removes some tests that tested for false positives on disk stalls,
even though scheduler delays can also cause perceived disk
stalls at the small thresholds that were used.

Also selectively skips two tests on windows that almost exclusively
just flaked there.

Fixes #1718.
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants