vfs: flaky test TestDiskHealthChecking_Filesystem #1718

nicktrav · 2022-05-23T22:22:25Z

See: https://github.com/cockroachdb/pebble/runs/6563748979?check_suite_focus=true

--- FAIL: TestDiskHealthChecking_Filesystem (0.41s)
    --- FAIL: TestDiskHealthChecking_Filesystem/remove-all (0.05s)
        assertion_compare.go:211: 
            	Error Trace:	disk_health_test.go:276
            	Error:      	"0" is not greater than "0"
            	Test:       	TestDiskHealthChecking_Filesystem/remove-all
            	Messages:   	[]

nicktrav · 2022-05-23T22:22:49Z

Possibly related to #1703.

jbowens · 2022-05-27T13:50:06Z

I'm having trouble reproducing this locally. Maybe in the shared-CPU environment of CI, goroutine scheduling is less predictable and the ticker goroutine doesn't get scheduled within the 50ms window :/

1594 runs so far, 0 failures, over 10m0s

nicktrav · 2022-06-27T23:59:52Z

Another one on macOS: https://github.com/cockroachdb/pebble/runs/7083257581?check_suite_focus=true

--- FAIL: TestDiskHealthChecking_Filesystem (0.44s)
    --- FAIL: TestDiskHealthChecking_Filesystem/remove (0.05s)
        assertion_compare.go:313: 
            	Error Trace:	disk_health_test.go:276
            	Error:      	"0" is not greater than "0"
            	Test:       	TestDiskHealthChecking_Filesystem/remove
            	Messages:   	[]

nicktrav · 2022-06-28T02:06:00Z

Managed to reproduce this on my MacBook. Looks like the same run caught another test failure:

$ go test -mod=vendor -tags 'invariants' -exec 'stress -p 1' -timeout 0 -test.v -run TestDiskHealthChecking_Filesystem ./vfs
...
/var/folders/ps/wznnm1sn4_g1rn8jjlgx57rh0000gq/T/go-stress-20220627T170345-028551772
=== RUN   TestDiskHealthChecking_Filesystem
=== RUN   TestDiskHealthChecking_Filesystem/rename
=== RUN   TestDiskHealthChecking_Filesystem/reuse-for-write
=== RUN   TestDiskHealthChecking_Filesystem/create
=== RUN   TestDiskHealthChecking_Filesystem/link
=== RUN   TestDiskHealthChecking_Filesystem/mkdir-all
    assertion_compare.go:313:
                Error Trace:    disk_health_test.go:276
                Error:          "0" is not greater than "0"
                Test:           TestDiskHealthChecking_Filesystem/mkdir-all
                Messages:       []
=== RUN   TestDiskHealthChecking_Filesystem/remove
    assertion_compare.go:313:
                Error Trace:    disk_health_test.go:276
                Error:          "0" is not greater than "0"
                Test:           TestDiskHealthChecking_Filesystem/remove
                Messages:       []
=== RUN   TestDiskHealthChecking_Filesystem/remove-all
--- FAIL: TestDiskHealthChecking_Filesystem (0.56s)
    --- PASS: TestDiskHealthChecking_Filesystem/rename (0.11s)
    --- PASS: TestDiskHealthChecking_Filesystem/reuse-for-write (0.05s)
    --- PASS: TestDiskHealthChecking_Filesystem/create (0.08s)
    --- PASS: TestDiskHealthChecking_Filesystem/link (0.08s)
    --- FAIL: TestDiskHealthChecking_Filesystem/mkdir-all (0.07s)
    --- FAIL: TestDiskHealthChecking_Filesystem/remove (0.07s)
    --- PASS: TestDiskHealthChecking_Filesystem/remove-all (0.05s)
=== RUN   TestDiskHealthChecking_Filesystem_Close
--- PASS: TestDiskHealthChecking_Filesystem_Close (0.15s)
FAIL


ERROR: exit status 1

34m5s: 3902 runs so far, 2 failures (0.05%)

This doesn't happen consistently / frequently enough to warrant too much investigation right now, but it's good to know that this is somewhat reproducible.

nicktrav · 2022-11-09T19:17:46Z

https://github.com/cockroachdb/pebble/actions/runs/3431084191/jobs/5718804698

nicktrav · 2023-01-12T00:11:47Z

Haven't seen this one before:

--- FAIL: TestDiskHealthChecking_Filesystem_Close (0.17s)
    disk_health_test.go:310: 
        	Error Trace:	disk_health_test.go:310
        	Error:      	map[string]time.Duration{"bar":40320000, "foo":56489000} does not contain "bax"
        	Test:       	TestDiskHealthChecking_Filesystem_Close
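
The commit message further down this thread attributes part of this flake to thread-unsafe use of a map in the test. A plausible reading of the failure above is that the assertion ran before the "bax" entry was recorded or visible. The following is a minimal sketch of a safer pattern, not the code from the fix: `stallLog`, `record`, `names`, and `collectStalls` are hypothetical names, and the recorded duration is a placeholder. Every map access goes through a mutex, and the reader waits for all writers before inspecting the result.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// stallLog is a hypothetical recorder for stall callbacks that may fire on
// several goroutines at once. The mutex guards every access to the map.
type stallLog struct {
	mu     sync.Mutex
	stalls map[string]time.Duration
}

func newStallLog() *stallLog {
	return &stallLog{stalls: make(map[string]time.Duration)}
}

// record is what a stall callback would call when it observes a stall.
func (l *stallLog) record(name string, d time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.stalls[name] = d
}

// names returns the recorded file names in sorted order.
func (l *stallLog) names() []string {
	l.mu.Lock()
	defer l.mu.Unlock()
	out := make([]string, 0, len(l.stalls))
	for name := range l.stalls {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

// collectStalls simulates one callback goroutine per file and returns only
// after every goroutine has recorded its entry.
func collectStalls(files []string) []string {
	sl := newStallLog()
	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sl.record(name, 50*time.Millisecond) // placeholder duration
		}(f)
	}
	wg.Wait() // do not read the map until all writers are done
	return sl.names()
}

func main() {
	fmt.Println(collectStalls([]string{"foo", "bar", "bax"})) // prints: [bar bax foo]
}
```

Running a test built this way under `go test -race` should also surface any remaining unsynchronized access.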

joshimhoff · 2023-01-25T21:36:44Z

I hit this once when developing #2255. Logs: https://github.com/cockroachdb/pebble/actions/runs/3999009992/jobs/6862370894.

bananabrick · 2023-05-17T19:39:26Z

Another one:

2023-05-17T19:32:54.5893691Z === RUN   TestDiskHealthChecking_Filesystem_Close
2023-05-17T19:32:54.5894097Z     disk_health_test.go:536: 
2023-05-17T19:32:54.5894560Z         	Error Trace:	disk_health_test.go:536
2023-05-17T19:32:54.5895267Z         	Error:      	map[string]time.Duration{"bar":47563100, "foo":47101600} does not contain "bax"
2023-05-17T19:32:54.5896012Z         	Test:       	TestDiskHealthChecking_Filesystem_Close

RaduBerinde · 2023-06-26T20:32:17Z

One on Windows:
https://github.com/cockroachdb/pebble/actions/runs/5382449219/jobs/9767882012?pr=2673

Previously we relied on sleeps and other timing-based synchronization between the disk health checking goroutine observing a stall and the test code confirming it. This change synchronizes the two events more directly through a channel, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test previously used a map in a relatively thread-unsafe way, which increased the chances of a flake. Fixes cockroachdb#1718.

Previously we relied on sleeps and other timing-based synchronization between the disk health checking goroutine observing a stall and the test code confirming it. This change synchronizes the two events more directly through channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test previously used a map in a relatively thread-unsafe way, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Also deletes the TestDiskHealthChecking_File_* variants that tested for the absence of false positives: given the tiny stall detection thresholds in the test, spurious stalls are bound to happen on busy nodes. Fixes cockroachdb#1718.

Previously we relied on sleeps and other timing-based synchronization between the disk health checking goroutine observing a stall and the test code confirming it. This change synchronizes the two events more directly through channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test previously used a map in a relatively thread-unsafe way, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Fixes cockroachdb#1718.

Previously we relied on sleeps and other timing-based synchronization between the disk health checking goroutine observing a stall and the test code confirming it. This change synchronizes the two events more directly through channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test previously used a map in a relatively thread-unsafe way, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Also selectively skips, on Windows only, two tests that flaked almost exclusively there. Fixes cockroachdb#1718.

Previously we relied on sleeps and other timing-based synchronization between the disk health checking goroutine observing a stall and the test code confirming it. This change synchronizes the two events more directly through channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test previously used a map in a relatively thread-unsafe way, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Removes some tests that asserted the absence of false-positive disk stalls; at the small thresholds that were used, scheduler delays alone can cause perceived disk stalls. Also selectively skips, on Windows only, two tests that flaked almost exclusively there. Fixes cockroachdb#1718.
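
To illustrate the channel-based synchronization these commit messages describe, here is a minimal sketch under stated assumptions. It is not the Pebble implementation: `stallChecker`, `stallEvent`, `observeStall`, and `stallThreshold` are hypothetical names, and the checker is reduced to a single timer per operation. The point is that the simulated disk operation stays blocked until the checker reports the stall over a channel, so the test never sleeps and never races the checker goroutine.

```go
package main

import (
	"fmt"
	"time"
)

// stallThreshold is a hypothetical, deliberately tiny detection threshold.
const stallThreshold = time.Millisecond

// stallEvent describes one detected stall.
type stallEvent struct {
	op       string
	duration time.Duration
}

// stallChecker is a hypothetical stand-in for a disk health checker. It
// reports an operation through onStall if the operation is still running
// once threshold has elapsed.
type stallChecker struct {
	threshold time.Duration
	onStall   func(stallEvent)
}

// run executes fn while a watcher goroutine times it.
func (c *stallChecker) run(op string, fn func()) {
	start := time.Now()
	done := make(chan struct{})
	go func() {
		timer := time.NewTimer(c.threshold)
		defer timer.Stop()
		select {
		case <-timer.C:
			c.onStall(stallEvent{op: op, duration: time.Since(start)})
		case <-done:
			// The operation finished before the threshold: no stall.
		}
	}()
	fn()
	close(done)
}

// observeStall blocks the simulated operation until the checker has reported
// the stall, so the result does not depend on goroutine scheduling.
func observeStall(op string) stallEvent {
	stalled := make(chan stallEvent, 1) // buffered so the callback never blocks
	c := &stallChecker{
		threshold: stallThreshold,
		onStall:   func(e stallEvent) { stalled <- e },
	}
	var got stallEvent
	c.run(op, func() {
		got = <-stalled // the "disk operation" hangs until the stall is seen
	})
	return got
}

func main() {
	e := observeStall("remove")
	fmt.Println(e.op, e.duration >= stallThreshold) // prints: remove true
}
```

Compared with sleeping for a fixed interval and then checking a counter, this shape cannot fail merely because the checker goroutine was scheduled late; a late checker only makes the test slower.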

nicktrav mentioned this issue Jun 28, 2022

vfs: flaky test TestDiskHealthChecking_Filesystem_Close on windows #1703

Closed

jbowens mentioned this issue Aug 16, 2022

windows: test flakes / failures due to out of memory #1897

Closed

nicktrav mentioned this issue Jan 25, 2023

TestDiskHealthChecking_Filesystem/remove might be flaky #2272

Closed

jbowens mentioned this issue Jul 10, 2023

TestDiskHealthChecking_Filesystem_Close flaked #2716

Closed

itsbilal self-assigned this Jul 11, 2023

itsbilal mentioned this issue Jul 12, 2023

vfs: Deflake TestDiskHealthChecking_File* #2734

Merged

itsbilal closed this as completed in #2734 Jul 12, 2023

jbowens added this to [Deprecated] Storage Jun 4, 2024

jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024
