-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vfs: flaky test TestDiskHealthChecking_Filesystem #1718
Comments
Possibly related to #1703. |
Having trouble reproducing locally. Maybe in the shared-cpu environment of CI, the goroutine scheduling is just less predictable and the ticker routine just doesn't get scheduled within the 50ms :/
|
Another one on macOS: https://github.com/cockroachdb/pebble/runs/7083257581?check_suite_focus=true --- FAIL: TestDiskHealthChecking_Filesystem (0.44s)
--- FAIL: TestDiskHealthChecking_Filesystem/remove (0.05s)
assertion_compare.go:313:
Error Trace: disk_health_test.go:276
Error: "0" is not greater than "0"
Test: TestDiskHealthChecking_Filesystem/remove
Messages: [] |
Managed to reproduce this on my macbook. Looks like the same run caught another test failure: $ go test -mod=vendor -tags 'invariants' -exec 'stress -p 1' -timeout 0 -test.v -run TestDiskHealthChecking_Filesystem ./vfs
...
/var/folders/ps/wznnm1sn4_g1rn8jjlgx57rh0000gq/T/go-stress-20220627T170345-028551772
=== RUN TestDiskHealthChecking_Filesystem
=== RUN TestDiskHealthChecking_Filesystem/rename
=== RUN TestDiskHealthChecking_Filesystem/reuse-for-write
=== RUN TestDiskHealthChecking_Filesystem/create
=== RUN TestDiskHealthChecking_Filesystem/link
=== RUN TestDiskHealthChecking_Filesystem/mkdir-all
assertion_compare.go:313:
Error Trace: disk_health_test.go:276
Error: "0" is not greater than "0"
Test: TestDiskHealthChecking_Filesystem/mkdir-all
Messages: []
=== RUN TestDiskHealthChecking_Filesystem/remove
assertion_compare.go:313:
Error Trace: disk_health_test.go:276
Error: "0" is not greater than "0"
Test: TestDiskHealthChecking_Filesystem/remove
Messages: []
=== RUN TestDiskHealthChecking_Filesystem/remove-all
--- FAIL: TestDiskHealthChecking_Filesystem (0.56s)
--- PASS: TestDiskHealthChecking_Filesystem/rename (0.11s)
--- PASS: TestDiskHealthChecking_Filesystem/reuse-for-write (0.05s)
--- PASS: TestDiskHealthChecking_Filesystem/create (0.08s)
--- PASS: TestDiskHealthChecking_Filesystem/link (0.08s)
--- FAIL: TestDiskHealthChecking_Filesystem/mkdir-all (0.07s)
--- FAIL: TestDiskHealthChecking_Filesystem/remove (0.07s)
--- PASS: TestDiskHealthChecking_Filesystem/remove-all (0.05s)
=== RUN TestDiskHealthChecking_Filesystem_Close
--- PASS: TestDiskHealthChecking_Filesystem_Close (0.15s)
FAIL
ERROR: exit status 1
34m5s: 3902 runs so far, 2 failures (0.05%) This doesn't happen consistently / frequently enough to warrant too much investigation right now, but it's good to know that this is somewhat reproducible. |
Haven't seen this one before: --- FAIL: TestDiskHealthChecking_Filesystem_Close (0.17s)
disk_health_test.go:310:
Error Trace: disk_health_test.go:310
Error: map[string]time.Duration{"bar":40320000, "foo":56489000} does not contain "bax"
Test: TestDiskHealthChecking_Filesystem_Close |
I hit this once when developing #2255. Logs: https://github.com/cockroachdb/pebble/actions/runs/3999009992/jobs/6862370894. |
Another one:
|
Previously we were relying on sleeps and timing-based ways of synchronization between observing a stall in the disk health checking goroutine and confirming for it in the test code itself. This change adds a more direct synchronization between the two events through the use of a channel, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test was previously doing a relatively thread-unsafe use of a map, which increased the chances of a flake. Fixes cockroachdb#1718.
Previously we were relying on sleeps and timing-based ways of synchronization between observing a stall in the disk health checking goroutine and confirming for it in the test code itself. This change adds a more direct synchronization between the two events through the use of channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test was previously doing a relatively thread-unsafe use of a map, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Also deletes TestDiskHealthChecking_File_* variants that tested for lack of false positives, as given the tiny stall detection thresholds in the test, spurious stalls are bound to happen on busy nodes. Fixes cockroachdb#1718.
Previously we were relying on sleeps and timing-based ways of synchronization between observing a stall in the disk health checking goroutine and confirming for it in the test code itself. This change adds a more direct synchronization between the two events through the use of channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test was previously doing a relatively thread-unsafe use of a map, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Fixes cockroachdb#1718.
Previously we were relying on sleeps and timing-based ways of synchronization between observing a stall in the disk health checking goroutine and confirming for it in the test code itself. This change adds a more direct synchronization between the two events through the use of channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test was previously doing a relatively thread-unsafe use of a map, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Fixes cockroachdb#1718.
Previously we were relying on sleeps and timing-based ways of synchronization between observing a stall in the disk health checking goroutine and confirming for it in the test code itself. This change adds a more direct synchronization between the two events through the use of channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test was previously doing a relatively thread-unsafe use of a map, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Also selectively skips two tests on windows that almost exclusively just flaked there. Fixes cockroachdb#1718.
Previously we were relying on sleeps and timing-based ways of synchronization between observing a stall in the disk health checking goroutine and confirming for it in the test code itself. This change adds a more direct synchronization between the two events through the use of channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test was previously doing a relatively thread-unsafe use of a map, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Removes some tests that tested for false positives on disk stalls, even though scheduler delays can also cause perceived disk stalls at the small thresholds that were used. Also selectively skips two tests on windows that almost exclusively just flaked there. Fixes cockroachdb#1718.
Previously we were relying on sleeps and timing-based ways of synchronization between observing a stall in the disk health checking goroutine and confirming for it in the test code itself. This change adds a more direct synchronization between the two events through the use of channels, to deflake both tests. Furthermore, the TestDiskHealthChecking_Filesystem_Close test was previously doing a relatively thread-unsafe use of a map, which increased the chances of a flake. Also closes diskHealthCheckingFiles created in some Create operations to prevent goroutine leaks. Removes some tests that tested for false positives on disk stalls, even though scheduler delays can also cause perceived disk stalls at the small thresholds that were used. Also selectively skips two tests on windows that almost exclusively just flaked there. Fixes #1718.
See: https://github.com/cockroachdb/pebble/runs/6563748979?check_suite_focus=true
The text was updated successfully, but these errors were encountered: