Investigate flaky tests: test_wait_signal, test_wait, test_execve #251

kamalmarhubi · 2016-01-27T22:23:18Z

I've seen these fail occasionally

sys::test_wait::test_wait_signal
test_unistd::test_wait
test_unistd::test_execve

Maybe some others

kamalmarhubi · 2016-01-31T23:02:11Z

Including output from @viraptor's report in #260:

test test_unistd::test_execve ... FAILED
test test_unistd::test_wait ... FAILED

failures:

---- test_unistd::test_execve stdout ----
        thread 'test_unistd::test_execve' panicked at 'called `Result::unwrap()` on an `Err` value: Sys(ECHILD)', ../src/libcore/result.rs:741

---- test_unistd::test_wait stdout ----
        thread 'test_unistd::test_wait' panicked at 'assertion failed: `(left == right)` (left: `Ok(Exited(2837, 0))`, right: `Ok(Exited(2843, 0))`)', test/test_unistd.rs:41


failures:
    test_unistd::test_execve
    test_unistd::test_wait

kamalmarhubi · 2016-02-27T17:40:05Z

new flaky test:

test sys::signal::tests::test_sigwait ... Process didn't exit successfully: `/Users/travis/build/kamalmarhubi/nix-rust/target/x86_64-apple-darwin/debug/nix-3f7e4935a20dbf98` (signal: 30)

kamalmarhubi · 2016-03-05T20:09:32Z

@posborne noted in #292:

I was mostly looking at the test_unistd failures which make calls out
to fork() and then make subsequent calls to wait(). In that case there
is one parent and the wait() called could (and frequently does) get some
random child pid back because it just happened to terminate. That is
why when one of the tests fails so does the other one.

testing: increase stability by removing thread parallelism

Currently, several of the tests are failing intermittently. After some research, it appears that these failures only occur when thread parallelism is enabled (as is the case by default).

To test, I just ran the failing tests over and over. I would consistently see errors when running the following:

    $ while true; do target/debug/test-7ec4d9681e812f6a; done

When I forced single-threaded execution, I no longer saw failures:

    $ while true; do RUST_TEST_THREADS=1 target/debug/test-7ec4d9681e812f6a; done

I was mostly looking at the test_unistd failures, which make calls out to fork() and then make subsequent calls to wait(). In that case there is one parent, and the wait() call could (and frequently does) get some random child pid back because that child just happened to terminate. That is why when one of the tests fails so does the other one.

I couldn't think of an obvious fix other than preventing thread parallelism in the short term. The tests still run very quickly.

#251

Signed-off-by: Paul Osborne <[email protected]>

posborne · 2016-03-06T00:30:11Z

test_sigwait still seems to fail sometimes even with single-threaded execution. Must be a separate problem.

kamalmarhubi · 2016-03-08T23:53:22Z

Thoughts on these issues:

I was mostly looking at the test_unistd failures which make calls out
to fork() and then make subsequent calls to wait(). In that case there
is one parent and the wait() called could (and frequently does) get some
random child pid back because it just happened to terminate.

I'm going to make them wait on the child they spawn instead of any old child.

I'm still not sure what is going on in test_sigwait. I couldn't see any failures on my Linux machine when running single-threaded.
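As a rough illustration of the "wait on the child they spawn" idea, the sketch below uses only the standard library, whose `std::process::Child::wait` waits on that one child rather than on whichever child exits first. This is not the nix test code: the helpers `spawn_exiting_with` and `exit_code_of` are hypothetical names made up for the example, and it assumes a Unix system with `sh` on the PATH.

```rust
use std::process::{Child, Command};

/// Hypothetical helper: spawn a shell that exits with `code`.
/// Assumes `sh` is available on the PATH.
fn spawn_exiting_with(code: i32) -> Child {
    Command::new("sh")
        .arg("-c")
        .arg(format!("exit {}", code))
        .spawn()
        .expect("failed to spawn sh")
}

/// Wait for one specific child and return its exit code.
fn exit_code_of(mut child: Child) -> Option<i32> {
    child.wait().expect("wait failed").code()
}

fn main() {
    // Two children alive at once, as when two tests run in parallel.
    let first = spawn_exiting_with(3);
    let second = spawn_exiting_with(7);

    // Each wait targets its own child, so the order in which the
    // children terminate does not affect the result.
    assert_eq!(exit_code_of(second), Some(7));
    assert_eq!(exit_code_of(first), Some(3));
}
```

In the nix tests themselves, the analogous change would presumably be to pass the pid returned by fork() to `waitpid` instead of calling `wait`.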

dhylands · 2016-03-09T00:10:56Z

Those errors (mentioned on Jan 31) look like the ones I saw as well.

kamalmarhubi · 2016-03-09T00:13:20Z

@posborne

Calling the actual system calls only lets us test the happy path (and even then not always as consistently as we would like).

I have some ideas on this as well, but I'll check your link. I think it's a separate issue from the flaky / intermittent failures though.

dhylands · 2016-03-09T00:28:32Z

I'll bet it's a race where the SIGUSR1 is being delivered before the wait is called. The error:

build/kamalmarhubi/nix-rust/target/x86_64-apple-darwin/debug/nix-3f7e4935a20dbf98` (signal: 30)

seems to be saying that the process was killed by SIGUSR1. If a context switch happened after the raise and before the call to wait, then it's entirely possible that the signal gets processed before the wait, and since no one is yet waiting for the signal, it terminates the process.

But I'm really unfamiliar with the OSX kernel stuff, so I could be spouting nonsense.

kamalmarhubi · 2016-03-09T00:35:42Z

I opened #303 for the more general how-to-test issue.

kamalmarhubi · 2016-03-09T00:37:38Z

@dhylands that sounds plausible! Details of signals are outside my knowledge sphere. I know @fiveop was doing various signal-related changes though and so might have some more ideas.

fiveop · 2016-03-11T14:03:23Z

To me it looks like the test does not run in a single threaded process. I thought we changed the test behaviour to guarantee that?

kamalmarhubi · 2016-03-11T16:57:34Z

@fiveop yeah in #292.

638: Make aio, chdir, and wait tests thread safe r=Susurrus

Fix thread safety issues in aio, chdir, and wait tests

They have four problems:

* The chdir tests change the process's cwd, which is global. Protect them all with a mutex.
* The wait tests will reap any subprocess, and several tests create subprocesses. Protect them all with a mutex so only one subprocess-creating test will run at a time.
* When a multithreaded test forks, the child process can sometimes block in the stack unwinding code. It blocks on a mutex that was held by a different thread in the parent, but that thread doesn't exist in the child, so a deadlock results. Fix this by immediately calling `std::process::exit` in the child processes.
* My previous attempt at thread safety in the aio tests didn't work, because anonymous MutexGuards drop immediately. Fix this by naming the SIGUSR2_MTX MutexGuards.

Fixes #251

kamalmarhubi mentioned this issue Jan 31, 2016

Occasional wait test fail #260

Closed

posborne mentioned this issue Mar 5, 2016

testing: increase stability by removing thread parallelism #292

Merged

kamalmarhubi added the A-testing label Mar 6, 2016

kamalmarhubi mentioned this issue Mar 8, 2016

Add gettid #293

Merged

asomers mentioned this issue Jul 16, 2017

Make aio, chdir, and wait tests thread safe #638

Merged

asomers self-assigned this Jul 16, 2017

bors bot closed this as completed in #638 Jul 18, 2017
