Investigate flaky tests: test_wait_signal, test_wait, test_execve #251

Closed
kamalmarhubi opened this issue Jan 27, 2016 · 12 comments

@kamalmarhubi
Member

I've seen these fail occasionally:

  • sys::test_wait::test_wait_signal
  • test_unistd::test_wait
  • test_unistd::test_execve

Maybe some others

@kamalmarhubi
Member Author

Including output from @viraptor's report in #260:

test test_unistd::test_execve ... FAILED
test test_unistd::test_wait ... FAILED

failures:

---- test_unistd::test_execve stdout ----
        thread 'test_unistd::test_execve' panicked at 'called `Result::unwrap()` on an `Err` value: Sys(ECHILD)', ../src/libcore/result.rs:741

---- test_unistd::test_wait stdout ----
        thread 'test_unistd::test_wait' panicked at 'assertion failed: `(left == right)` (left: `Ok(Exited(2837, 0))`, right: `Ok(Exited(2843, 0))`)', test/test_unistd.rs:41


failures:
    test_unistd::test_execve
    test_unistd::test_wait

@kamalmarhubi
Member Author

New flaky test:

test sys::signal::tests::test_sigwait ... Process didn't exit successfully: `/Users/travis/build/kamalmarhubi/nix-rust/target/x86_64-apple-darwin/debug/nix-3f7e4935a20dbf98` (signal: 30)

posborne added a commit to posborne/nix-rust that referenced this issue Mar 5, 2016
Currently, several of the tests are failing intermittently.  After
some research it appears that these failures only occur when thread
parallelism is enabled (as is the case by default).  To test, I just
ran the failing tests over and over.  I would consistently see errors
when running the following:

    $ while true; do target/debug/test-7ec4d9681e812f6a; done

When I forced single threaded execution, I no longer saw failures:

    $ while true; do RUST_TEST_THREADS=1 target/debug/test-7ec4d9681e812f6a; done

I was mostly looking at the test_unistd failures which make calls out
to fork() and then make subsequent calls to wait().  In that case there
is one parent and the wait() called could (and frequently does) get some
random child pid back because it just happened to terminate.  That is
why when one of the tests fails, so does the other one.

I couldn't think of an obvious fix other than preventing thread
parallelism in the short term.  The tests still run very quickly.

nix-rust#251

Signed-off-by: Paul Osborne <[email protected]>
@kamalmarhubi
Member Author

@posborne noted in #292:

I was mostly looking at the test_unistd failures which make calls out
to fork() and then make subsequent calls to wait(). In that case there
is one parent and the wait() called could (and frequently does) get some
random child pid back because it just happened to terminate. That is
why when one of the tests fails, so does the other one.

homu added a commit that referenced this issue Mar 5, 2016
…hubi

testing: increase stability by removing thread parallelism

@posborne
Member

posborne commented Mar 6, 2016

test_sigwait still seems to fail sometimes, even with single-threaded execution. It must be a separate problem.

@kamalmarhubi
Member Author

Thoughts on these issues:

I was mostly looking at the test_unistd failures which make calls out
to fork() and then make subsequent calls to wait(). In that case there
is one parent and the wait() called could (and frequently does) get some
random child pid back because it just happened to terminate.

I'm going to make them wait on the child they spawn instead of any old child. I'm still not sure what is going on in test_sigwait. I couldn't see any failures on my Linux machine when running single-threaded.
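
A minimal sketch of the "wait on the child you spawned" idea. It deliberately uses `std::process` rather than nix's own `fork`/`wait` wrappers so that it needs no external crate; `Child::wait` waits for that specific process, not for whichever child exits first. The helper names `spawn_exiting` and `wait_on_own_children` are made up for illustration, and the sketch assumes a Unix-like system with `sh` on the PATH.

```rust
use std::io;
use std::process::{Child, Command};

/// Hypothetical helper: spawn a shell that exits immediately with `code`.
/// ASSUMPTION: `sh` is available on the PATH.
fn spawn_exiting(code: i32) -> io::Result<Child> {
    Command::new("sh")
        .arg("-c")
        .arg(format!("exit {}", code))
        .spawn()
}

/// Spawn two children, then wait on each one through its own handle.
/// The waits are issued in the opposite order from the spawns, yet each
/// exit code still comes back attached to the child it belongs to.
fn wait_on_own_children() -> io::Result<(Option<i32>, Option<i32>)> {
    let mut first = spawn_exiting(3)?;
    let mut second = spawn_exiting(7)?;

    let second_status = second.wait()?; // waits for `second` only
    let first_status = first.wait()?; // waits for `first` only

    Ok((first_status.code(), second_status.code()))
}
```

With a plain "wait for any child" call, two tests running in the same process could each reap the other's child, which matches the mismatched pids in the failure output above.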

@dhylands
Contributor

dhylands commented Mar 9, 2016

Those errors (mentioned on Jan 31) look like the ones I saw as well.

@kamalmarhubi
Member Author

@posborne

Calling the actual system calls only lets us test the happy path (and even then not always as consistently as we would like).

I have some ideas on this as well, but I'll check your link. I think it's a separate issue from the flaky / intermittent failures though.

@dhylands
Contributor

dhylands commented Mar 9, 2016

I'll bet it's a race where the SIGUSR1 is being delivered before the wait is called. The error:

build/kamalmarhubi/nix-rust/target/x86_64-apple-darwin/debug/nix-3f7e4935a20dbf98` (signal: 30)

seems to be saying that the process was killed by SIGUSR1. If a context switch happened after the raise and before the call to wait, then it's entirely possible that the signal gets processed before the wait, and since no one is yet waiting for the signal, it terminates the process.
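
To make the ordering concern concrete, here is a rough sketch (not the nix test itself) of the usual way to avoid such a window: block the signal in the calling thread before raising it, so it stays pending until `sigwait` collects it. It calls the C library directly instead of going through nix. The constants and the size of `SigSet` are assumptions that hold for Linux on x86_64 and aarch64 only; other platforms use different values (the log above suggests SIGUSR1 is 30 on macOS). The function name `raise_then_sigwait` is hypothetical.

```rust
use std::os::raw::c_int;

// ASSUMPTION: these values are for Linux on x86_64 / aarch64 only.
const SIGUSR1: c_int = 10;
const SIG_BLOCK: c_int = 0;

/// Opaque storage for a C `sigset_t` (128 bytes on Linux; an assumption).
#[repr(C, align(8))]
struct SigSet([u8; 128]);

// POSIX functions provided by the C library that std already links against.
// On toolchains older than Rust 1.82, write plain `extern "C"` instead.
unsafe extern "C" {
    fn sigemptyset(set: *mut SigSet) -> c_int;
    fn sigaddset(set: *mut SigSet, signum: c_int) -> c_int;
    fn pthread_sigmask(how: c_int, set: *const SigSet, old: *mut SigSet) -> c_int;
    fn raise(sig: c_int) -> c_int;
    fn sigwait(set: *const SigSet, sig: *mut c_int) -> c_int;
}

/// Block SIGUSR1 in this thread, raise it, then collect it with `sigwait`.
/// Returns the signal number that `sigwait` reported.
fn raise_then_sigwait() -> c_int {
    let mut set = SigSet([0; 128]);
    let mut received: c_int = 0;
    unsafe {
        assert_eq!(sigemptyset(&mut set), 0);
        assert_eq!(sigaddset(&mut set, SIGUSR1), 0);
        // Block first: from here on the signal can only become pending
        // for this thread, it cannot run the default (terminating) action.
        assert_eq!(pthread_sigmask(SIG_BLOCK, &set, std::ptr::null_mut()), 0);
        assert_eq!(raise(SIGUSR1), 0);
        // The pending signal is consumed here instead of killing the process.
        assert_eq!(sigwait(&set, &mut received), 0);
    }
    received
}
```

The mask is per thread, so this only helps if the signal is directed at the thread that blocked it. The sketch leaves SIGUSR1 blocked in the calling thread afterwards.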

But I'm really unfamiliar with the OSX kernel stuff, so I could be spouting nonsense.

@kamalmarhubi
Member Author

I opened #303 for the more general how-to-test issue.

@kamalmarhubi
Member Author

@dhylands that sounds plausible! Details of signals are outside my knowledge sphere. I know @fiveop was doing various signal-related changes though and so might have some more ideas.

@fiveop
Contributor

fiveop commented Mar 11, 2016

To me it looks like the test does not run in a single-threaded process. I thought we changed the test behaviour to guarantee that?

@kamalmarhubi
Member Author

@fiveop yeah in #292.

asomers added a commit to asomers/nix that referenced this issue Jul 16, 2017
They have four problems:

* The chdir tests change the process's cwd, which is global.  Protect them
  all with a mutex.

* The wait tests will reap any subprocess, and several tests create
  subprocesses.  Protect them all with a mutex so only one
  subprocess-creating test will run at a time.

* When a multithreaded test forks, the child process can sometimes block in
  the stack unwinding code.  It blocks on a mutex that was held by a
  different thread in the parent, but that thread doesn't exist in the
  child, so a deadlock results.  Fix this by immediately calling
  std::process::exit in the child processes.

* My previous attempt at thread safety in the aio tests didn't work, because
  anonymous MutexGuards drop immediately.  Fix this by naming the
  SIGUSR2_MTX MutexGuards.

Fixes nix-rust#251
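
The last bullet is easy to reproduce in isolation. The following is a small sketch, not the code from the nix test suite: binding a guard to `_` drops it at the end of the statement, while binding it to a name such as `_guard` keeps the lock held until the end of the scope. The function names are hypothetical. Recent compilers reject the `let _ =` form on a lock by default, which is why the first function carries an `allow`.

```rust
use std::sync::Mutex;

/// Binds the guard to `_`: the guard is dropped at the end of that
/// statement, so the lock is already free on the next line.
#[allow(let_underscore_lock)]
fn held_with_anonymous_guard(lock: &Mutex<()>) -> bool {
    let _ = lock.lock().unwrap();
    // `try_lock` succeeds because nothing holds the lock any more.
    let held = lock.try_lock().is_err();
    held
}

/// Binds the guard to a name: the guard lives until the end of the
/// function, so the lock stays held for the whole body.
fn held_with_named_guard(lock: &Mutex<()>) -> bool {
    let _guard = lock.lock().unwrap();
    // `try_lock` fails because `_guard` still holds the lock.
    let held = lock.try_lock().is_err();
    held
}
```

Calling both with the same `Mutex<()>` returns `false` for the anonymous form and `true` for the named one.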
@asomers asomers self-assigned this Jul 16, 2017
bors bot added a commit that referenced this issue Jul 18, 2017
638: Make aio, chdir, and wait tests thread safe r=Susurrus

Fix thread safety issues in aio, chdir, and wait tests
    
They have four problems:
    
* The chdir tests change the process's cwd, which is global.  Protect them all with a mutex.
    
* The wait tests will reap any subprocess, and several tests create subprocesses.  Protect them all with a mutex so only one subprocess-creating test will run at a time.
    
* When a multithreaded test forks, the child process can sometimes block in the stack unwinding code.  It blocks on a mutex that was held by a different thread in the parent, but that thread doesn't exist in the child, so a deadlock results.  Fix this by immediately calling `std::process::exit` in the child processes.
    
* My previous attempt at thread safety in the aio tests didn't work, because anonymous MutexGuards drop immediately.  Fix this by naming the SIGUSR2_MTX MutexGuards.

Fixes #251
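
As an illustration of the first bullet, one way to serialize tests that touch process-global state is a single static mutex taken by every such test. This is a sketch under assumptions, not the code merged in #638: the names `CWD_LOCK` and `with_cwd` are hypothetical, and it uses only the standard library.

```rust
use std::env;
use std::io;
use std::path::Path;
use std::sync::Mutex;

/// Hypothetical process-wide lock: every test that changes the current
/// working directory takes it, so only one such test runs at a time.
static CWD_LOCK: Mutex<()> = Mutex::new(());

/// Run `f` with the working directory set to `dir`, then restore the
/// previous directory. The named guard keeps the lock held for the
/// whole call.
fn with_cwd<T>(dir: &Path, f: impl FnOnce() -> T) -> io::Result<T> {
    // A panic in another test poisons the mutex; the protected data is
    // just `()`, so it is safe to keep going with the inner guard.
    let _guard = CWD_LOCK.lock().unwrap_or_else(|e| e.into_inner());

    let original = env::current_dir()?;
    env::set_current_dir(dir)?;
    let result = f();
    env::set_current_dir(original)?;
    Ok(result)
}
```

If `f` panics, the original directory is not restored in this sketch; a drop guard would be needed to cover that case. The same pattern applies to the wait tests, with one lock shared by every test that creates a subprocess.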
@bors bors bot closed this as completed in #638 Jul 18, 2017