Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sharness: add a per-test global timeout option #2833

Closed
wants to merge 1 commit into from

Conversation

trws
Copy link
Member

@trws trws commented Mar 13, 2020

After recent frustration with a variety of hangs, this provides a new
environment variable FLUX_TEST_TIMEOUT that provides each individual
top-level sharness test (test_expect_success or similar) with an
individual timeout of that value in seconds. After that many seconds
the sharness script receives a signal, prints an error, kills the
current command, and proceeds to subsequent tests. The mechanism for
this is a little bit awkward because of the way that sharness uses eval
to execute tests, but has proven reliable in my tests so far.

@trws trws requested a review from grondo March 13, 2020 18:11
@trws trws force-pushed the global-timeout branch 2 times, most recently from cf4601c to e95a70a Compare March 13, 2020 19:31
@grondo
Copy link
Contributor

grondo commented Mar 13, 2020

The one problem with this is that we essentially vendor in sharness.sh. If we take this approach we might want to first submit the changes upstream, then pull in the new sharness.sh at some point.

sharness does allow extensions to be placed in sharness.d/, I wonder if we could get a similar effect by doing something in there. Or, we could use that facility to at least put a timeout on an entire sharness test.t test script..

@trws
Copy link
Member Author

trws commented Mar 13, 2020

I could probably monkey-patch _test_run from sharness.d, rather rename it to something else and wrap it with a replacement of my own, if that would be more palatable?

@trws
Copy link
Member Author

trws commented Mar 13, 2020

This is one possible rework @grondo, there's no way to portably wrap the function, so it just flat replaces it from flux-sharness.sh.

The other option is to make wrappers for all of the test_expect_* functions that provide this by wrapping them, then have a little fun with sed. What do you think?

@grondo
Copy link
Contributor

grondo commented Mar 14, 2020

It is pretty cool, but I still feel like it might be fraught with peril to replace an explicitly underscored function from sharness.sh. Since the main use case is timing out tests that would otherwise cause a hang in make check, could we instead optionally add something to our one-off config/tap-driver.sh?

I know it is a bummer that we can't time out individual test_expect_*, but timing out entire set of tests isn't too much worse is it?

@grondo
Copy link
Contributor

grondo commented Mar 14, 2020

I wouldn't want you to spend any more time trying to jump through hoops wrapping every test_expect_* function. If you feel strongly we need to time out individual tests then the original proposal was probably more acceptable. If/when we pull in a new sharness.sh, care will have to be taken to not drop the changes.

I'm still wondering if a simpler approach would still solve the immediate problem though. We'd know at least what "tNNN-*.t" test caused the hang...

@grondo
Copy link
Contributor

grondo commented Mar 14, 2020

@Mergifyio rebase

@mergify
Copy link
Contributor

mergify bot commented Mar 14, 2020

Command rebase: success

Branch has been successfully rebased

@grondo
Copy link
Contributor

grondo commented Mar 14, 2020

Not sure what is up, but the travis hangs seem to way worse in the last 24hrs, so I think I'd be inclined to accept your original proposal here @trws. For now, I'm going to rebase the current work a few times and see if we get failures to help diagnose the current issues.

@codecov-io
Copy link

Codecov Report

Merging #2833 into master will not change coverage by %.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #2833   +/-   ##
=======================================
  Coverage   81.00%   81.00%           
=======================================
  Files         250      250           
  Lines       39511    39511           
=======================================
  Hits        32004    32004           
  Misses       7507     7507           
Impacted Files Coverage Δ
src/common/libsubprocess/local.c 79.59% <0.00%> (-0.35%) ⬇️
src/broker/broker.c 74.08% <0.00%> (-0.11%) ⬇️
src/modules/job-info/guest_watch.c 77.29% <0.00%> (+0.57%) ⬆️

@grondo
Copy link
Contributor

grondo commented Mar 14, 2020

Unfortunately the current version doesn't work on centosX builders. bash/dash difference?

ERROR: t0000-sharness
=====================
./t0000-sharness.t: line 29:  8650 Terminated              ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do
    sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then
        break;
    fi; i=$(($i+1));
done; kill -ALRM $$ )
./t0000-sharness.t: line 33:  8652 Terminated              ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do
    sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then
        break;
    fi; i=$(($i+1));
done; kill -ALRM $$ )
./t0000-sharness.t: line 36:  8655 Terminated              ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do
    sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then
        break;
    fi; i=$(($i+1));
done; kill -ALRM $$ )
ok 1 - sourcing sharness succeeds
PASS: t0000-sharness.t 1 - sourcing sharness succeeds
ok 2 - success is reported like this
PASS: t0000-sharness.t 2 - success is reported like this
Can't locate IO/Pty.pm in @INC (you may need to install the IO::Pty module) (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at /usr/src/t/test-terminal.perl line 5.
BEGIN failed--compilation aborted at /usr/src/t/test-terminal.perl line 5.
not ok 3 - pretend we have a known breakage # TODO known breakage
XFAIL: t0000-sharness.t 3 - pretend we have a known breakage # TODO known breakage
not ok 4 - pretend we have a fully passing test suite
FAIL: t0000-sharness.t 4 - pretend we have a fully passing test suite
#	
#		run_sub_test_lib_test full-pass '3 passing tests' <<-\EOF &&
#		for i in 1 2 3
#		do
#			test_expect_success "passing test #$i" 'true'
#		done
#		test_done
#		EOF
#		check_sub_test_lib_test full-pass <<-\EOF
#		> ok 1 - passing test #1
#		> ok 2 - passing test #2
#		> ok 3 - passing test #3
#		> # passed all 3 test(s)
#		> 1..3
#		EOF
#	
./sharness.sh: line 277:  8732 Terminated              ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do
    sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then
        break;
    fi; i=$(($i+1));
done; kill -ALRM $$ )

@trws
Copy link
Member Author

trws commented Mar 14, 2020 via email

@grondo
Copy link
Contributor

grondo commented Mar 14, 2020

No rush! Actually many of the tests were failing in the same way, I just pasted the first failure.

After recent frustration with a variety of hangs, this provides a new
environment variable `FLUX_TEST_TIMEOUT` that provides each individual
top-level sharness test (test_expect_success or similar) with an
individual timeout of that value in seconds.  After that many seconds
the sharness script receives a signal, prints an error, kills the
current command, and proceeds to subsequent tests.  The mechanism for
this is a little bit awkward because of the way that sharness uses eval
to execute tests, but has proven reliable in my tests so far.
@trws
Copy link
Member Author

trws commented Mar 15, 2020

Closing this since #2840 landed.

@trws trws closed this Mar 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants