sharness: add a per-test global timeout option #2833

trws · 2020-03-13T18:11:24Z

After recent frustration with a variety of hangs, this provides a new
environment variable FLUX_TEST_TIMEOUT that provides each individual
top-level sharness test (test_expect_success or similar) with an
individual timeout of that value in seconds. After that many seconds
the sharness script receives a signal, prints an error, kills the
current command, and proceeds to subsequent tests. The mechanism for
this is a little bit awkward because of the way that sharness uses eval
to execute tests, but has proven reliable in my tests so far.

grondo · 2020-03-13T21:33:55Z

The one problem with this is that we essentially vendor in sharness.sh. If we take this approach we might want to first submit the changes upstream, then pull in the new sharness.sh at some point.

sharness does allow extensions to be placed in sharness.d/, I wonder if we could get a similar effect by doing something in there. Or, we could use that facility to at least put a timeout on an entire sharness test.t test script..

trws · 2020-03-13T21:35:33Z

I could probably monkey-patch _test_run from sharness.d, rather rename it to something else and wrap it with a replacement of my own, if that would be more palatable?

trws · 2020-03-13T23:50:22Z

This is one possible rework @grondo, there's no way to portably wrap the function, so it just flat replaces it from flux-sharness.sh.

The other option is to make wrappers for all of the test_expect_* functions that provide this by wrapping them, then have a little fun with sed. What do you think?

grondo · 2020-03-14T02:10:15Z

It is pretty cool, but I still feel like it might be fraught with peril to replace an explicitly underscored function from sharness.sh. Since the main use case is timing out tests that would otherwise cause a hang in make check, could we instead optionally add something to our one-off config/tap-driver.sh?

I know it is a bummer that we can't time out individual test_expect_*, but timing out entire set of tests isn't too much worse is it?

grondo · 2020-03-14T02:22:49Z

I wouldn't want you to spend any more time trying to jump through hoops wrapping every test_expect_* function. If you feel strongly we need to time out individual tests then the original proposal was probably more acceptable. If/when we pull in a new sharness.sh, care will have to be taken to not drop the changes.

I'm still wondering if a simpler approach would still solve the immediate problem though. We'd know at least what "tNNN-*.t" test caused the hang...

grondo · 2020-03-14T16:21:57Z

@Mergifyio rebase

mergify · 2020-03-14T16:22:34Z

Command rebase: success

Branch has been successfully rebased

grondo · 2020-03-14T16:25:54Z

Not sure what is up, but the travis hangs seem to way worse in the last 24hrs, so I think I'd be inclined to accept your original proposal here @trws. For now, I'm going to rebase the current work a few times and see if we get failures to help diagnose the current issues.

codecov-io · 2020-03-14T16:47:26Z

Codecov Report

Merging #2833 into master will not change coverage by %.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #2833   +/-   ##
=======================================
  Coverage   81.00%   81.00%           
=======================================
  Files         250      250           
  Lines       39511    39511           
=======================================
  Hits        32004    32004           
  Misses       7507     7507

Impacted Files	Coverage Δ
src/common/libsubprocess/local.c	`79.59% <0.00%> (-0.35%)`	⬇️
src/broker/broker.c	`74.08% <0.00%> (-0.11%)`	⬇️
src/modules/job-info/guest_watch.c	`77.29% <0.00%> (+0.57%)`	⬆️

grondo · 2020-03-14T17:01:06Z

Unfortunately the current version doesn't work on centosX builders. bash/dash difference?

ERROR: t0000-sharness
=====================
./t0000-sharness.t: line 29:  8650 Terminated              ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do
    sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then
        break;
    fi; i=$(($i+1));
done; kill -ALRM $$ )
./t0000-sharness.t: line 33:  8652 Terminated              ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do
    sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then
        break;
    fi; i=$(($i+1));
done; kill -ALRM $$ )
./t0000-sharness.t: line 36:  8655 Terminated              ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do
    sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then
        break;
    fi; i=$(($i+1));
done; kill -ALRM $$ )
ok 1 - sourcing sharness succeeds
PASS: t0000-sharness.t 1 - sourcing sharness succeeds
ok 2 - success is reported like this
PASS: t0000-sharness.t 2 - success is reported like this
Can't locate IO/Pty.pm in @INC (you may need to install the IO::Pty module) (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at /usr/src/t/test-terminal.perl line 5.
BEGIN failed--compilation aborted at /usr/src/t/test-terminal.perl line 5.
not ok 3 - pretend we have a known breakage # TODO known breakage
XFAIL: t0000-sharness.t 3 - pretend we have a known breakage # TODO known breakage
not ok 4 - pretend we have a fully passing test suite
FAIL: t0000-sharness.t 4 - pretend we have a fully passing test suite
#	
#		run_sub_test_lib_test full-pass '3 passing tests' <<-\EOF &&
#		for i in 1 2 3
#		do
#			test_expect_success "passing test #$i" 'true'
#		done
#		test_done
#		EOF
#		check_sub_test_lib_test full-pass <<-\EOF
#		> ok 1 - passing test #1
#		> ok 2 - passing test #2
#		> ok 3 - passing test #3
#		> # passed all 3 test(s)
#		> 1..3
#		EOF
#	
./sharness.sh: line 277:  8732 Terminated              ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do
    sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then
        break;
    fi; i=$(($i+1));
done; kill -ALRM $$ )

trws · 2020-03-14T21:08:44Z

I’ll check it out if I can get my daughter to nap. That failure looks really peculiar though, I wonder why Perl would be failing...

…

Sent from my iPhone On Mar 14, 2020, at 10:01 AM, Mark Grondona <[email protected]> wrote: Unfortunately the current version doesn't work on centosX builders. bash/dash difference? ERROR: t0000-sharness ===================== ./t0000-sharness.t: line 29: 8650 Terminated ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then break; fi; i=$(($i+1)); done; kill -ALRM $$ ) ./t0000-sharness.t: line 33: 8652 Terminated ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then break; fi; i=$(($i+1)); done; kill -ALRM $$ ) ./t0000-sharness.t: line 36: 8655 Terminated ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then break; fi; i=$(($i+1)); done; kill -ALRM $$ ) ok 1 - sourcing sharness succeeds PASS: t0000-sharness.t 1 - sourcing sharness succeeds ok 2 - success is reported like this PASS: t0000-sharness.t 2 - success is reported like this Can't locate IO/Pty.pm in @inc (you may need to install the IO::Pty module) (@inc contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at /usr/src/t/test-terminal.perl line 5. BEGIN failed--compilation aborted at /usr/src/t/test-terminal.perl line 5. not ok 3 - pretend we have a known breakage # TODO known breakage XFAIL: t0000-sharness.t 3 - pretend we have a known breakage # TODO known breakage not ok 4 - pretend we have a fully passing test suite FAIL: t0000-sharness.t 4 - pretend we have a fully passing test suite # # run_sub_test_lib_test full-pass '3 passing tests' <<-\EOF && # for i in 1 2 3 # do # test_expect_success "passing test #$i" 'true' # done # test_done # EOF # check_sub_test_lib_test full-pass <<-\EOF # > ok 1 - passing test #1 # > ok 2 - passing test #2 # > ok 3 - passing test #3 # > # passed all 3 test(s) # > 1..3 # EOF # ./sharness.sh: line 277: 8732 Terminated ( set -e; parent_pid=$$; i=0; while kill -0 $parent_pid; do sleep 1; if test "$i" -gt ${FLUX_TEST_TIMEOUT:-120}; then break; fi; i=$(($i+1)); done; kill -ALRM $$ ) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#2833 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAFBFNIT6DKB732ANOFHAYLRHOZ57ANCNFSM4LHHYK2Q>.

grondo · 2020-03-14T21:35:30Z

No rush! Actually many of the tests were failing in the same way, I just pasted the first failure.

After recent frustration with a variety of hangs, this provides a new environment variable `FLUX_TEST_TIMEOUT` that provides each individual top-level sharness test (test_expect_success or similar) with an individual timeout of that value in seconds. After that many seconds the sharness script receives a signal, prints an error, kills the current command, and proceeds to subsequent tests. The mechanism for this is a little bit awkward because of the way that sharness uses eval to execute tests, but has proven reliable in my tests so far.

trws · 2020-03-15T23:37:35Z

Closing this since #2840 landed.

trws requested a review from grondo March 13, 2020 18:11

trws force-pushed the global-timeout branch 2 times, most recently from cf4601c to e95a70a Compare March 13, 2020 19:31

grondo force-pushed the global-timeout branch from f724b69 to 5d567ea Compare March 14, 2020 16:22

trws force-pushed the global-timeout branch from 5d567ea to e95a70a Compare March 14, 2020 23:36

trws force-pushed the global-timeout branch from e95a70a to 144c66d Compare March 14, 2020 23:38

trws mentioned this pull request Mar 14, 2020

tests: timeout in automake harness #2840

Merged

trws closed this Mar 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sharness: add a per-test global timeout option #2833

sharness: add a per-test global timeout option #2833

trws commented Mar 13, 2020

grondo commented Mar 13, 2020

trws commented Mar 13, 2020

trws commented Mar 13, 2020

grondo commented Mar 14, 2020 •

edited

Loading

grondo commented Mar 14, 2020 •

edited

Loading

grondo commented Mar 14, 2020

mergify bot commented Mar 14, 2020

grondo commented Mar 14, 2020

codecov-io commented Mar 14, 2020

grondo commented Mar 14, 2020

trws commented Mar 14, 2020 via email

grondo commented Mar 14, 2020

trws commented Mar 15, 2020

sharness: add a per-test global timeout option #2833

sharness: add a per-test global timeout option #2833

Conversation

trws commented Mar 13, 2020

grondo commented Mar 13, 2020

trws commented Mar 13, 2020

trws commented Mar 13, 2020

grondo commented Mar 14, 2020 • edited Loading

grondo commented Mar 14, 2020 • edited Loading

grondo commented Mar 14, 2020

mergify bot commented Mar 14, 2020

grondo commented Mar 14, 2020

codecov-io commented Mar 14, 2020

Codecov Report

grondo commented Mar 14, 2020

trws commented Mar 14, 2020 via email

grondo commented Mar 14, 2020

trws commented Mar 15, 2020

grondo commented Mar 14, 2020 •

edited

Loading

grondo commented Mar 14, 2020 •

edited

Loading