
Increase default test timeout #32973

Closed
orlitzky opened this issue Dec 4, 2021 · 21 comments · Fixed by #36223

Comments

@orlitzky
Contributor

orlitzky commented Dec 4, 2021

Having just run through the default (non --long) test suite for the first time in a while, I've hit...

sage -t --random-seed=... src/sage/manifolds/differentiable/tensorfield.py
    [1019 tests, 364.13 s]

which runs afoul of our default timeout in src/sage/doctest/control.py:

options.timeout = int(os.getenv('SAGE_TIMEOUT', 5 * 60))

There's nothing wrong in that file; it just has a lot of tests. Even after #32967, I think it would be safer to go straight to 7 minutes, because the runtime is more or less nondecreasing.

Component: doctest framework

Author: Michael Orlitzky

Branch/Commit: u/mjo/ticket/32973 @ c3b0124

Issue created by migration from https://trac.sagemath.org/ticket/32973

@orlitzky orlitzky added this to the sage-9.5 milestone Dec 4, 2021
@tornaria
Contributor

tornaria commented Dec 5, 2021

comment:1

What about having a "hard" timeout of 7 minutes, but keep a "soft" timeout of 5m (or less?)

Meaning: tests run to the end for 7 minutes, but files taking more than 5 are reported at the end as "taking too long". In effect, this means the whole file is doctested, but it still shows as an error/warning at the end.
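To make the soft/hard split concrete, here is a minimal sketch of how a runner could flag "too slow" files without killing them early (the names `SOFT_TIMEOUT`, `HARD_TIMEOUT`, `run_file`, and `result.warnings` are hypothetical, not Sage's actual doctest API):

```python
import time

# Hypothetical illustration of a soft/hard timeout split; not Sage's actual runner.
SOFT_TIMEOUT = 5 * 60   # past this, the file is only flagged in the summary
HARD_TIMEOUT = 7 * 60   # past this, the worker is actually killed

def run_with_soft_timeout(run_file, path):
    start = time.monotonic()
    result = run_file(path, timeout=HARD_TIMEOUT)  # only killed at the hard limit
    elapsed = time.monotonic() - start
    if elapsed > SOFT_TIMEOUT:
        # the whole file still gets doctested, but it shows up as a warning at the end
        result.warnings.append(f"{path} took {elapsed:.0f}s (soft limit {SOFT_TIMEOUT}s)")
    return result
```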

@orlitzky
Contributor Author

orlitzky commented Dec 5, 2021

comment:2

Replying to @tornaria:

What about having a "hard" timeout of 7 minutes, but keep a "soft" timeout of 5m (or less?)

Meaning: tests run to the end for 7 minutes, but files taking more than 5 are reported at the end as "taking too long". In effect, this means the whole file is doctested, but it still shows as an error/warning at the end.

I don't think a warning about files that take a long time is too useful, because what would we do to fix them?

We have --warn-long to warn us about individual doctests that take a long time. But the time it takes to test a file is roughly proportional to the number of tests it contains, or (even more roughly) proportional to the length of the file itself. There's nothing wrong with a large file per se, or with a lot of tests, so there's really nothing inherently wrong with a file that takes a long time to test as a whole.

As far as I know, the --timeout value (as opposed to the --warn-long value) is just there to prevent the patchbots from getting stuck in an infinite loop. So it can be excessively high without hurting anything. The value of --warn-long, on the other hand, should be low enough to catch inefficient tests that can be fixed.

So, ultimately, I think we should be setting the default timeouts so that make ptest and make ptestlong reliably pass on the slowest machine that anyone actually uses. I still regularly use a Thinkpad x61s, so I may be that guy, although even that old laptop is x64.

@tornaria
Contributor

tornaria commented Dec 6, 2021

comment:3

What you say makes perfect sense, thanks for the explanation.

I was worried that a slow doctest like in #32964 would be missed. In my experience it only occasionally timed out on i686 and it never gave any warning on x86_64. Maybe this just means the --warn-long default is set too high. It does not seem reasonable that a particular file times out because of a single very slow test, but there's no warning for this single test (I can imagine the culprit being not necessarily the last test that was run.)

@orlitzky
Contributor Author

orlitzky commented Dec 6, 2021

comment:4

It does look like warn_long is set too high when --long is not given; 60s if I'm not mistaken. I'll see if that can't be improved too.

But I think the main reason I found #32964 and others like it is that nobody else is running the test suite without --long. All of the patchbots and GH actions run ptestlong. I accidentally ran it without --long one day, so now I'm finding all the bugs.

@orlitzky
Contributor Author

orlitzky commented Dec 6, 2021

Author: Michael Orlitzky

@orlitzky
Contributor Author

orlitzky commented Dec 6, 2021

Commit: c3b0124

@orlitzky
Contributor Author

orlitzky commented Dec 6, 2021

Branch: u/mjo/ticket/32973

@orlitzky
Contributor Author

orlitzky commented Dec 6, 2021

comment:5

I've gone all the way to 10 minutes. With make ptest on my desktop (AMD Phenom II X4), tensorfield.py already approaches eight minutes (~460s). If I watch a movie or browse the web simultaneously it gets worse.

I still plan to check the timings on my laptop, probably tonight, to make sure that 10 is enough.


New commits:

c3b0124  Trac #32973: increase the default test timeout to ten minutes.

@orlitzky
Contributor Author

orlitzky commented Dec 7, 2021

comment:6

The laptop (Core 2 Duo @ 1.8 GHz) actually fares better than my desktop, but with fewer threads (2 vs 4). So ten minutes looks like a good number.

@tornaria
Contributor

tornaria commented Dec 8, 2021

comment:7

Replying to @orlitzky:

It does look like warn_long is set too high when --long is not given; 60s if I'm not mistaken. I'll see if that can't be improved too.

On my desktop it's set to ~36s (it varies a little bit).

But I think the main reason I found #32964 and others like it is that nobody else is running the test suite without --long. All of the patchbots and GH actions run ptestlong. I accidentally ran it without --long one day, so now I'm finding all the bugs.

In the Void Linux package I'm running the test suite using ./sage -tp $JOBS --all. But the output is very long. A summary of errors shows up at the end so one knows what to look for, but the slow-doctest warnings are not included there.

It might be useful if:

a. The slow doctest warning is summarized at the end (just mentioning the file is useful enough so one can search for it in the log).

b. There's an option to make the run exit with an error on warnings, so this is caught by the CI.

c. Even if test is run with --long, the doctests that are NOT labeled long still emit a warning if they take more time than the warn_long time.

@tornaria
Contributor

tornaria commented Dec 8, 2021

comment:8

Replying to @orlitzky:

New commits:

c3b0124  Trac #32973: increase the default test timeout to ten minutes.

LGTM, although I should mention that I have a box where sage -tp 32 --all takes < 4m wall time, so a rogue test taking 10 minutes to time out will double the wait :-)

Maybe the time can be adaptive (based on second_on_modern_computer()?) like the time for --warn-long is.

@orlitzky
Contributor Author

orlitzky commented Dec 8, 2021

comment:9

Replying to @tornaria:

On my desktop it's set to ~36s (it varies a little bit).

There is some voodoo that happens in sage/doctest/control.py to determine that value. Currently, if your machine has stored timing information from previous test runs, and if not too many of those tests failed, then sixty times second_on_modern_computer() is used. But even that number is based on a quad core i7 from 2014. And if you don't have timing information, or you have too many test failures, it uses no --warn-long value at all.
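In other words, the selection behaves roughly like this (a simplified sketch of the logic described above, not the literal code in sage/doctest/control.py; `stats_available` and `too_many_failures` are made-up names):

```python
def default_warn_long(stats_available, too_many_failures, second_on_modern_computer):
    # Simplified sketch of the --warn-long default selection described above;
    # not the actual implementation in sage/doctest/control.py.
    if stats_available and not too_many_failures:
        try:
            # a nominal 60s budget, scaled relative to the 2014-era reference machine
            return 60.0 * second_on_modern_computer()
        except RuntimeError:
            pass
    # no usable timing information: effectively no --warn-long limit at all
    return float('inf')
```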

My problem is that I usually have too many test failures for it to use the stored timings (testing development branches and package upgrades all the time), so it gives me no --warn-long at all. I've tried to fix this in #32981, but without second_on_modern_computer(), I'm really just guessing at the limit. A CPU-time limit would be a lot easier, but normal users won't know WTF that is.

It might be useful if:

c. Even if test is run with --long, the doctests that are NOT labeled long still emit a warning if they take more time than the warn_long time.

I think this one already happens? The other two are good ideas, I opened #32995 for those.

@orlitzky
Contributor Author

orlitzky commented Dec 8, 2021

comment:10

Replying to @tornaria:

Replying to @orlitzky:

New commits:

c3b0124 Trac #32973: increase the default test timeout to ten minutes.

LGTM, although I should mention that I have a box where sage -tp 32 --all takes < 4m wall time, so a rogue test taking 10 minutes to time out will double the wait :-)

Maybe the time can be adaptive (based on second_on_modern_computer()?) like the time for --warn-long is.

We would still need a fallback if the timing information isn't usable -- if second_on_modern_computer() throws an error. In this case we use this limit for its intended purpose so infrequently that I don't think it's worth the extra trouble; we can just use the fallback value and skip the CPU timing heuristic.

I'm sure it happened once but I don't recall any instances where someone, accidentally or otherwise, uploaded a branch with dangerous code and sent it to the patchbots.

@tornaria
Contributor

tornaria commented Dec 8, 2021

comment:11

Replying to @orlitzky:

Replying to @tornaria:

On my desktop it's set to ~36s (it varies a little bit).

There is some voodoo that happens in sage/doctest/control.py to determine that value. Currently, if your machine has stored timing information from previous test runs, and if not too many of those tests failed, then sixty times second_on_modern_computer() is used. But even that number is based on a quad core i7 from 2014. And if you don't have timing information, or you have too many test failures, it uses no --warn-long value at all.

Aha! That's it. I run tests in a clean chroot with a freshly built sagemath. Hence the warn_long time defaults to infinity, which is bad. Can we make the default a sane value? Even 60s would be small enough to catch #32964 on a fast machine, but maybe it should be smaller.

It's kind of weird that doctesting needs second_on_modern_computer() but the latter needs doctesting. I like reproducible builds so I try to always build and run on a clean chroot.

My problem is that I usually have too many test failures for it to use the stored timings (testing development branches and package upgrades all the time), so it gives me no --warn-long at all. I've tried to fix this in #32981, but without second_on_modern_computer(), I'm really just guessing at the limit. A CPU-time limit would be a lot easier, but normal users won't know WTF that is.

There ought to be a simple way to get a good-enough approximation of second_on_modern_computer() that is usable when stats have not been collected. This is just a timeout, not a benchmark, so anything approximate that gives a reasonably small value on a fast computer while still allowing enough time on slower computers would be good. Then both warn_long and the per-file timeout can be multiples of that, so both fast and slow computers are happy. And it can still be "more precise" when doctest timings are available. I mean: just bogomips may be good enough for this purpose (i.e. better than infinity or a constant).

c. Even if test is run with --long, the doctests that are NOT labeled long still emit a warning if they take more time than the warn_long time.

I think this one already happens? The other two are good ideas, I opened #32995 for those.

I didn't check so it may be true.

@tornaria
Contributor

tornaria commented Dec 8, 2021

comment:12

TL;DR the effective default coming out of second_on_modern_computer() is:

  • infinity (for warn_long)
  • constant (for timeout)

I propose:

  • replace all the RuntimeErrors in second_on_modern_computer() with something nonconstant (e.g. a multiple of 1/bogomips; a rough sketch follows below)
  • set the default test timeout as a multiple of second_on_modern_computer()
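For instance, a rough Linux-only sketch of the 1/bogomips idea (the reference value and function names here are made up for illustration; /proc/cpuinfo is not available on every platform):

```python
def bogomips():
    """Return the bogomips value of the first CPU listed in /proc/cpuinfo,
    or None if it cannot be read (non-Linux systems, for instance)."""
    try:
        with open('/proc/cpuinfo') as f:
            for line in f:
                if line.lower().startswith('bogomips'):
                    return float(line.split(':')[1])
    except (OSError, ValueError):
        pass
    return None

def rough_second_on_modern_computer(reference_bogomips=5000.0):
    """Crude, nonconstant stand-in for second_on_modern_computer():
    the ratio of a hypothetical reference machine's bogomips to this one's."""
    bm = bogomips()
    return reference_bogomips / bm if bm else 1.0  # fall back to a constant
```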

@orlitzky
Contributor Author

orlitzky commented Dec 8, 2021

comment:13

Replying to @tornaria:

TL;DR the effective default coming out of second_on_modern_computer() is:

  • infinity (for warn_long)
  • constant (for timeout)

I propose:

  • replace all the RuntimeErrors in second_on_modern_computer() with something nonconstant (e.g. a multiple of 1/bogomips)
  • set the default test timeout as a multiple of second_on_modern_computer()

This is a good solution, but the devil is in the details. BogoMIPS won't be available on Windows, for example. What else can we use?

The whole thing is really a mess: when you allow sage to choose your --warn-long value, it's essentially trying to use second_on_modern_computer() to turn the wall-time default into a meaningful CPU-time value. But why does --warn-long take a wall-time measurement in the first place? Users may not know about CPU time... but I don't really see how specifying a wall-time is useful: it has to be converted into CPU time to work on other machines.

The --timeout value, on the other hand, has to be wall-time, because the doctest runner looks at the wall time to see if the process should be killed yet. Ugh. But at least no harm is done here if we set the value too high.

@orlitzky
Contributor Author

orlitzky commented Dec 8, 2021

comment:14

I've switched --warn-long to CPU time in #32981. That should nicely fix half of the mess, leaving only the question of how to use second_on_modern_computer() to adjust the timeout.
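For reference, the wall-time vs. CPU-time distinction being drawn in this thread corresponds to the two standard clocks in Python (illustrative only; not how Sage's doctest runner is actually structured):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and report both wall-clock and CPU time (illustration only)."""
    wall_start = time.monotonic()     # wall clock: advances during sleeps, I/O, other load
    cpu_start = time.process_time()   # CPU time consumed by this process only
    result = fn(*args)
    return result, time.monotonic() - wall_start, time.process_time() - cpu_start
```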

@tornaria
Contributor

tornaria commented Dec 9, 2021

comment:15

Replying to @orlitzky:

Replying to @tornaria:

  • replace all the RuntimeErrors in second_on_modern_computer() with something nonconstant (e.g. a multiple of 1/bogomips)

This is good solution, but the devil is in the details. BogoMIPS won't be available on Windows, for example. What else can we use?

I meant any stupid benchmark; whichever it is, I don't think it's very important. This is just so the default timeout can be reasonably small on fast CPUs while being large enough on your slow computer. The first silly thing that I could think of:

def secs_approx():
    def _work():
        import time
        start = time.process_time()
        # fixed amount of work: (p-1)! mod p for the prime p = 2^16 + 1,
        # so by Wilson's theorem the assertion checks the result
        p = 2**16 + 1
        a = 1
        for i in range(1, p):
            a = a * i % p
        assert a + 1 == p
        return time.process_time() - start
    # best of 5 runs, scaled so "modern" machines come out near 1
    return round(100 * min(_work() for i in range(5)), 2)

print("seconds_on_modern_computer approximation =", secs_approx())

This gives me:

  • 1.43 on a Xeon E5520 @ 2.27GHz (nehalem)
  • 0.81 on a i7-3930K @ 3.20GHz (sandy bridge e)
  • 0.66 on a Xeon Gold 6240 @ 2.60GHz (cascade lake)
  • 0.64 on a i7-9700 @ 3.00GHz (coffee lake)

What do you get? Does it look like a reasonable approximation to second_on_modern_computer()?

Would K * secs_approx() be a reasonable timeout? I'm guessing your slow cpu (core 2?) will give more than 2 so K = 300 or less should give you more than 10 minutes while giving 3 minutes or less on recent cpus.

The whole thing is really a mess: when you allow sage to choose your --warn-long value, it's essentially trying to use second_on_modern_computer() to turn the wall-time default into a meaningful CPU-time value.

I fail to see how this is a wall vs cpu time issue. I agree 100% on your switch to cputime, but this is orthogonal: different cpus can do different amounts of work in the same cpu time.

The stupid benchmark above is trying to do a "fixed amount of work" (whatever that means) and measuring the cputime taken gives a rough conversion factor between cpu time and work.

The --timeout value, on the other hand, has to be wall-time, because the doctest runner looks at the wall time to see if the process should be killed yet. Ugh. But at least no harm is done here if we set the value too high.

Fair enough, that's ok on the assumption that each doctest runs in a single thread. AFAIK the doctest runner parallelizes by running multiple independent doctests in parallel and OMP_NUM_THREADS is set to 1 (or 2?). It's still useful to have a reasonable approximate conversion factor as above that depends on the cpu (not on number of threads), so the timeout on a core 2 is still 10 minutes wall time but the timeout on a coffee lake is 2-3 minutes at most.

NOTE: my suggestion above is a REALLY dumb proxy for second_on_modern_computer(). We could keep the factor calculated using doctest stats if available. OTOH: this is also broken since the doctest stats are stored in $HOME/.sage without hostname; my $HOME is shared between my nehalem and my coffee lake and the doctest times have a factor of 2 between them, so using my stupid benchmark might be better in some cases (I do like to build on coffee lake then test on nehalem so I know if there are illegal instructions).

It might be better to do it in cython (with a larger value of p) to get a more meaningful value (i.e. closer to what sage actually does), but I bet this is already good enough.

@tornaria
Contributor

tornaria commented Dec 9, 2021

comment:16

Replying to @orlitzky:

c. Even if test is run with --long, the doctests that are NOT labeled long still emit a warning if they take more time than the warn_long time.

I think this one already happens? The other two are good ideas, I opened #32995 for those.

After your patch in #32981, doctests warn if > 10s when not using --long but they only warn if > 60s when using --long. Hence you will still miss doctests taking 30s if you only run using --long.

I would expect that "non-long" doctests warn at the same default time regardless of --long. Arguably, there could be a different time value to trigger warnings on "long" doctests.

Maybe:

  • keep warn_long configurable as now and default to 10s as in your patch, and apply to non-long doctests regardless of --long
  • for long doctests: either don't warn (they are labeled long, after all) or else have a hardcoded value at 60s (there could be another configuration option, but isn't that overdoing it?).

Also IMHO all default values should be multiples of second_on_modern_computer().
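A compact sketch of the warning rule proposed in the "Maybe:" list above (names and thresholds are illustrative, not Sage's actual options; 10s is the default from the patch in #32981, 60s the hardcoded value suggested for long doctests):

```python
def should_warn(cpu_time, is_long_test, warn_long=10.0, warn_long_for_long=60.0):
    # Proposed rule (sketch): non-long doctests always warn past warn_long,
    # regardless of --long, while tests tagged "# long time" only warn past
    # a more generous hardcoded limit.
    limit = warn_long_for_long if is_long_test else warn_long
    return cpu_time > limit
```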

@orlitzky
Contributor Author

comment:17

Replying to @tornaria:

I meant any stupid benchmark; whichever it is, I don't think it's very important. This is just so the default timeout can be reasonably small on fast CPUs while being large enough on your slow computer. The first silly thing that I could think of:

...

This gives me:
...

What do you get? Does it look like a reasonable approximation to second_on_modern_computer()?

I get between 1.4 and 1.5 on my desktop (the slower of the two).

Would K * secs_approx() be a reasonable timeout? I'm guessing your slow cpu (core 2?) will give more than 2 so K = 300 or less should give you more than 10 minutes while giving 3 minutes or less on recent cpus.

Disclaimer: I still think this is overkill for the timeout value, which is intentionally used maybe once a decade. But yes, in general, I think the idea of a fixed amount of work (rather than a fixed amount of time) is preferable.

The whole thing is really a mess: when you allow sage to choose your --warn-long value, it's essentially trying to use second_on_modern_computer() to turn the wall-time default into a meaningful CPU-time value.

I fail to see how this is a wall vs cpu time issue. I agree 100% on your switch to cputime, but this is orthogonal: different cpus can do different amounts of work in the same cpu time.

I know, but wall time is simply the worst. I can mess up the wall time by e.g. running the computation at 2am when daylight savings time switches on/off. CPU time is still flawed, but is always going to be a better metric than wall-time.

The stupid benchmark above is trying to do a "fixed amount of work" (whatever that means) and measuring the cputime taken gives a rough conversion factor between cpu time and work.

There are some things we'll never be able to account for. New versions of python may change how much work is involved in the "fixed" amount of work. Many sage doctests depend on Cython or C libraries whose efficiency depends on CFLAGS, the linker used, etc. Python itself (being written in C) is subject to the same uncertainties.

Still, I think if we're careful, we can do a lot better than the existing second_on_modern_computer().

@orlitzky
Contributor Author

comment:18

Replying to @tornaria:

Maybe:

  • keep warn_long configurable as now and default to 10s as in your patch, and apply to non-long doctests regardless of --long
  • for long doctests: either don't warn (they are labeled long, after all) or else have a hardcoded value at 60s (there could be another configuration option, but isn't that overdoing it?).

I think a separate option is the best way to do it cleanly. We'll probably wind up with less code that way than if we add a bunch of special cases to guess at what should happen. (These options are for developers only, so I'm not too worried about overcomplicating the UI.) I added another item to #32995 for this.

Also IMHO all default values should be multiples of second_on_modern_computer().

Yes, this makes more sense for --warn-long than it does for --timeout. I've opened #33022 specifically for it.

I propose the following:

  1. Merge this and #32981 (Use CPU time for --warn-long and a low default limit for non-long tests) to get things working again, if non-optimally.
  2. In #33022 (Replace second_on_modern_computer() with a fixed amount of work), replace second_on_modern_computer() with something that works better, and reintroduce it after it has been tested on enough platforms (basically I just want to see what numbers it comes up with on the GH actions). Having done #32981 first, the multiplier will affect CPU time and not wall time -- as it should be.
  3. Use #32995 (improvements to --warn-long test flag) for the separate long and non-long limits, or whatever else we decide to do to solve that problem.

@mkoeppe mkoeppe modified the milestones: sage-9.5, sage-9.6 Jan 10, 2022
@mkoeppe mkoeppe modified the milestones: sage-9.6, sage-9.7 May 3, 2022
@mkoeppe mkoeppe modified the milestones: sage-9.7, sage-9.8 Sep 19, 2022
@mkoeppe mkoeppe removed this from the sage-9.8 milestone Jan 29, 2023
vbraun pushed a commit to vbraun/sage that referenced this issue Dec 4, 2023
…t timeout

    
When running the test suite on an older machine, many files time out.
For example,

```
$ sage -t src/sage/manifolds/differentiable/tensorfield.py
...
File "src/sage/manifolds/differentiable/tensorfield.py", line 248,
in sage.manifolds.differentiable.tensorfield.TensorField
Warning: Consider using a block-scoped tag by inserting the line
'sage: # long time' just before this line to avoid repeating the
tag 4 times
  s = t(a.restrict(U), b) ; s  # long time
  Timed out (and interrupt failed)
...
----------------------------------------------------------------------
Total time for all tests: 360.3 seconds
  cpu time: 0.0 seconds
  cumulative wall time: 0.0 seconds
```

This has run over the default (non-long) test timeout of 300s. This
commit doubles that default, a change that should be unobjectionable for
a few reasons:

  1. This timeout is a last line of defense intended to keep the test
suite from looping forever when run unattended. For that purpose, ten
minutes is as good as five.

  2. As more tests get added to each file, those files take longer to
test on the same hardware. It should therefore be expected that we will
sometimes need to increase the timeout. (Basically, if anyone is hitting
it, it's too low.)

  3. We now use Github CI instead of patchbots for most automated
testing, and Github has its own timeout.

  4. There is a separate mechanism, --warn-long, intended to catch tests
that run for too long. The test timeout should not be thought of as a
solution to that problem.

Closes: sagemath#32973
    
URL: sagemath#36223
Reported by: Michael Orlitzky
Reviewer(s):
vbraun pushed a commit to vbraun/sage that referenced this issue Dec 5, 2023
…t timeout
@vbraun vbraun closed this as completed in 0737944 Dec 6, 2023
@mkoeppe mkoeppe added this to the sage-10.3 milestone Dec 6, 2023