Performance issue on big projects #53

Closed
boxed opened this issue Dec 7, 2016 · 9 comments

@boxed
Contributor

boxed commented Dec 7, 2016

We have a big project with a big test suite. When starting pytest with testmon enabled, it takes something like 8 minutes just to start up, even when running (almost) no tests. A profile dump reveals this:

Wed Dec  7 14:37:13 2016    testmon-startup-profile

         353228817 function calls (349177685 primitive calls) in 648.684 seconds

   Ordered by: cumulative time
   List reduced from 15183 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001  648.707  648.707 env/bin/py.test:3(<module>)
 10796/51    0.006    0.000  648.614   12.718 /Users/andersh/triresolve/env/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py:335(_hookexec)
 10796/51    0.017    0.000  648.614   12.718 /Users/andersh/triresolve/env/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py:332(<lambda>)
 11637/51    0.063    0.000  648.614   12.718 /Users/andersh/triresolve/env/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py:586(execute)
        1    0.000    0.000  648.612  648.612 /Users/andersh/triresolve/env/lib/python2.7/site-packages/_pytest/config.py:29(main)
  10596/2    0.016    0.000  648.612  324.306 /Users/andersh/triresolve/env/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py:722(__call__)
        1    0.000    0.000  562.338  562.338 /Users/andersh/triresolve/env/lib/python2.7/site-packages/testmon/pytest_testmon.py:80(pytest_cmdline_main)
        1    0.000    0.000  562.338  562.338 /Users/andersh/triresolve/env/lib/python2.7/site-packages/testmon/pytest_testmon.py:70(init_testmon_data)
        1    0.004    0.004  562.338  562.338 /Users/andersh/triresolve/env/lib/python2.7/site-packages/testmon/testmon_core.py:258(read_fs)
     4310    1.385    0.000  545.292    0.127 /Users/andersh/triresolve/env/lib/python2.7/site-packages/testmon/testmon_core.py:224(test_should_run)
     4310    3.995    0.001  542.647    0.126 /Users/andersh/triresolve/env/lib/python2.7/site-packages/testmon/testmon_core.py:229(<dictcomp>)
  4331550   54.292    0.000  538.652    0.000 /Users/andersh/triresolve/env/lib/python2.7/site-packages/testmon/process_code.py:104(checksums)
        1    0.039    0.039  537.138  537.138 /Users/andersh/triresolve/env/lib/python2.7/site-packages/testmon/testmon_core.py:273(compute_unaffected)
 73396811   67.475    0.000  484.571    0.000 /Users/andersh/triresolve/env/lib/python2.7/site-packages/testmon/process_code.py:14(checksum)
 73396871  360.852    0.000  360.852    0.000 {method 'encode' of 'str' objects}
        1    0.000    0.000   83.370   83.370 /Users/andersh/triresolve/env/lib/python2.7/site-packages/_pytest/main.py:118(pytest_cmdline_main)

As you can see, the last line is only about 83 seconds cumulative, while the two lines above it (str.encode and checksum) are 360 and 484 seconds respectively.

This hurts our use case a LOT, and since we use a reference .testmondata file that has been produced by a CI job, it seems excessive (and useless) to recalculate this on each machine when it could be calculated once up front.

So, what do you guys think about caching this data in .testmondata?
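For illustration only, a minimal sketch of the caching idea, assuming .testmondata is a SQLite file opened as conn; the block_cache table and both helpers below are hypothetical, not testmon's actual schema or API:

    import hashlib

    def file_fingerprint(path):
        # Cheap whole-file hash used as the cache key.
        with open(path, "rb") as f:
            return hashlib.sha1(f.read()).hexdigest()

    def block_checksums(conn, path, compute_blocks):
        fp = file_fingerprint(path)
        row = conn.execute(
            "SELECT checksums FROM block_cache WHERE path = ? AND fingerprint = ?",
            (path, fp),
        ).fetchone()
        if row:
            return row[0].split(",")      # reuse what the CI job already computed
        sums = compute_blocks(path)       # the expensive parsing/checksum path
        conn.execute(
            "INSERT OR REPLACE INTO block_cache (path, fingerprint, checksums)"
            " VALUES (?, ?, ?)",
            (path, fp, ",".join(sums)),   # sums assumed to be strings
        )
        return sums

Keying on a content fingerprint rather than mtime is what would make a CI-produced .testmondata reusable on developer machines, since mtimes never survive a fresh checkout.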

@tarpas
Owner

tarpas commented Dec 7, 2016

I actually did a lot of optimization for the case of big projects with very few changes, so this might be a regression.

Does it also happen if you create and consume .testmondata on the same machine? The read_fs function shouldn't do any source code processing or checksumming if the file's modification time on the filesystem matches the modification time stored in .testmondata.

Does it happen on second run too?

The source code crunching and checksumming is a joke in terms of efficiency. It was built in a way that let me learn the AST library and the problem space in general.

I actually find it hard to believe that str.encode is the slow operation among all the string manipulation and all the loops going on there.
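As a back-of-the-envelope check on that claim, one could time a single encode in isolation and multiply by the call count from the profile (the block text below is made up):

    import timeit

    block = "def foo(x):\n    return x + 1\n" * 20   # a made-up code block

    per_call = timeit.timeit(lambda: block.encode("utf-8"), number=100000) / 100000
    print("one encode: %.2f microseconds" % (per_call * 1e6))
    # The profile shows roughly 73 million encode calls:
    print("73 million encodes: ~%.0f seconds" % (per_call * 73e6))

If a single encode of a typical block costs a few microseconds, 73 million of them would land in the same region as the 360 seconds shown in the profile.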

@boxed
Contributor Author

boxed commented Dec 7, 2016 via email

@tarpas
Owner

tarpas commented Dec 7, 2016

Now I noticed #52. I think that's the cause of this issue. testmon_data.mtimes also stores absolute paths, and mtimes is the optimization that avoids parsing the whole source tree when the files have barely changed.
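For context, a tiny sketch of the kind of normalization #52 asks for, assuming the mtime cache is keyed by file path (the helper and the example paths below are illustrative):

    import os

    def cache_key(path, rootdir):
        # Key the mtime/checksum cache by a path relative to the project root,
        # not by an absolute path, so CI and developer machines agree.
        return os.path.relpath(os.path.abspath(path), rootdir)

    # Example:
    # cache_key("/Users/andersh/triresolve/app/models.py",
    #           "/Users/andersh/triresolve")  ->  "app/models.py"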

@boxed
Contributor Author

boxed commented Dec 8, 2016 via email

@tarpas
Owner

tarpas commented Dec 8, 2016

f533c2a partially addresses this issue

@tarpas tarpas mentioned this issue Dec 8, 2016
@boxed
Contributor Author

boxed commented Dec 8, 2016

The overhead of encode is pretty terrifying too. In Python 2 it's also, as far as I understand, 100% redundant. Does this mean we can optimize the Python 2 path nicely but the Python 3 version must be super slow? Or is UTF-8 encoding of str in Python 3 super fast?

Something seems weird here :P
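One possible way around the per-call cost, sketched here under the assumption that checksum() currently encodes each block's text before hashing (which the matching call counts in the profile suggest), is to hash bytes directly so neither Python 2 nor Python 3 pays for an encode:

    import zlib

    def checksum_bytes(block_bytes):
        # zlib.adler32 accepts bytes on both Python 2 and 3, so computing the
        # checksum from bytes read straight off disk avoids str.encode entirely.
        return zlib.adler32(block_bytes) & 0xffffffff

    with open("some_module.py", "rb") as f:   # hypothetical file
        print(checksum_bytes(f.read()))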

@tarpas
Owner

tarpas commented Dec 9, 2016

@boxed reports that this is not solved by #54. The solution is probably to make the file change detection two-stage (a rough sketch follows the list):

  1. modified time;
     if changed,
  2. checksum of the whole file;
     if changed, then proceed to the expensive parsing and block comparison, etc.

tarpas added a commit that referenced this issue Dec 16, 2016
…use case where modified times changed but contents of files didn't. re #53
@tarpas
Owner

tarpas commented Dec 16, 2016

@boxed Could you confirm this works for you and that there is no obvious regression? (It's in master.) I think it's working and it's ready for release.

@boxed
Contributor Author

boxed commented Dec 17, 2016 via email

@tarpas tarpas closed this as completed Dec 17, 2016