[pex] Memoize calls to Crawler.crawl() for performance win in find-links based resolution. #187
While investigating a user-reported performance issue in the creation of pex files (via pants), profiling revealed roughly 60% of the total time being spent in pex's Crawler.crawl() function, across more than 47 calls. Note that calls to Crawler.crawl() involve use of the re module to match HTML tags from a given index page (in our case, one with roughly 5000+ files), and excessive application of the re module is a known culprit for slowness.

I was able to approximately repro the same scenario directly with pex:
..and further inspection here revealed upwards of 24 non-cached Crawler.crawl() calls just for these 5 dependencies.
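The regex cost is easy to illustrate at that scale. The pattern and synthetic page below are made up for illustration and are not pex's actual parser; they just show the per-crawl work that gets repeated when the same index page is crawled again without a cache:

```python
import re

# Hypothetical link-extraction pattern, standing in for pex's HTML parsing.
HREF_RE = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)

# Synthetic index page with 5000 links, roughly the scale described above.
page = "\n".join(
    '<a href="pkg-%d.tar.gz">pkg-%d</a>' % (i, i) for i in range(5000)
)

# Each uncached crawl of this page re-runs the pattern over the whole page.
links = HREF_RE.findall(page)
```

Running this extraction once is cheap; running it dozens of times for the same page, as the uncached call counts above suggest, is where the time goes.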
This PR memoizes calls to Crawler.crawl() to save overhead on subsequent calls, yielding a ~2x speedup in the above test case:
before
after
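The memoization pattern can be sketched as follows. The Crawler class here is a simplified stand-in, not pex's actual implementation; the cache-key construction and method names are assumptions made for illustration:

```python
class Crawler(object):
    """Stand-in for pex's Crawler; the memoization pattern, not the
    crawl logic, is the point of this sketch."""

    _CRAWL_CACHE = {}  # class-level cache: one entry per unique call signature
    calls = 0          # counts actual (non-cached) crawls, for illustration

    def crawl(self, link_or_links, follow_links=False):
        links = (link_or_links
                 if isinstance(link_or_links, (list, tuple, set))
                 else [link_or_links])
        # Hashable, order-insensitive key so equivalent calls share a result.
        key = (tuple(sorted(links)), follow_links)
        if key not in Crawler._CRAWL_CACHE:
            Crawler._CRAWL_CACHE[key] = self._do_crawl(links, follow_links)
        return Crawler._CRAWL_CACHE[key]

    def _do_crawl(self, links, follow_links):
        # Placeholder for the expensive regex-driven page parse.
        Crawler.calls += 1
        return frozenset(links)


crawler = Crawler()
first = crawler.crawl("https://example.com/simple/")
second = crawler.crawl("https://example.com/simple/")  # served from cache
```

The key detail is that the arguments must be reduced to something hashable (and ideally order-insensitive) before lookup; the actual PR may key the cache differently.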