[pex] Memoize calls to Crawler.crawl() for performance win in find-links based resolution. #187
While investigating a user-reported performance issue in the creation of pex files (via pants), profiling revealed roughly 60% of the total time being spent in pex's Crawler.crawl() function, across more than 47 calls. Note that calls to Crawler.crawl() involve use of the re module to match HTML tags from a given index page (in our case, one with roughly 5000+ files), and excessive application of the re module is a known culprit for slowness.

I was able to approximately repro the same scenario directly with pex:
..and further inspection here revealed upwards of 24 non-cached Crawler.crawl() calls just for these 5 dependencies.
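The regex cost is easy to illustrate at that scale. The pattern and synthetic page below are made up for illustration and are not pex's actual parser; they just show the per-crawl work that gets repeated when the same index page is crawled again without a cache:

```python
import re

# Hypothetical link-extraction pattern, standing in for pex's HTML parsing.
HREF_RE = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)

# Synthetic index page with 5000 links, roughly the scale described above.
page = "\n".join(
    '<a href="pkg-%d.tar.gz">pkg-%d</a>' % (i, i) for i in range(5000)
)

# Each uncached crawl of this page re-runs the pattern over the whole page.
links = HREF_RE.findall(page)
```

Running this extraction once is cheap; running it dozens of times for the same page, as the uncached call counts above suggest, is where the time goes.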
This PR memoizes calls to Crawler.crawl() to save overhead on subsequent calls, yielding a ~2x speedup in the above test case:
before
after
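The memoization pattern can be sketched as follows. The Crawler class here is a simplified stand-in, not pex's actual implementation; the cache-key construction and method names are assumptions made for illustration:

```python
class Crawler(object):
    """Stand-in for pex's Crawler; the memoization pattern, not the
    crawl logic, is the point of this sketch."""

    _CRAWL_CACHE = {}  # class-level cache: one entry per unique call signature
    calls = 0          # counts actual (non-cached) crawls, for illustration

    def crawl(self, link_or_links, follow_links=False):
        links = (link_or_links
                 if isinstance(link_or_links, (list, tuple, set))
                 else [link_or_links])
        # Hashable, order-insensitive key so equivalent calls share a result.
        key = (tuple(sorted(links)), follow_links)
        if key not in Crawler._CRAWL_CACHE:
            Crawler._CRAWL_CACHE[key] = self._do_crawl(links, follow_links)
        return Crawler._CRAWL_CACHE[key]

    def _do_crawl(self, links, follow_links):
        # Placeholder for the expensive regex-driven page parse.
        Crawler.calls += 1
        return frozenset(links)


crawler = Crawler()
first = crawler.crawl("https://example.com/simple/")
second = crawler.crawl("https://example.com/simple/")  # served from cache
```

The key detail is that the arguments must be reduced to something hashable (and ideally order-insensitive) before lookup; the actual PR may key the cache differently.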