Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize performance when Git is used as storage repository #1121

Merged
merged 11 commits into from
Nov 27, 2024
Merged

Conversation

Ndpnt
Copy link
Member

@Ndpnt Ndpnt commented Nov 26, 2024

This PR implements several optimizations to improve performance and memory usage:

Technical improvements:

  • Replace wildcard patterns in git log commands with specific paths
  • Enable commit-graph for faster commit history traversal (fixes Consider taking advantage of git commit-graph #1101)
  • Release HTML/PDF content from memory after term extraction to reduce memory footprint
  • Remove unnecessary commit hooks

Performance benchmarks:

Testing was performed on two distinct datasets:

ToS;DR Collection

Operation Before After Improvement
Extraction 3h 17m 42s 42s 99.6% faster
Tracking 1h 19m 8s 4m 52s 93.8% faster

P2B Compliance Collection

Operation Before After Improvement
Extraction 10m 50s 1m 25s 87.0% faster
Tracking 16m 15s 11m 15s 30.8% faster

The optimizations show greater impact on collections with larger numbers of declared services and terms.

Fixes #1101. Special thanks to @michielbdejong for reporting the issue, investigating, and providing valuable suggestions 🙏

@Ndpnt Ndpnt requested a review from MattiSG November 26, 2024 16:00
Copy link
Member

@MattiSG MattiSG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the benchmarks, this looks amazing!
Thanks a lot @michielbdejong for the experiments and suggestions, I'm thrilled to see this land and am looking forward to the impact this will have on tracking performance across collections!

CHANGELOG.md Outdated Show resolved Hide resolved
return this.recordVersion(terms, extractOnly);
await this.recordVersion(terms, extractOnly);

terms.sourceDocuments.forEach(sourceDocument => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just do terms.sourceDocuments = [] or terms.sourceDocuments = null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because doing so would result in losing information needed for the next run, such as location, contentSelectors, …

Co-authored-by: Matti Schneider <[email protected]>
@Ndpnt Ndpnt merged commit 78d0153 into main Nov 27, 2024
8 checks passed
@Ndpnt Ndpnt deleted the performance branch November 27, 2024 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider taking advantage of git commit-graph
2 participants