Smaller block size and add _col2LastId to block metadata #917
1. The block size used to be `1 << 23` (over 8M), which is too large: we always have to decompress at least one whole block, even when reading only a few triples. It is now 100'000, which still keeps the overall space consumption relatively small. 2. Add the member `_col2LastId` to the block metadata because we will need it for the delta triples (#916), and it is nice if our live indexes already have this information, so that we can experiment with that PR without having to rebuild indexes.
1. The test in `IndexMetaDataTest` needs to be adapted to the addition of `_col2LastId`. 2. Unrelated fix in `IndexTestHelpers.h`: the test TTL file was not deleted after the test; now it is.
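The purpose of `_col2LastId` can be illustrated with a minimal sketch. Note that all names other than `_col2LastId` (the `Id` alias, `BlockMetadata`, `mayContain`, the other members) are hypothetical, chosen for illustration, and are not QLever's actual types or API:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical ID type; the actual Id type in the codebase differs.
using Id = uint64_t;

// Sketch of per-block metadata. Storing the last value of the second
// column (_col2LastId) in addition to the column-1 range lets a reader
// decide, without decompressing the block, whether a given triple can
// possibly be contained in it.
struct BlockMetadata {
  Id _col1FirstId;  // first value of column 1 in this block
  Id _col1LastId;   // last value of column 1 in this block
  Id _col2FirstId;  // first value of column 2 in this block
  Id _col2LastId;   // last value of column 2 in this block (the new member)
};

// Return true iff a triple with (col1, col2) might lie inside the block.
bool mayContain(const BlockMetadata& b, Id col1, Id col2) {
  if (col1 < b._col1FirstId || col1 > b._col1LastId) return false;
  // If the whole block has the same column-1 value, the column-2 range
  // is also decisive and can be used to skip the block.
  if (b._col1FirstId == b._col1LastId) {
    return col2 >= b._col2FirstId && col2 <= b._col2LastId;
  }
  return true;
}
```

With smaller blocks, this kind of metadata check matters more, because there are many more blocks whose decompression we want to skip.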
Codecov Report

@@            Coverage Diff             @@
##           master     #917      +/-   ##
==========================================
- Coverage   71.86%   71.85%   -0.02%
==========================================
  Files         243      243
  Lines       23860    23863       +3
  Branches     3083     3083
==========================================
- Hits        17148    17146       -2
- Misses       5392     5395       +3
- Partials     1320     1322       +2

... and 2 files with indirect coverage changes. View full report in Codecov by Sentry.
Thank you very much.
Maybe we could build an index for one of the larger collections before merging this to evaluate the impact of the different block sizes.
I very much agree! I have already built several indexes for DBLP with a variety of block sizes, where 100K looked like a good compromise with essentially the same total size of the
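The tradeoff being evaluated here can be made concrete with a back-of-the-envelope calculation. The triple count and the helper functions below are illustrative assumptions, not measurements from the actual indexes:

```cpp
#include <cassert>
#include <cstddef>

// For a scan that touches only a few triples, we still have to decompress
// at least one whole block, so the per-lookup decompression work scales
// with the block size. The metadata overhead, on the other hand, scales
// with the number of blocks.

// Number of blocks needed for a given total triple count (ceiling division).
constexpr size_t numBlocks(size_t numTriples, size_t blockSize) {
  return (numTriples + blockSize - 1) / blockSize;
}

// Triples decompressed for a lookup that needs only a few triples:
// at least one full block.
constexpr size_t lookupWork(size_t blockSize) { return blockSize; }
```

For an illustrative collection of 10^9 triples, going from `1 << 23` to 100'000 shrinks the minimum per-lookup decompression work by roughly a factor of 84, while the number of metadata entries grows from about 120 to 10'000, which is still tiny.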
@joka921 We used to have some version bytes in our index files or metadata so that the server gives a proper error message when the index does not have the right format (instead of a cryptic one). Is this mechanism still in place and working? If yes, how does it work? In the simplest case, it should simply be a version number that is increased whenever we make a change to the index format that is not backward-compatible, right?

PS: I am currently building a new Wikidata index with block size 1M and STXXL_MEMORY_GB = 20. It should be finished in the early afternoon. I am curious whether the index build will be faster than the previous one and how it will perform.
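The versioning mechanism asked about above commonly looks like the following sketch. The constant name, version value, and function are made up for illustration; they are not QLever's actual mechanism:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical format version: bumped whenever the on-disk index format
// changes in a backward-incompatible way (e.g. adding _col2LastId to the
// block metadata).
constexpr uint32_t kIndexFormatVersion = 2;

// Check a version number read from the index header and fail with a clear
// message, instead of a cryptic error later during deserialization.
void checkIndexVersion(uint32_t versionOnDisk) {
  if (versionOnDisk != kIndexFormatVersion) {
    throw std::runtime_error(
        "Index was built with format version " +
        std::to_string(versionOnDisk) + ", but this binary expects version " +
        std::to_string(kIndexFormatVersion) + ". Please rebuild the index.");
  }
}
```

The check runs once at server startup, so a stale index is rejected before any query touches the block data.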
Thank you very much, feel free to merge this.
…with `trailingUnderscores_`