Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smaller block size and add _col2LastId to block metadata #917

Merged
merged 9 commits into from
Apr 4, 2023

Conversation

hannahbast
Copy link
Member

@hannahbast hannahbast commented Mar 19, 2023

1. The block size used to be 1 << 23 (over 8M), which is too large, since we always need to decompress at least one whole block, even when reading only few triples. It's now 100'000, which still has a relatively small overall space consumption.

2. Add member _col2LastId to block data because we will need it for #916 and it's nice if our live indexes already have this information so that we can play around with this PR without having to rebuild indexes.

3. Unrelated fix in IndexTestHelpers.h: The test TTL file was not deleted after the test, now it is.

Hannah Bast added 2 commits March 19, 2023 20:48
1. The block size used to be `1 << 23` (over 8M), which is too large,
   since we always need to decompress at least one whole block, even
   when reading only few triples. It's now 100'000, which still has a
   small relatively small overall space consumption.

2. Add member `_col2LastId` to block data because we need it for the
   delta triples (#916).
1. The test in `IndexMetaDataTest` needs to be adapted to the addition
   of `_col2LastId`.

2. Unrelated fix in `IndexTestHelpers.h`: The test TTL file was not
   deleted after the test, now it is.
@codecov
Copy link

codecov bot commented Mar 19, 2023

Codecov Report

Patch coverage: 92.07% and project coverage change: -0.02 ⚠️

Comparison is base (8b65e4d) 71.86% compared to head (60be15f) 71.85%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #917      +/-   ##
==========================================
- Coverage   71.86%   71.85%   -0.02%     
==========================================
  Files         243      243              
  Lines       23860    23863       +3     
  Branches     3083     3083              
==========================================
- Hits        17148    17146       -2     
- Misses       5392     5395       +3     
- Partials     1320     1322       +2     
Impacted Files Coverage Δ
src/index/ConstantsIndexBuilding.h 100.00% <ø> (ø)
src/index/MetaDataHandler.h 76.57% <45.45%> (ø)
src/index/CompressedRelation.cpp 98.16% <96.82%> (+0.01%) ⬆️
src/index/CompressedRelation.h 100.00% <100.00%> (ø)
src/index/IndexMetaDataImpl.h 97.95% <100.00%> (ø)

... and 2 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@hannahbast hannahbast requested a review from joka921 March 19, 2023 20:41
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much.
Maybe we could build an index for one of the larger collections before merging this to evaluate the impact of the different block sizes.

@hannahbast
Copy link
Member Author

hannahbast commented Mar 20, 2023

Thank you very much. Maybe we could build an index for one of the larger collections before merging this to evaluate the impact of the different block sizes.

I very much agree! I have already built several indexes for DBLP with a variety of block sizes, where 100K looked like a good compromise with essentially the same total size of the .index.* files compared to the old block size of 8.4M. I also built an index for Freebase, where a block size of 100K leads to a total size of these files of 93 GB, compared to only 79 GB for the old block size. But Freebase is a bit special with the many predicates + the old index also had the bug with the wrong block sizes for SPO, SOP, OSP, and OPS. I am currently building a Wikidata index with the new block size, it should be finished in the early afternoon.

@hannahbast
Copy link
Member Author

hannahbast commented Mar 21, 2023

@joka921 We used to have some version bytes in our index files or metadata so that the server gives a proper error message when the index does not have the right format (instead of a cryptic ERROR: vector::_M_default_append message).

Is this mechanism still in place and working? If yes, how does it work? In the simplest case, it should simply be increasing a version number when making a change to the index format that is not backward-compatible, right?

PS: I am currently building a new Wikidata index with block size 1M and STXXL_MEMORY_GB = 20. It should be finished in the early afternoon. I am curious whether the index build will be faster than the previous one and how it will perform.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much, feel free to merge this.

@joka921 joka921 merged commit 499c612 into master Apr 4, 2023
@joka921 joka921 deleted the smaller-blocks branch April 4, 2023 09:28
@sonarqubecloud
Copy link

sonarqubecloud bot commented Apr 4, 2023

Please retry analysis of this Pull-Request directly on SonarCloud.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants