
Horizon Lite: Come up with better index and meta compression scheme #4497

Closed
Tracked by #4571
2opremio opened this issue Aug 2, 2022 · 2 comments


2opremio commented Aug 2, 2022

Stored ledger metadata, and even more so the indexes, are occupying a lot of space:

The full metadata files occupy ~8 TB, and we don't apply any compression scheme to them.

A preliminary test by @Shaptic of indices built across 100 checkpoints (6400 ledgers) tells us the following:

  • 1.2 GiB in total size
  • Most individual indices were ≤ 250 bytes
  • Compressing the entire set of indices into a single .tar.gz file reduces the size by ~44%. Note that this is different from compressing individual indices (which we already do).

Extrapolating this to a year of history (which comes with some big assumptions, like linear growth of indices with history) gives us ~1 TB of raw indices.
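The ~44% figure above reflects the fact that compressing the whole set at once lets the compressor exploit redundancy shared *across* files, which per-file compression cannot see. A minimal stdlib sketch of that contrast, using made-up, deliberately similar byte strings as stand-ins for small index files:

```python
import gzip
import io
import tarfile

# Synthetic stand-ins for small index files: similar structure, small payloads.
samples = [
    b"index-header|checkpoint=%05d|" % i + b"A" * 100 + b"|trailer"
    for i in range(200)
]

# Size when each index is gzipped on its own (what we already do today).
individual = sum(len(gzip.compress(s)) for s in samples)

# Size when the whole set is bundled into one .tar.gz, so the compressor
# can reference redundancy shared across neighbouring files.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for i, s in enumerate(samples):
        info = tarfile.TarInfo(name="index-%d" % i)
        info.size = len(s)
        tar.addfile(info, io.BytesIO(s))
combined = len(buf.getvalue())

print("individually gzipped:", individual, "bytes")
print("single .tar.gz:      ", combined, "bytes")
```

With payloads this similar, the single archive comes out much smaller than the sum of the per-file gzips, at the cost of losing random access to individual indices (which is exactly why a shared-dictionary scheme is attractive instead).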

(Details and caveats are captured in this Slack thread. We can update this once a larger build is complete.)


@2opremio predicts we may be able to do much better with zstd, using a common dictionary for all files: https://github.com/facebook/zstd#the-case-for-small-data-compression. This would allow us to:

  1. Compress/decompress faster (since the dictionary is precomputed and can be cached)
  2. Increase compression ratio
  3. I think it would still allow range queries within the compressed data (this was a concern from @bartekn): since the dictionary is kept separate, we can point directly to a zstd frame and an offset within it. See https://datatracker.ietf.org/doc/html/rfc8878#section-3.1

On the other hand, we would need to track a separate compression dictionary, which requires regenerating the compressed files whenever we update it.
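zstd's trained dictionaries aren't in the Python standard library, but zlib's preset-dictionary parameter (`zdict`) illustrates the same mechanism: a shared dictionary is precomputed once, stored separately from the compressed frames, and must be supplied byte-for-byte identically at decompression time — which is precisely the bookkeeping cost noted above. The dictionary and sample contents below are made-up stand-ins, not the real index format:

```python
import zlib

# A hand-built "dictionary" of byte sequences common to many index files.
# (zstd would train this automatically from samples; this is a stand-in.)
shared_dict = b"index-header|checkpoint=|account=GA|trailer"

sample = b"index-header|checkpoint=01234|account=GABCD|trailer"

def compress(data, zdict=None):
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()

plain = compress(sample)
with_dict = compress(sample, shared_dict)

# Decompression must use the *same* dictionary -- updating it means
# regenerating every file compressed against the old one.
d = zlib.decompressobj(zdict=shared_dict)
assert d.decompress(with_dict) + d.flush() == sample

print("without dictionary:", len(plain), "bytes")
print("with dictionary:   ", len(with_dict), "bytes")
```

The win grows with how representative the dictionary is of the data family; zstd's `--train` mode builds such a dictionary from a corpus of sample files automatically.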


Shaptic commented Aug 2, 2022

Training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary). Hence, deploying one dictionary per type of data will provide the greatest benefits.

Per the discussion thread re: dictionary churn, maybe we don't need to train it more than once (or at most occasionally). One training session on a block of history (or less?) should be representative of the "account activity" that indices capture.


As a separate idea, maybe we can fork and modify roaring bitmaps (or sroar) to add the "NextActive" functionality that we need. Alternatively, we could convert between our format and roaring for on-disk storage, though the conversion may add a non-trivial amount of request latency.
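Roaring bitmap libraries don't ship a primitive under the name "NextActive", so as a sketch of what the operation presumably means here: given a checkpoint number, return the next checkpoint at or after it where the account was active. A toy version over a plain sorted list (the real structure would answer this from compressed bitmap containers, but the contract is the same):

```python
import bisect

def next_active(active_checkpoints, n):
    """Return the smallest active checkpoint >= n, or None if exhausted.

    `active_checkpoints` must be sorted ascending -- a stand-in for the
    bitmap-backed activity index.
    """
    i = bisect.bisect_left(active_checkpoints, n)
    return active_checkpoints[i] if i < len(active_checkpoints) else None

activity = [3, 7, 64, 6400]
print(next_active(activity, 8))     # -> 64
print(next_active(activity, 7))     # -> 7 (already active at 7)
print(next_active(activity, 7000))  # -> None (no later activity)
```

This is the query a range scan over an account's history needs to skip empty checkpoints, which is why losing it in a stock roaring implementation would matter.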


bartekn commented Aug 3, 2022

zstd training is super interesting, and I really wonder how much it would improve the situation for us. But I think we should start with something big and then iterate in future versions of Horizon. I'm pretty sure SDF will be the only org hosting indexes for some time anyway. This discussion makes me realize how important it is for us to version the meta archives from the initial version.

@Shaptic Shaptic mentioned this issue Sep 1, 2022
@Shaptic Shaptic changed the title Horizon light: come up with better index and meta compression scheme Horizon Lite: Come up with better index and meta compression scheme Sep 1, 2022