[WIP] [RFC] add cryptographic hash to seekable format #2737

Closed
wants to merge 1 commit into from


@tomberek tomberek commented Jul 31, 2021

Very much a WIP. This is my first time working on zstd, and I'm looking for feedback and thoughts.

My intention is explicitly to trade some compression ratio for the ability to use the seekable format and content-address each frame's uncompressed content. Perhaps a --rsyncable-frames mode? Adding a cryptographic checksum to the Seek Table would produce a format very similar to zchunk/casync/zsync that is still a valid .zst. This optimized format (filling a niche similar to the one served by Cloud Optimized GeoTIFF) could be hosted by any server that supports HTTP Range requests, and clients could then easily perform dedup and binary diffs.
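
A minimal sketch of the client-side dedup step, assuming a seek table whose entries carry a digest of each frame's uncompressed content. `HashedSeekEntry`, `ByteRange`, `plan_fetch` and the 32-byte digest width are hypothetical names and choices for illustration, not part of the zstd seekable API. It also assumes frames are laid out back to back, so a frame's offset is the sum of the compressed sizes before it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 32  /* assumption: a 256-bit digest (e.g. SHA-256 or BLAKE3) */

/* Hypothetical seek-table entry: the two sizes, plus the proposed
 * digest of the frame's uncompressed content. */
typedef struct {
    uint32_t compressed_size;
    uint32_t decompressed_size;
    uint8_t  digest[DIGEST_LEN];
} HashedSeekEntry;

/* A compressed byte range to request, e.g. via an HTTP Range header. */
typedef struct {
    uint64_t offset;
    uint64_t length;
} ByteRange;

/* Returns 1 if some local frame already has this digest, else 0. */
static int have_locally(const HashedSeekEntry *local, size_t n_local,
                        const uint8_t *digest)
{
    for (size_t i = 0; i < n_local; i++) {
        if (memcmp(local[i].digest, digest, DIGEST_LEN) == 0)
            return 1;
    }
    return 0;
}

/* Walks the remote seek table and records the compressed byte range of
 * every frame whose content is not already present locally.
 * Returns the number of ranges written to `out` (at most `max_out`). */
size_t plan_fetch(const HashedSeekEntry *remote, size_t n_remote,
                  const HashedSeekEntry *local, size_t n_local,
                  ByteRange *out, size_t max_out)
{
    uint64_t offset = 0;  /* running offset of the current remote frame */
    size_t n_out = 0;

    assert(out != NULL || max_out == 0);
    for (size_t i = 0; i < n_remote; i++) {
        if (n_out < max_out &&
            !have_locally(local, n_local, remote[i].digest)) {
            out[n_out].offset = offset;
            out[n_out].length = remote[i].compressed_size;
            n_out++;
        }
        offset += remote[i].compressed_size;
    }
    return n_out;
}
```

The linear lookup in `have_locally` keeps the sketch short; a real client would more likely index local digests in a hash table and merge adjacent ranges before issuing requests.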

  • update implementation in zstdseek_compress.c (status: PoC functioning)
  • update implementation in zstdseek_decompress.c (status: nothing yet)
  • update examples/seekable_compression.c (status: framesize-based PoC functioning)
  • provide metrics (size, including the new index, seems to increase by roughly 10%)

Alternatively, a "Checksum Type" field could be made out of a bit or two.
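
One way the "Checksum Type" idea might look, as a sketch only. It assumes the one-byte Seek Table descriptor keeps its checksum flag in bit 7 (as I understand the current seekable format description) and borrows bits 6-5 for a type code. The bit positions, the `ChecksumType` values, and the 32-byte widths are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CHECKSUM_FLAG        0x80u  /* bit 7: entries carry a checksum */
#define CHECKSUM_TYPE_SHIFT  5      /* hypothetical: type lives in bits 6-5 */
#define CHECKSUM_TYPE_MASK   0x60u

typedef enum {
    CKSUM_XXH64_LOW32 = 0,  /* today's 4-byte checksum */
    CKSUM_SHA256      = 1,  /* hypothetical: 32-byte digest */
    CKSUM_BLAKE3      = 2   /* hypothetical: 32-byte digest */
} ChecksumType;

/* Builds a descriptor byte with the checksum flag set and the given type. */
uint8_t descriptor_pack(ChecksumType type)
{
    assert((unsigned)type <= 3u);  /* must fit in two bits */
    return (uint8_t)(CHECKSUM_FLAG | ((unsigned)type << CHECKSUM_TYPE_SHIFT));
}

/* Extracts the type code from a descriptor byte. */
ChecksumType descriptor_type(uint8_t descriptor)
{
    return (ChecksumType)((descriptor & CHECKSUM_TYPE_MASK) >> CHECKSUM_TYPE_SHIFT);
}

/* Checksum width in bytes for a type code; 0 for an unknown code. */
unsigned checksum_width(ChecksumType type)
{
    switch (type) {
    case CKSUM_XXH64_LOW32: return 4;
    case CKSUM_SHA256:      return 32;
    case CKSUM_BLAKE3:      return 32;
    default:                return 0;
    }
}

/* Bytes per seek-table entry: two 4-byte sizes plus the optional checksum. */
unsigned entry_size(uint8_t descriptor)
{
    unsigned size = 8;
    if (descriptor & CHECKSUM_FLAG)
        size += checksum_width(descriptor_type(descriptor));
    return size;
}
```

With this layout, a descriptor written with those two bits set to zero decodes as type 0, so existing checksummed tables would keep their 12-byte entries.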

@Cyan4973
Contributor

Cyan4973 commented Mar 4, 2024

The idea is interesting but likely needs to be fleshed out.
Closing, due to lack of activity.

@Cyan4973 Cyan4973 closed this Mar 4, 2024
@tomberek
Author

Still interested, but need assistance.

@silvanshade

I'm also interested in this for chonker.

In that library I define a format which is essentially content-defined chunked slices of bytes, compressed with zstd and hashed with BLAKE3.

It is similar to bita but about 4x as fast (due to use of rayon/zstd/BLAKE3/rkyv instead of bita's brotli/BLAKE2/protobuf).

I would consider using BLAKE3 over SHA256 or SHA512, since it can be significantly faster even in single-threaded form (and even when SHA256 and SHA512 are hardware accelerated). On Zen4 I've seen around a 3-4x speedup, depending on input size. Multi-threaded BLAKE3 can scale almost linearly, up to memory bandwidth.

Having a hashed seekable format at the zstd level would make this approach much nicer, and open up more use cases.
