Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add output_len and _into to snappy raw #33

Closed
martindurant opened this issue Feb 19, 2021 · 4 comments
Closed

Add output_len and _into to snappy raw #33

martindurant opened this issue Feb 19, 2021 · 4 comments

Comments

@martindurant
Copy link

Snappy framed compress allows output_len and has the _into variants, but snappy_raw does not. Would be nice!

btw: am I right in thinking

  • snappy.compress->cramjam.snappy.compress_raw
  • snappy.StreamCompressor->cramjam.snappy.compress ?
    (snappy.stream_compress uses StreamCompressor, but works on file-likes)
@milesgranger
Copy link
Owner

It seems like we can do that; and this is where the correct use of decompress_len as mentioned in #35 belongs. When we add this functionality we will have the following scenario:

  1. User provided bytearray (with or without output_len, as we can also use decompress_len for the estimate):
    • We can de/compress then resize the resulting bytearray to the actual size if needed. Super!
  2. User provided bytes
    • If they also provided output_len we're good for de/compression.
    • It appears decompress_len gives an exact answer; so long as that is successful we can do the single allocation for decompression
    • max_compress_len gives the max compressed size, in this case, they would likely get trailing null bytes back

and the _into addition for raw obviously doesn't matter for these points, so all good there.

For the second part, cramjam tries to follow the layout from the Rust crate it uses for snappy. For example snappy.de/compress_raw uses the de/encoders from the snap::raw module.

@martindurant
Copy link
Author

I'll just give a 👍 in here - snappy-raw is the most important one to get fast from parquet's point of view. I'm happy about the naming convention.

@milesgranger
Copy link
Owner

I don't know if we can get de/compress_raw functions to support output_len, as the raw de/compression functions there only output a new buffer, unlike the others which can take any writeable object. I can dig into the src later and see if it would be possible even, but suspect the one who wrote it has a good reason as the other portions of the crate do implement reader/writer parameters for framed de/compression.

The PR referenced here does implement the de/compress_raw_into, so I hope that is good for you in the mean time.

Would also point out that, while I don't know what data sizes you're working with, in the benchmarks the current de/compress_raw variants are extremely close with python-snappy and even edge it out in a couple of cases.

@milesgranger
Copy link
Owner

de/compress_raw_into now follows the same API as other variants from #45 , and de/compress_raw supports output_len as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants