Experimental support for LZMA / XZ codec #127

Merged 3 commits on Jan 22, 2024

milesgranger (Owner) commented Jan 21, 2024

Part of #126

Adds experimental lzma / xz support under the experimental module. Configuration is limited for now: the compression preset is the only setting exposed.
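As a point of reference for what a preset controls, here is a minimal sketch using only Python's built-in `lzma` module. The payload is made up for illustration, and it is an assumption (not confirmed by the text of this PR) that the experimental cramjam function takes a comparable `preset` argument.

```python
import lzma

# Repetitive sample payload (made up for illustration).
payload = b"cramjam experimental lzma " * 2000

# Presets range from 0 (fastest) to 9 (strongest); the stdlib default is 6.
sizes = {}
for preset in (0, 6, 9):
    compressed = lzma.compress(payload, preset=preset)
    # Every preset yields a valid XZ stream that decodes to the same bytes.
    assert lzma.decompress(compressed) == payload
    sizes[preset] = len(compressed)

print(sizes)
```

The preset only trades speed against compressed size; the decoder does not need to be told which one was used.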

67f6902 will close #123 (hopefully) :)


With default settings, the output is byte-for-byte identical to that of Python's built-in lzma module:

In [1]: import lzma

In [2]: import cramjam

In [3]: compressed = lzma.compress(b'bytes')

In [4]: bytes(cramjam.experimental.lzma.compress(b'bytes'))
Out[4]: b'\xfd7zXZ\x00\x00\x04\xe6\xd6\xb4F\x02\x00!\x01\x16\x00\x00\x00t/\xe5\xa3\x01\x00\x04bytes\x00\x00\x00\x006\x93\x11\xb1PA\x11\xab\x00\x01\x1d\x05\xb8-\x80\xaf\x1f\xb6\xf3}\x01\x00\x00\x00\x00\x04YZ'

In [5]: compressed
Out[5]: b'\xfd7zXZ\x00\x00\x04\xe6\xd6\xb4F\x02\x00!\x01\x16\x00\x00\x00t/\xe5\xa3\x01\x00\x04bytes\x00\x00\x00\x006\x93\x11\xb1PA\x11\xab\x00\x01\x1d\x05\xb8-\x80\xaf\x1f\xb6\xf3}\x01\x00\x00\x00\x00\x04YZ'
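The comparison above can be turned into a small repeatable check. This is a sketch only: `matches_stdlib` is a hypothetical helper name, and the block is demonstrated with the stdlib compressor itself so that it runs without cramjam installed.

```python
import lzma

XZ_MAGIC = b"\xfd7zXZ\x00"  # six-byte header magic of the .xz container
XZ_FOOTER = b"YZ"           # two-byte footer magic


def matches_stdlib(compress_fn, payload: bytes) -> bool:
    """Return True if compress_fn(payload) equals lzma.compress(payload) byte for byte."""
    candidate = bytes(compress_fn(payload))
    reference = lzma.compress(payload)
    return candidate == reference


reference = lzma.compress(b"bytes")
# The same magic bytes are visible at the start and end of Out[4] / Out[5].
assert reference.startswith(XZ_MAGIC) and reference.endswith(XZ_FOOTER)
assert lzma.decompress(reference) == b"bytes"

# Trivially true for the stdlib itself; pass another compressor to compare it.
print(matches_stdlib(lzma.compress, b"bytes"))
```

To compare cramjam, one would pass `cramjam.experimental.lzma.compress` as `compress_fn`, mirroring `In [4]` above; the `bytes(...)` call inside the helper handles the buffer object it returns.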

TODO:

  • At least decode support for legacy LZMA format.
  • Multi-stream support
  • Expose more configuration settings?
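For context on the first two TODO items, the following sketch shows what "legacy LZMA format" and "multi stream" mean, using only Python's built-in `lzma` module. It illustrates the formats themselves and says nothing about how cramjam will implement them.

```python
import lzma

data_a, data_b = b"first stream ", b"second stream"

# Legacy .lzma container ("alone" format): no XZ magic, no integrity check.
legacy = lzma.compress(data_a, format=lzma.FORMAT_ALONE)
assert not legacy.startswith(b"\xfd7zXZ\x00")
# FORMAT_AUTO (the default for decompress) detects both .xz and legacy .lzma.
assert lzma.decompress(legacy) == data_a

# Multi-stream input: two independent XZ streams concatenated back to back.
multi = lzma.compress(data_a) + lzma.compress(data_b)
# lzma.decompress decodes every stream and concatenates the results.
assert lzma.decompress(multi) == data_a + data_b

# A single LZMADecompressor stops at the end of the first stream instead,
# leaving the remaining stream untouched in unused_data.
decomp = lzma.LZMADecompressor()
first = decomp.decompress(multi)
assert first == data_a and decomp.eof
leftover = decomp.unused_data
```

A decoder without multi-stream support behaves like the single `LZMADecompressor` case: it returns the first stream and the caller has to deal with the leftover bytes.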

milesgranger marked this pull request as draft January 21, 2024 13:07

lgray commented Jan 21, 2024

@milesgranger the present implementation works in our case, thanks! Strangely, hinting the output buffer size doesn't seem to bring any performance improvement over the Python standard library implementation. Perhaps the data I'm testing isn't large enough to see it. Maybe there's a more optimized lzma implementation out there in Rust land, but that can come in time.

Still - seems fit to task! Thanks for the snappy response!

Small update: there appears to be a ~10% improvement for our data, when testing with a larger file. Not huge, but I'll take it.

milesgranger marked this pull request as ready for review January 22, 2024 06:25
milesgranger merged commit 2d710c7 into master Jan 22, 2024 (65 checks passed)
milesgranger deleted the support-lzma branch January 22, 2024 06:28

milesgranger (Owner, Author) commented

Thanks for the feedback @lgray, I'd be happy to hear what you think of the follow-ups as time permits. Good that there was a bit of improvement, but I wasn't expecting anything amazing: I think both use the same liblzma under the hood.

You'll probably benefit from cramjam's de/compress_into functions if you can work out a way to re-use buffers in your use case.
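A rough sketch of the buffer re-use pattern being suggested. `StandInCodec` and `roundtrip_chunks` are hypothetical names; the stand-in is built on the stdlib so the block runs without cramjam, and it assumes that `compress_into` / `decompress_into` take `(input, output)` and return the number of bytes written. Because the stand-in copies internally, it shows the calling pattern only, not the performance benefit.

```python
import lzma


class StandInCodec:
    """Hypothetical stand-in for a codec exposing compress_into / decompress_into."""

    @staticmethod
    def _write(result: bytes, output) -> int:
        # Copy the result into the caller's buffer and report its length.
        if len(result) > len(output):
            raise ValueError("output buffer too small")
        memoryview(output)[: len(result)] = result
        return len(result)

    @classmethod
    def compress_into(cls, data, output) -> int:
        return cls._write(lzma.compress(bytes(data)), output)

    @classmethod
    def decompress_into(cls, data, output) -> int:
        return cls._write(lzma.decompress(bytes(data)), output)


def roundtrip_chunks(codec, chunks, buf_size=1 << 16):
    """Round-trip each chunk, allocating the two work buffers only once."""
    compressed_buf = bytearray(buf_size)
    plain_buf = bytearray(buf_size)
    results = []
    for chunk in chunks:
        n_comp = codec.compress_into(chunk, compressed_buf)
        # Slice a view rather than copying the compressed bytes out.
        view = memoryview(compressed_buf)[:n_comp]
        n_plain = codec.decompress_into(view, plain_buf)
        results.append(bytes(plain_buf[:n_plain]))
    return results


chunks = [b"a" * 1000, b"b" * 2000, b"bytes"]
assert roundtrip_chunks(StandInCodec, chunks) == chunks
```

The point of the pattern is that the two `bytearray` buffers are allocated once and reused for every chunk, instead of a fresh output object being created per call. Whether the cramjam module can be passed as `codec` unchanged depends on the assumed signature holding for it.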

lgray commented Jan 22, 2024

@milesgranger sure - @ me when you post follow ups. I can test them fairly quickly.

For (de)compress_into I'll have to tinker with the uproot library a bit more, but it's typically organized very sensibly so I should be able to use those methods.

Development

Successfully merging this pull request may close these issues.

Python test test_variants_different_dtypes[brotli] sometimes times out