Faster decompression of gzip files #95534
Comments
PRs welcome!
Plus, would you like to provide a microbenchmark using pyperf?
Thank you for your enthusiasm! I will put it on my todo list right away. Sure, I can look and see if I can provide a microbenchmark. I have to say though: these changes really manifest when decompressing gzip files with bioinformatics data. Normal sizes for us are 10-20GB for Whole Exome Sequencing or RNA Sequencing, with ~100GB for Whole Genome Sequencing. So that is "real-world" gzip data for me (you can see why this topic has my interest). That is not really "micro" though. I will look around for a suitable real-world case. I am currently thinking of decompressing tar.gz files.
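For reference, a minimal pyperf microbenchmark for gzip decompression could look like the sketch below. The input file name and chunk size are placeholders I chose for illustration, not anything agreed on in this thread:

```python
import gzip
import io
import pyperf

def decompress(data):
    # Stream-decompress in chunks, mimicking how files are read in practice.
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
        while f.read(128 * 1024):
            pass

runner = pyperf.Runner()
with open("sample.fastq.gz", "rb") as f:  # placeholder input file
    data = f.read()
runner.bench_func("gzip_decompress", decompress, data)
```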
I made a PR. Microbenchmarks and results here for the interested:
So a 10% performance improvement. Given that most of the work is done in the zlib library, this is a substantial reduction of overhead costs.
Change summary:

+ There is now a `gzip.READ_BUFFER_SIZE` constant that is 128KB. Other programs that read in 128KB chunks: pigz and cat. So this seems best practice among good programs. It is also faster than 8KB chunks.
+ A `zlib._ZlibDecompressor` was added. This is the `_bz2.BZ2Decompressor` ported to zlib. Since the `zlib.Decompress` object is better for in-memory decompression, the `_ZlibDecompressor` is hidden. It only makes sense for file decompression, and that is already implemented in the gzip library, so there is no need to bother users with it.
+ The `ZlibDecompressor` uses the older CPython `arrange_output_buffer` functions, as those are faster and more appropriate for this use case.
+ `GzipFile.read` has been optimized. There is no longer an `unconsumed_tail` member to write back to the padded file; this is instead handled by the `ZlibDecompressor` itself, which has an internal buffer. `_add_read_data` has been inlined, as it was just two calls.

EDIT: While I am adding improvements anyway, I figured I could add another one-liner optimization to the `python -m gzip` application. It previously read chunks of `io.DEFAULT_BUFFER_SIZE`, but has been updated to use `READ_BUFFER_SIZE` chunks.
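Since `zlib._ZlibDecompressor` is private, the closest public reference for its interface is `bz2.BZ2Decompressor`, from which it was ported. A minimal runnable illustration of that public API (the payload here is made up for the example):

```python
import bz2

payload = bz2.compress(b"example record\n" * 10_000)
decomp = bz2.BZ2Decompressor()
out = bytearray()
while not decomp.eof:
    # Hand over input only when the internal input buffer is exhausted,
    # and cap each output chunk at 8KB via max_length.
    chunk = payload if decomp.needs_input else b""
    payload = b""
    out += decomp.decompress(chunk, max_length=8192)
assert bytes(out) == b"example record\n" * 10_000
```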
Thanks for doing this!
* main: (31 commits, including pythongh-95534: Improve gzip reading speed by 10% (python#97664) ...)
Pitch
Decompressing gzip streams is an extremely common operation. Most web browsers support gzip decompression, so most (virtually all) servers return gzip-compressed data when gzip support is advertised via the request headers. Tar.gz files are an extremely common way to archive files. Zip files internally use DEFLATE, the same compression method gzip uses.
Speeding this up by a non-trivial amount is therefore very advantageous.
Feature or enhancement
The current gzip reading pipeline can be improved quite a lot. This is the current way of doing things:

1. Read `io.DEFAULT_BUFFER_SIZE` of data from a `_PaddedFile` object.
2. Decompress the data with a `zlib.decompressobj()` using the `decompress(raw_data, size)` function.

This has some severe disadvantages when reading large blocks:
This also has some severe disadvantages when reading small blocks.
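For illustration, here is a condensed sketch of that current read path. The names follow `Lib/gzip.py`, but this is illustrative rather than the actual CPython code; error handling and the gzip header/trailer logic are omitted:

```python
import io
import zlib

def _read_current(fp, size):
    """Sketch of the existing read path; `fp` is the _PaddedFile object."""
    decompressor = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
    raw = fp.read(io.DEFAULT_BUFFER_SIZE)      # small 8KB read
    data = decompressor.decompress(raw, size)  # bounded decompression
    if decompressor.unconsumed_tail:
        # Leftover compressed bytes are written back to the file object,
        # only to be read again on the next call.
        fp.prepend(decompressor.unconsumed_tail)
    return data
```

Every `read()` call repeats this dance, so both the small reads and the write-back of `unconsumed_tail` add Python-level overhead.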
How to improve this:
+ Only read new data from the `_PaddedFile` object when the decompressor's `needs_input` attribute is True. This prevents querying the `_PaddedFile` object too much.
+ Let the decompressor keep unconsumed input in its own internal buffer instead of allocating new objects on every call. This prevents a lot of calls to the Python memory allocator.

A sketch of such a read loop follows.
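This sketch assumes a decompressor with the `BZ2Decompressor`-style interface shown earlier, with `fp` standing in for the `_PaddedFile` object; it illustrates the strategy, not the merged CPython code:

```python
READ_BUFFER_SIZE = 128 * 1024  # the 128KB chunk size discussed in this thread

def read_decompressed(decomp, fp, size):
    """Return up to `size` decompressed bytes, reading input lazily."""
    data = b""
    while not data and not decomp.eof:
        if decomp.needs_input:
            # Only touch the underlying file when the decompressor ran dry.
            buf = fp.read(READ_BUFFER_SIZE)
            if not buf:
                raise EOFError("Compressed stream ended before end-of-stream marker")
        else:
            buf = b""
        data = decomp.decompress(buf, size)
    return data
```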
This restructuring has already been implemented in python-isal. That project is a modification of zlibmodule.c that uses the ISA-L optimizations. While this did improve speed, I also looked at other ways to improve the performance. By restructuring the gzip module and the zlib code, the Python overhead was significantly reduced.
Relevant code:
Most of this code can be seamlessly copied back into CPython, which I will do when I have the time. I think this can best be done after the 3.11 release.
Previous discussion
N/A. This is a performance enhancement, so not necessarily a new feature, but also not a bug.