Buffer compression writers #78
Conversation
This makes writing in small units such as lines or FASTQ records much faster.
Thanks! I had some time to look at this (but I did not re-do your measurements).

This is a bit strange because older xopen versions actually wrapped `GzipFile`s in `io.BufferedWriter`, and I remember it was quite important to have that for Python 2.7. I removed this in ce7af05 and 8fba6c6. I don't know anymore why I made the latter commit, but I tend to think it was because `repr(gzip.open("file.gz", "w"))` results in `"<gzip _io.BufferedWriter name='file.gz' 0x7f31d6c2ea60>"`, which led me to think that there's no need to wrap this in another `BufferedWriter` anymore. But that is wrong, as I see now: the `BufferedWriter` in that message refers to the "inner" `GzipFile.fileobj` with the raw compressed data.

Good that you caught this! Not merging right now because I have a comment.

What is your thought regarding the Python bug you opened? Maybe that's too easy, but shouldn't `gzip.open(..., "wb")` do exactly what you do here, that is, wrap `GzipFile` in an `io.BufferedWriter`? It would be quite symmetric to what regular `open("file", "wb")` does – that already returns an `io.BufferedWriter`.
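The asymmetry is easy to see; here is a quick sketch checked against the standard library's gzip module (file names are placeholders, and the exact repr output may vary by Python version):

```python
import gzip

# Plain binary open() hands back a BufferedWriter around the raw FileIO:
with open("file.bin", "wb") as f:
    print(type(f))  # <class '_io.BufferedWriter'>

# gzip.open() returns the GzipFile itself. The BufferedWriter that shows
# up in its repr() is the *inner* file object holding the compressed
# bytes, not an outer buffer in front of GzipFile.write():
with gzip.open("file.gz", "wb") as g:
    print(type(g))    # <class 'gzip.GzipFile'>
    print(g.fileobj)  # <_io.BufferedWriter name='file.gz'> (inner buffer)
```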
Yes, I considered that, but I didn't do it because it would create a massive backwards-compatibility problem.
Done. Thanks for the review!
A fix is now also available upstream: python/cpython#101251
Hi, I noticed some weird performance characteristics with single-threaded compressed writing while developing fastq filter.
These characteristics can easily be reproduced with the following script:
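The script itself is not preserved above; a minimal sketch of what it plausibly looked like (the record contents and counts are my assumptions, chosen to produce roughly 1.6 GB of uncompressed data):

```python
import sys
from xopen import xopen

# Benchmark: write many small FASTQ records to a gzip file.
# The thread count is taken from the command line: 0 selects the
# in-process writer, 1 a piped writer.
threads = int(sys.argv[1])

# One FASTQ record of ~310 bytes, mimicking an Illumina 150 bp read.
record = b"@read\n" + b"A" * 150 + b"\n+\n" + b"H" * 150 + b"\n"

with xopen("test.fastq.gz", mode="wb", threads=threads) as f:
    for _ in range(5_000_000):  # ~1.5 GB uncompressed in total
        f.write(record)
```

Running this under e.g. `/usr/bin/time` with `0` or `1` as the argument compares the two code paths.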
The script defaults to using IGzipFile when threads=0 and to a PipedPythonIsalWriter when threads=1.
Benchmarks:
No compression for comparison:
Using a piped python-isal writer:
User time plus system time is roughly 9.4 seconds, about 6.6 seconds more than writing to an uncompressed file. That is fair for a 1.6 GB file.
But now single-threaded:
That is a tremendous increase. It comes from `GzipFile.write()` being very expensive: each call not only compresses the data but also updates the stream length and the checksum, so there is a lot of per-call overhead in Python.
By wrapping the compressing file object in an `io.BufferedWriter`, we create a buffer, and `write` on the underlying `GzipFile` is only called once per 8 KB. That makes a huge difference:

Note that this time the single-threaded solution uses less CPU overall (but more wall-clock time). This is the expected behavior: multithreading generally has some overhead compared to a single thread.
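The essence of the change is just to wrap the compressing file object before handing it back. A minimal sketch of the idea (the function name is illustrative, not xopen's actual internals):

```python
import gzip
import io

def open_buffered_gzip_writer(path: str, compresslevel: int = 6) -> io.BufferedWriter:
    # Buffer small writes in memory so that GzipFile.write() is only
    # invoked once per io.DEFAULT_BUFFER_SIZE (8 KiB) chunk instead of
    # once per record.
    gzip_file = gzip.GzipFile(path, mode="wb", compresslevel=compresslevel)
    return io.BufferedWriter(gzip_file)

# Closing the BufferedWriter flushes the buffer and closes the GzipFile,
# which writes the gzip trailer (length and CRC32).
with open_buffered_gzip_writer("out.fastq.gz") as f:
    for _ in range(100_000):
        f.write(b"@read\nACGT\n+\nHHHH\n")  # many tiny writes, now cheap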
The change is not very invasive and should increase performance across a range of applications that write small records, most notably cutadapt, because FASTQ records are generally around 335 bytes with standard Illumina 150 bp reads.
EDIT: I noticed that on cutadapt 3.5, single-threaded xopen mode is impossible to achieve, since `xopen(..., threads=0)` can never be called with the current logic (it uses threads=1 when cores==1, and when cores==0 it uses all available cores, so xopen is called with threads=4). So cutadapt won't notice this change. I made a PR rectifying that. (Single-thread FTW!)