Buffer compression writers #78
Conversation
This makes writing in small units such as lines or FASTQ records much faster.
Thanks! I had some time to look at this (but I did not re-do your measurements).

This is a bit strange because older xopen versions actually wrapped `GzipFile`s in `io.BufferedWriter`, and I remember it was quite important to have that for Python 2.7. I removed this in ce7af05 and 8fba6c6. I don't know anymore why I made the latter commit, but I tend to think it was because `repr(gzip.open("file.gz", "w"))` results in `"<gzip _io.BufferedWriter name='file.gz' 0x7f31d6c2ea60>"`, which led me to think that there's no need to wrap this in another `BufferedWriter` anymore. But that is wrong, as I see now: the `BufferedWriter` in that message refers to the "inner" `GzipFile.fileobj` with the raw compressed data.

Good that you caught this! Not merging right now because I have a comment.

What is your thought regarding the Python bug you opened? Maybe that's too easy, but shouldn't `gzip.open(..., "wb")` do exactly what you do here, that is, wrap `GzipFile` in an `io.BufferedWriter`? It would be quite symmetric to what regular `open("file", "wb")` does – that already returns an `io.BufferedWriter`.
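The asymmetry is easy to see; here is a quick sketch checked against the standard library's gzip module (file names are placeholders, and the exact repr output may vary by Python version):

```python
import gzip

# Plain binary open() hands back a BufferedWriter around the raw FileIO:
with open("file.bin", "wb") as f:
    print(type(f))  # <class '_io.BufferedWriter'>

# gzip.open() returns the GzipFile itself. The BufferedWriter that shows
# up in its repr() is the *inner* file object holding the compressed
# bytes, not an outer buffer in front of GzipFile.write():
with gzip.open("file.gz", "wb") as g:
    print(type(g))    # <class 'gzip.GzipFile'>
    print(g.fileobj)  # <_io.BufferedWriter name='file.gz'> (inner buffer)
```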
Yes, I considered that, but I didn't do it because it would create a massive backwards-compatibility problem.
Done. Thanks for the review!
A fix is now also available upstream: python/cpython#101251
Hi, I noticed some weird performance characteristics with single-threaded compressed writing while developing fastq filter.
These characteristics can easily be reproduced with the following script:
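The script itself is not preserved above; a minimal sketch of what it plausibly looked like (the record contents and counts are my assumptions, chosen to produce roughly 1.6 GB of uncompressed data):

```python
import sys
from xopen import xopen

# Benchmark: write many small FASTQ records to a gzip file.
# The thread count is taken from the command line: 0 selects the
# in-process writer, 1 a piped writer.
threads = int(sys.argv[1])

# One FASTQ record of ~310 bytes, mimicking an Illumina 150 bp read.
record = b"@read\n" + b"A" * 150 + b"\n+\n" + b"H" * 150 + b"\n"

with xopen("test.fastq.gz", mode="wb", threads=threads) as f:
    for _ in range(5_000_000):  # ~1.5 GB uncompressed in total
        f.write(record)
```

Running this under e.g. `/usr/bin/time` with `0` or `1` as the argument compares the two code paths.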
The script defaults to using IGzipFile when threads=0 and to a PipedPythonIsalWriter when threads=1.
Benchmarks:
No compression for comparison:
Using a piped python-isal writer:
User time plus system time is roughly 9.4 seconds, about 6.6 seconds more than writing to an uncompressed file. That is fair for a 1.6 GB file.
But now single-threaded:
That is a tremendous increase. It comes from `GzipFile.write()` being very expensive: each call not only compresses the data but also updates the stream length and the checksum, so there is a lot of per-call overhead in Python.
By wrapping the compressing file object in an `io.BufferedWriter`, we create a buffer, and `write` on the underlying `GzipFile` is only called once per 8 KB. That makes a huge difference:

Note that this time the single-threaded solution uses less CPU overall (but more wall-clock time). This is the expected behavior: multithreading generally has some overhead compared to a single thread.
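The essence of the change is just to wrap the compressing file object before handing it back. A minimal sketch of the idea (the function name is illustrative, not xopen's actual internals):

```python
import gzip
import io

def open_buffered_gzip_writer(path: str, compresslevel: int = 6) -> io.BufferedWriter:
    # Buffer small writes in memory so that GzipFile.write() is only
    # invoked once per io.DEFAULT_BUFFER_SIZE (8 KiB) chunk instead of
    # once per record.
    gzip_file = gzip.GzipFile(path, mode="wb", compresslevel=compresslevel)
    return io.BufferedWriter(gzip_file)

# Closing the BufferedWriter flushes the buffer and closes the GzipFile,
# which writes the gzip trailer (length and CRC32).
with open_buffered_gzip_writer("out.fastq.gz") as f:
    for _ in range(100_000):
        f.write(b"@read\nACGT\n+\nHHHH\n")  # many tiny writes, now cheap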
The change is not very invasive and should increase performance across a range of applications that write small records, most notably cutadapt, because FASTQ records are generally around 335 bytes with standard Illumina 150 bp reads.
EDIT: I noticed that on cutadapt 3.5, single-threaded xopen mode is impossible to achieve, since `xopen(..., threads=0)` can never be called with the current logic (it uses threads=1 when cores==1, and when cores==0 it uses all available cores, so xopen is called with threads=4). So cutadapt won't notice this change. I made a PR rectifying that. (Single-thread FTW!)