Add support for crabz. #117

Closed

ghuls opened this issue Jan 10, 2023 · 10 comments

@ghuls

ghuls commented Jan 10, 2023

Add support for crabz (https://github.com/sstadick/crabz) as a pigz/bgzip alternative.

Supported formats:

Gzip
Zlib
Mgzip block compression format with no block size limit
BGZF block compression format limited to 64 KB blocks
Raw Deflate
Snap

Benchmarks

These benchmarks use the data in bench-data catted together 100 times. Run with bash ./benchmark.sh data.txt.

Benchmark system specs: Ubuntu 20, AMD Ryzen 9 3950X 16-core processor, 64 GB DDR4 memory, 1 TB NVMe drive

pigz v2.4 installed via apt on Ubuntu

Takeaways:

crabz with zlib backend is pretty much identical to pigz
crabz with zlib-ng backend is roughly 30-50% faster than pigz
crabz with rust backend is roughly 5-10% faster than pigz

It is already known that zlib-ng is faster than zlib, so none of this is groundbreaking. However, I think crabz gets an edge due to the following:

crabz with the deflate_rust backend uses Rust-only code, which is in theory more secure/safe
crabz with zlib-ng is easier to install than pigz with a zlib-ng backend
crabz supports more formats than pigz
crabz is cross platform and can run on Windows

With regards to block formats like Mgzip and BGZF, crabz uses libdeflater by default, which excels at compressing and decompressing known-size blocks. This makes block compression formats very fast at a small loss in compression ratio.
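
(To illustrate the block idea with a minimal Python sketch, using the standard gzip module rather than libdeflate: each fixed-size chunk becomes an independent gzip member, so chunks can be compressed and decompressed on separate threads; the small ratio loss comes from resetting compressor state at each block boundary.)

import gzip

BLOCK_SIZE = 64 * 1024  # BGZF-style block size

def block_compress(data: bytes) -> bytes:
    # Compress each chunk as an independent gzip member; concatenated
    # members still form one valid gzip stream.
    chunks = (data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE))
    return b"".join(gzip.compress(chunk) for chunk in chunks)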

See end of benchmarks section for comparison against bgzip.

As crabz is just a wrapper for the gzp library, the most exciting thing about these benchmarks is that gzp, as a library, is on par with best-in-class CLI tools for multi-threaded compression and decompression.

@ghuls ghuls changed the title Add support for crabz: Add support for crabz. Jan 10, 2023
@rhpvorderman
Collaborator

So these are all formats using the DEFLATE algorithm. We already support three tools that can compress/decompress these: pigz, gzip and igzip. Of these, igzip is probably the fastest. In fact, if igzip is not available we use python -m isal.igzip to decompress. I suggest you also test that to see if crabz is really faster than python -m isal.igzip -cd big.gz > /dev/null. That also runs on Windows.
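
For reference, python -m isal.igzip is backed by python-isal's drop-in replacement for the gzip module; a minimal sketch of using it directly:

from isal import igzip

# igzip.open mirrors gzip.open but decompresses with ISA-L.
with igzip.open("big.gz", "rb") as f:
    for line in f:
        pass  # process each decompressed line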

It could be nice for compression though, especially at the higher levels. Personally I set all my tools to level 1 for speed, and there is no beating igzip there (custom assembly is faster than C or Rust). I think one of the downstream tools that uses xopen, cutadapt, uses compression level 5 by default now. In the tables I see that crabz seems to be much faster than pigz at this compression level, even with one thread. I see speedups of more than 2x reported. So this is interesting for some independent verification. Ping @marcelm

@ghuls
Author

ghuls commented Jan 17, 2023

crabz is faster than pigz with zlib because crabz uses zlib-ng by default. It is possible to get pigz to use zlib-ng too (speed would be similar to crabz), but it is a bit more hassle to get working, as most distributions still ship the old zlib library.
In general zlib-ng is 2x faster than zlib.

igzip compression size is quite bad, so I wouldn't consider it an alternative for proper gzip compression.
For decompression, it is indeed the best tool for the job.

Another advantage of crabz is that it can do BGZF block compression (like bgzip/HTSlib) and Mgzip block compression. Those formats use libdeflate instead of zlib/zlib-ng, which is a lot faster while still giving very good compression (especially compared with igzip).

@rhpvorderman
Collaborator

igzip compression size is quite bad, so I wouldn't consider it an alternative for proper gzip compression.

Well, it only supports levels 1, 2 and 3. But especially at level 1 I find that it compresses my data better than gzip. igzip also has a special level 0 which is quite bad indeed. It is much faster than any other compression though; it compresses faster than zstd while providing a better compression ratio. Quite an interesting tool for compressing enormous amounts of data.

Another advantage of crabz is that it can do BGZF block compression (like bgzip/HTSlib) and Mgzip block compression. Those formats use libdeflate instead of zlib/zlib-ng, which is a lot faster while still giving very good compression (especially compared with igzip)

Due to the gzip format specification, multiple gzip members may be concatenated together. It is a bit of a shortcut to do this instead of creating one big stream, but if it works, it works. I wonder what this does for decompression speed though; that needs to be benchmarked as well. If it does not matter (probably not: a gzip header and footer together are only 18 bytes), it is an interesting way to get fast, better compression using libdeflate.
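
The concatenation property is easy to see with Python's standard gzip module (a toy example, not how crabz implements it):

import gzip

# Two independent gzip members, concatenated back to back, still form
# one valid gzip stream; decompress() reads through all members.
two_members = gzip.compress(b"hello ") + gzip.compress(b"world")
assert gzip.decompress(two_members) == b"hello world"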

@rhpvorderman
Collaborator

I did some benchmarks evaluating the use of crabz. I also found that zlib-ng has its own gzip-like implementation called minigzip.
Compression

pigz -p 1 -c -k -1 ~/test/5millionreads_R1.fastq > /dev/null
  Time (mean ± σ):     18.588 s ±  0.128 s    [User: 18.386 s, System: 0.198 s]
  Range (min … max):   18.497 s … 18.923 s    10 runs
 
./minigzip -c -k -1 ~/test/5millionreads_R1.fastq > /dev/null
  Time (mean ± σ):      5.549 s ±  0.120 s    [User: 5.332 s, System: 0.217 s]
  Range (min … max):    5.387 s …  5.799 s    10 runs

Benchmark 1: crabz -p 1 -o -  -l 1 ~/test/5millionreads_R1.fastq > /dev/null
  Time (mean ± σ):      5.484 s ±  0.044 s    [User: 5.293 s, System: 0.191 s]
  Range (min … max):    5.418 s …  5.591 s    10 runs

Benchmark 1: igzip -c -k -1 ~/test/5millionreads_R1.fastq > /dev/null
  Time (mean ± σ):      3.125 s ±  0.104 s    [User: 2.939 s, System: 0.186 s]
  Range (min … max):    3.043 s …  3.311 s    10 runs

Decompression (file compressed with original gzip at level 1)

Benchmark 1: pigz -p 1 -c -k -d ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      6.113 s ±  0.024 s    [User: 6.032 s, System: 0.080 s]
  Range (min … max):    6.052 s …  6.138 s    10 runs

./minigzip -c -k -d ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      3.454 s ±  0.097 s    [User: 3.376 s, System: 0.078 s]
  Range (min … max):    3.395 s …  3.683 s    10 runs

Benchmark 1: crabz -p 1 -o -  -d ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      4.312 s ±  0.032 s    [User: 4.245 s, System: 0.067 s]
  Range (min … max):    4.266 s …  4.368 s    10 runs
 
Benchmark 1: igzip -c -k -d ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      2.102 s ±  0.028 s    [User: 2.060 s, System: 0.041 s]
  Range (min … max):    2.059 s …  2.137 s    10 runs

So it looks like minigzip and crabz are tied for compression. Both are massively faster than pigz, which is our fallback now. igzip is faster still, but does not support higher compression levels. Currently we fall back to pigz for the higher compression levels, so it is interesting to choose one of these.

For decompression crabz is slower than minigzip, but I would say this is irrelevant, as igzip is already so much faster anyway and can decompress any gzip file regardless of compression level. So let's check cutadapt's default, level 5, for compression and see what happens:

Benchmark 1: crabz -p 1 -o -  -l 5 ~/test/5millionreads_R1.fastq > /dev/null
  Time (mean ± σ):     25.310 s ±  0.102 s    [User: 25.104 s, System: 0.200 s]
  Range (min … max):   25.094 s … 25.485 s    10 runs

Benchmark 1: ./minigzip -c -k -5 ~/test/5millionreads_R1.fastq > /dev/null
  Time (mean ± σ):     25.205 s ±  0.283 s    [User: 24.978 s, System: 0.226 s]
  Range (min … max):   24.574 s … 25.590 s    10 runs

No difference, except that crabz does support threads. As such it would be better suited to xopen.
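
To sketch what driving crabz from xopen could look like (a hypothetical helper, not xopen's actual piped-writer classes, and assuming crabz compresses stdin to stdout):

import subprocess

def open_crabz_writer(path, level=5, threads=4):
    # Pipe uncompressed bytes into a crabz subprocess whose stdout is
    # redirected to the target file; the caller writes to the returned
    # pipe and closes it when done.
    sink = open(path, "wb")
    return subprocess.Popen(
        ["crabz", "-l", str(level), "-p", str(threads), "-o", "-"],
        stdin=subprocess.PIPE,
        stdout=sink,
    ).stdin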

Let's check libdeflate's performance using bgzip and crabz. First sizes:

bgzip -c ~/test/5millionreads_R1.fastq -l 5 | wc -c
365239586
crabz -o - ~/test/5millionreads_R1.fastq -l 5 -f bgzf | wc -c
[2023-01-20T14:25:25Z INFO  crabz] Compressing (bgzf) with 8 threads at compression level 5.
375096403
crabz -o - ~/test/5millionreads_R1.fastq -l 5 | wc -c
[2023-01-20T14:26:48Z INFO  crabz] Compressing (gzip) with 8 threads at compression level 5.
359397693
./minigzip -c -5 ~/test/5millionreads_R1.fastq| wc -c
359163831

Looks like at level 5, streaming compression is more space-efficient. So I did some benchmarks on the speed as well: at level 5, block compression is similar in speed to normal streaming compression (bgzip being slightly slower). So I am not convinced yet.

Alright, maybe these block formats work better at level 1?

Benchmark 1: bgzip -c -l 1  ~/test/5millionreads_R1.fastq > /dev/null
  Time (mean ± σ):      8.081 s ±  0.086 s    [User: 7.871 s, System: 0.210 s]
  Range (min … max):    7.879 s …  8.169 s    10 runs

Nope, not faster than crabz, minigzip, or igzip. How about the size?

bgzip -c -l 1  ~/test/5millionreads_R1.fastq | wc -c
396056594
pigz -c -1 -p 1 ~/test/5millionreads_R1.fastq | wc -c
430814135

Quite good compared to the default zlib implementation. So how do the others compare?

crabz -o - ~/test/5millionreads_R1.fastq -l 1 --format bgzf -p 1| wc -c
397161441
crabz -o - ~/test/5millionreads_R1.fastq -l 1 --format gzip -p 1| wc -c
630623213
igzip -c -1  ~/test/5millionreads_R1.fastq | wc -c
405313321
./minigzip -c -1  ~/test/5millionreads_R1.fastq | wc -c
630672121

Wait, what? Regular zlib-ng compression at level 1 results in 50% larger files than regular zlib at level 1? No wonder it is so much faster. It is also completely unacceptable. @ghuls, did you see this? Also note that igzip, while being faster than all the alternatives, manages better-than-zlib compression. Can you clarify what your following statement was based on?

igzip compression size is quite bad, so I wouldn't consider it an alternative for proper gzip compression.

Based on these results I would say that:

  • zlib-ng works well only at the higher compression levels.
  • Block-gzips do not provide additional benefit.

This seems interesting for xopen, as ISA-L excels only at the lower compression levels. I think crabz can be a good implementation choice for the higher compression levels, as it is multithreaded and available in conda, but since it has a slightly different command line interface, it may take some work to get it going.

@ghuls
Author

ghuls commented Jan 20, 2023

Wait, what? Regular zlib-ng compression at level 1 results in 50% larger files than regular zlib at level 1? No wonder it is so much faster. It is also completely unacceptable. @ghuls, did you see this? Also note that igzip, while being faster than all the alternatives, manages better-than-zlib compression. Can you clarify what your following statement was based on?

igzip compression size is quite bad, so I wouldn't consider it an alternative for proper gzip compression.

Based on these results I would say that:

When I use compression, I don't use level 1 but level 6, as it is more for permanent storage. igzip at level 3 (its best possible compression) results in way bigger files than zlib(-ng) at 6. For multi-GB files, the difference in total file size can be big.

BTW, igzip is not always the fastest option for decompression. If your input is BGZF-compressed, as a lot of files in bioinformatics are, crabz can outperform igzip in decompression from 2 threads on.

❯  hyperfine -m 3 -- 'igzip -c -k -d test_orig.fastq.gz > /dev/null'
Benchmark 1: igzip -c -k -d test_orig.fastq.gz > /dev/null
  Time (mean ± σ):     24.112 s ±  0.572 s    [User: 23.107 s, System: 0.914 s]
  Range (min … max):   23.715 s … 24.768 s    3 runs
 
❯  hyperfine -m 3 -- 'minigzip -c -k -d test_orig.fastq.gz > /dev/null'
Benchmark 1: minigzip -c -k -d test_orig.fastq.gz > /dev/null
  Time (mean ± σ):     55.718 s ±  0.085 s    [User: 54.939 s, System: 0.669 s]
  Range (min … max):   55.667 s … 55.816 s    3 runs

❯  hyperfine -m 3 -- 'crabz -p 1 -o - -d test_orig.fastq.gz > /dev/null'
Benchmark 1: crabz -p 1 -o - -d test_orig.fastq.gz > /dev/null
  Time (mean ± σ):     43.499 s ±  0.082 s    [User: 42.904 s, System: 0.509 s]
  Range (min … max):   43.406 s … 43.558 s    3 runs
 
❯  hyperfine -m 3 -- 'crabz -p 4 -o - -d test_orig.fastq.gz > /dev/null'
Benchmark 1: crabz -p 4 -o - -d test_orig.fastq.gz > /dev/null
  Time (mean ± σ):     43.436 s ±  0.045 s    [User: 42.807 s, System: 0.542 s]
  Range (min … max):   43.402 s … 43.488 s    3 runs

# Parallel decompression of each bgzf block (FASTQ file is bgzip compressed).
❯  hyperfine -m 3 -- 'crabz -p 1 -o - -d -f bgzf test_orig.fastq.gz > /dev/null'
Benchmark 1: crabz -p 1 -o - -d -f bgzf test_orig.fastq.gz > /dev/null
  Time (mean ± σ):     39.034 s ±  0.039 s    [User: 44.736 s, System: 4.198 s]
  Range (min … max):   39.005 s … 39.078 s    3 runs

❯  hyperfine -m 3 -- 'crabz -p 2 -o - -d -f bgzf test_orig.fastq.gz > /dev/null'
Benchmark 1: crabz -p 2 -o - -d -f bgzf test_orig.fastq.gz > /dev/null
  Time (mean ± σ):     19.590 s ±  0.026 s    [User: 44.263 s, System: 2.993 s]
  Range (min … max):   19.562 s … 19.615 s    3 runs

❯  hyperfine -m 3 -- 'crabz -p 4 -o - -d -f bgzf test_orig.fastq.gz > /dev/null'
Benchmark 1: crabz -p 4 -o - -d -f bgzf test_orig.fastq.gz > /dev/null
  Time (mean ± σ):      9.912 s ±  0.042 s    [User: 44.898 s, System: 2.741 s]
  Range (min … max):    9.874 s …  9.957 s    3 runs

❯  hyperfine -m 3 -- 'crabz -p 16 -o - -d -f bgzf test_orig.fastq.gz > /dev/null'
Benchmark 1: crabz -p 16 -o - -d -f bgzf test_orig.fastq.gz > /dev/null
  Time (mean ± σ):      2.446 s ±  0.035 s    [User: 40.644 s, System: 1.024 s]
  Range (min … max):    2.405 s …  2.469 s    3 runs
 
❯  hyperfine -m 3 -- 'crabz -p 32 -o - -d -f bgzf test_orig.fastq.gz > /dev/null'
Benchmark 1: crabz -p 32 -o - -d -f bgzf test_orig.fastq.gz > /dev/null
  Time (mean ± σ):      2.488 s ±  0.502 s    [User: 52.258 s, System: 1.724 s]
  Range (min … max):    1.919 s …  2.871 s    3 runs

@rhpvorderman
Collaborator

rhpvorderman commented Jan 21, 2023

BTW igzip is not always the fastest option for decompression. If your input is BGZF compressed like a lot of files are in bioinformatics, crabz can outperform igzip in decompression from 2 threads.

Interesting point. I really do appreciate the benchmark figures you posted. Maybe the optimal thing to do is to check for the bgzip header and choose the application according to the number of threads?

if bgzip:
    if threads == 0:
        return igzip.open(my_bgzip, mode)
    elif threads == 1:
        return PipedIgzipReader(my_bgzip, mode)
    else:
        return PipedCrabzReader(my_bgzip, mode)
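
A minimal sketch of such a header check (a hypothetical is_bgzf helper, not part of xopen; per the SAM specification a BGZF block is a gzip member with FEXTRA set and a 'BC' extra subfield, which bgzip writes as the first subfield):

import struct

def is_bgzf(path):
    # The fixed gzip header is 10 bytes, XLEN is 2 bytes, then the
    # first extra subfield starts with SI1, SI2 and SLEN.
    with open(path, "rb") as f:
        header = f.read(16)
    return (
        len(header) == 16
        and header[:4] == b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA
        and header[12:14] == b"BC"  # BGZF subfield identifier
        and struct.unpack("<H", header[14:16])[0] == 2  # SLEN == 2
    )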

My preference is to always optimize for the lowest compute time, and therefore I also always prefer to run with threads=0, so I do not really have a stake in what happens at the higher thread counts. Still, I think in xopen's use case optimizing for the least amount of compute time is the right thing to do even when more threads are available. The goal is to unpack the gzip so a Python application can actually do something with it. With igzip decompressing at 500+ MB per second, it is rarely the bottleneck; the extra threads in crabz help only if the application is bottlenecked by decompression speed. If the application is the actual bottleneck, the extra compute time of crabz will lead to a higher CPU temperature, which leads to a lower clock speed, also affecting wall-clock time. On the other hand, if decompression is the bottleneck, there will be a massive decrease in wall-clock time when using crabz with more threads.

It is not a clear-cut case but it is very interesting to think about the optimal thing to do here.

When I use compression, I don't use level 1 but level 6 as it is more for permanent storage. igzip at level 3 (best possible compression) results in way bigger files than zlib(-ng) at 6. For multi GB files, the difference can be big in overall used filesize.

I think for on-the-fly compression of temporary files in a workflow, level 1 is the most appropriate, and that is where igzip really shines. I do agree, though, that it is inadequate for storage and that xopen would benefit from having faster applications available at the higher compression levels. crabz is a really good fit for that purpose, so thanks for bringing it up!

@marcelm
Collaborator

marcelm commented Feb 1, 2023

Hi, I thought I’d chime in as well. I always find these raw measurements hard to read, so I took the opportunity to clean up (very slightly) a Jupyter notebook that I made last year and make the repository public:
https://github.com/marcelm/compression/blob/main/compression.ipynb
I added a new plot at the bottom that 1) tests crabz (but only in single-thread mode) and 2) uses the crabz benchmark data (100x shakespeare.txt).

Here’s just the plot I mean:
[plot: CPU time vs. compression ratio for each tool, including crabz]
(Sorry this is hard to read. You can go to the notebook to see the raw data.)

Choice of compression tool is a multi-objective optimization problem. By plotting CPU time and compression ratio on two axes, I think this becomes obvious visually. Of course, just from the fact that I ran all tools on only one thread, it’s clear that the plot doesn’t give the full picture because there are more criteria that play a role, for example:

  • Compression speed in terms of wall-clock time (that is, threading support)
  • Decompression speed
  • Format maturity, tool availability etc.

I would tend to agree with Ruben that a good default is to optimize for low CPU time. But I am also interested in improving xopen for gzip compression at higher compression levels.

@rhpvorderman
Collaborator

I was just trying to include crabz, but there is a major blocker: it does not support a --no-name flag to omit the timestamp and filename from the gzip header. This is required for reproducible output. We can't regress on that feature.
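
For context, this is what --no-name buys: with Python's gzip module, fixing mtime to 0 and leaving the filename out of the header (which happens automatically when writing to a file object) makes the output byte-identical across runs. A minimal sketch:

import gzip
import io

def gzip_reproducible(data: bytes) -> bytes:
    # mtime=0 pins the header timestamp; a BytesIO target has no name,
    # so no filename field is written either.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as f:
        f.write(data)
    return buf.getvalue()

assert gzip_reproducible(b"spam") == gzip_reproducible(b"spam")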

@rhpvorderman
Collaborator

sstadick/crabz#21

@rhpvorderman
Collaborator

Given that I finally got around to making a good threaded Python solution in python-isal, I can port that to python-zlib-ng after a shake-out period. Then using crabz will always be less efficient than using Python's own threads. Therefore I will close this issue and refer to #126.
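
For the curious, a minimal sketch of what that threaded solution looks like from the caller's side, assuming python-isal's igzip_threaded module (the exact signature may differ between versions):

from isal import igzip_threaded

# Decompression runs in a background thread, leaving the main thread
# free for application work.
with igzip_threaded.open("big.gz", "rb", threads=1) as f:
    data = f.read()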
