Add support for crabz. #117
Comments
So these are all formats using the DEFLATE algorithm. We already support three tools that can compress/decompress these: pigz, gzip and igzip. Of these igzip is probably the fastest. In fact, if igzip is not available we use It could be nice for compression though, especially at the higher levels. Personally I set all my tools to level 1 for speed and there is no beating igzip (custom assembly is faster than C or Rust). I think one of the downstream tools that uses xopen, cutadapt, uses compression level 5 by default now. In the tables I see that crabz seems to be much faster than pigz for this compression level, even at one thread. I see speedups of more than 2x reported. So this is interesting enough to warrant some independent verification. Ping @marcelm |
crabz is faster than pigz with zlib as crabz is using zlib-ng by default. It is possible to get pigz to use zlib-ng too (speed would be similar to crabz), but it is a bit more hassle to get it working as for now most distributions still use the old zlib library. igzip compression size is quite bad, so I wouldn't consider it an alternative for proper gzip compression. Another advantage of crabz is that it can do BGZF block compression (like bgzip/HTSlib) and Mgzip block compression, which use libdeflate instead of zlib/zlib-ng, which is a lot faster while still giving very good compression (especially compared with igzip). |
Well, it only supports levels 1, 2 and 3. But especially at level one I find that it compresses my data better than gzip. Igzip also has a special level 0 which is quite bad indeed. It is much faster than any other compression though. It is faster in compressing than zstd while providing a better compression ratio. Quite an interesting tool for compressing enormous amounts of data.
According to the specification of the gzip format, multiple gzip members may be concatenated together. It is a bit of a shortcut doing this instead of creating one big stream. But if it works it works. I wonder what this does for decompression speed though. That needs to be benchmarked as well. If it does not matter (probably not, as the gzip header and footer are only 18 bytes together), it is an interesting way to get fast, better compression using libdeflate. |
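To make the concatenated-members idea concrete, here is a minimal sketch using only Python's standard `gzip` module. The helper name `compress_in_blocks` and the block size are assumptions made for illustration; crabz and bgzip use libdeflate and their own block sizes, which this does not reproduce.

```python
import gzip


def compress_in_blocks(data: bytes, block_size: int, level: int = 5) -> bytes:
    """Compress each block as its own gzip member and concatenate the members."""
    members = [
        gzip.compress(data[start:start + block_size], compresslevel=level)
        for start in range(0, len(data), block_size)
    ]
    return b"".join(members)


payload = b"ACGTTGCA" * 10_000
blocked = compress_in_blocks(payload, block_size=16_384)
streamed = gzip.compress(payload, compresslevel=5)

# A standard gzip reader treats the concatenated members as one logical stream.
assert gzip.decompress(blocked) == payload

# Compare the two sizes; the blocked variant pays per-member overhead and
# cannot reuse matches across block boundaries.
print(len(streamed), len(blocked))
```

Because every member is independent, the blocks could in principle be compressed on separate threads, which is what makes this layout attractive despite the size penalty.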
I did some benchmarks evaluating the use of crabz. I also found that zlib-ng has its own gzip-ish implementation called minigzip
Decompression (file compressed with original gzip at level 1)
So it looks like minigzip and crabz are tied for compression. Both are massively faster than pigz, which is our fallback now. igzip is faster still, but does not support higher compression levels. Currently we fall back to pigz for the higher compression levels, so it is interesting to choose one of these instead. For decompression crabz is slower than minigzip, but I would say this is irrelevant, as igzip is already so much faster anyway and can decompress any gzip file regardless of compression level. So let's check cutadapt's default, level 5, for compression and see what happens:
No difference. Except that crabz does support threads. As such it would be more suited to xopen. Let's check libdeflate's performance using bgzip and crabz. First sizes:
Looks like at level 5, streaming compression is more space-efficient. So I did some benchmarks on the speed as well. At level 5 the speed is similar to normal streaming compression (bgzip being slightly slower). So I am not convinced yet. All right, maybe these blocks work better at level 1?
Nope, not faster than crabz, minigzip, igzip. How about the size?
Quite good compared to the default zlib implementation. So how do the others compare?
Wait what? Regular zlib-ng compression at level 1 results in 50% larger files than regular zlib level 1? No wonder it is so much faster. It is also completely unacceptable. @ghuls, did you see this? Also note that igzip, while being faster than all the alternatives, manages to compress better than zlib. Can you clarify what your following statement was based on?
Based on these results I would say that:
This seems interesting for xopen, as ISA-L works excellently only for compression at the lower levels. I think crabz can be a good implementation choice for higher compression levels as it is multithreaded and available in conda, but as it has a slightly different command line interface, it may take some work to get it going. |
When I use compression, I don't use level 1 but level 6 as it is more for permanent storage. igzip at level 3 (best possible compression) results in way bigger files than zlib(-ng) at 6. For multi GB files, the difference can be big in overall used filesize. BTW igzip is not always the fastest option for decompression. If your input is BGZF compressed like a lot of files are in bioinformatics, crabz can outperform igzip in decompression from 2 threads.
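The reason BGZF input can be decompressed with several threads is that every block is a complete gzip member whose size is recorded in its header. A minimal sketch of that idea in plain Python follows; the helper names are hypothetical, `make_bgzf_block` exists only to build test input, and the splitter assumes the `BC` subfield is the first (and only) extra subfield, as bgzip writes it. This is not how crabz is implemented.

```python
import gzip
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

BGZF_MAX_INPUT = 0xFF00  # uncompressed bytes per block, so BSIZE fits in 16 bits


def make_bgzf_block(data: bytes, level: int = 6) -> bytes:
    """Build one BGZF block: gzip member with a 'BC' extra field holding BSIZE."""
    deflater = zlib.compressobj(level, zlib.DEFLATED, -15)  # raw DEFLATE
    payload = deflater.compress(data) + deflater.flush()
    total_size = 18 + len(payload) + 8  # header + payload + CRC32/ISIZE
    header = struct.pack(
        "<4BI2BH2BHH",
        0x1F, 0x8B, 8, 4,      # magic, CM=deflate, FLG=FEXTRA
        0,                     # MTIME
        0, 0xFF,               # XFL, OS=unknown
        6,                     # XLEN
        ord("B"), ord("C"),    # subfield identifier
        2,                     # SLEN
        total_size - 1,        # BSIZE
    )
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + payload + trailer


def split_bgzf_blocks(buffer: bytes) -> list[bytes]:
    """Cut a BGZF buffer into blocks using BSIZE (assumed at offset 16)."""
    blocks = []
    pos = 0
    while pos < len(buffer):
        (bsize,) = struct.unpack_from("<H", buffer, pos + 16)
        blocks.append(buffer[pos:pos + bsize + 1])
        pos += bsize + 1
    return blocks


def decompress_parallel(buffer: bytes, threads: int) -> bytes:
    """Inflate every block on a thread pool and join the results in order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parts = pool.map(lambda block: zlib.decompress(block, wbits=31),
                         split_bgzf_blocks(buffer))
        return b"".join(parts)


data = b"ACGTTGCAGGCT" * 20_000
bgzf = b"".join(
    make_bgzf_block(data[i:i + BGZF_MAX_INPUT])
    for i in range(0, len(data), BGZF_MAX_INPUT)
)
assert decompress_parallel(bgzf, threads=2) == data
assert gzip.decompress(bgzf) == data  # still a valid multi-member gzip file
```

A plain single-stream gzip file has no such block index, so a decompressor has to walk it sequentially; that is why the thread count only helps for BGZF-style input.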
|
Interesting point. I really do appreciate the benchmark figures you posted. Maybe the optimal thing to do is to check for the bgzip header and choose the application according to the number of threads?

    if bgzip:
        if threads == 0:
            return igzip.open(my_bgzip, mode)
        elif threads == 1:
            return PipedIgzipReader(my_bgzip, mode)
        else:
            return PipedCrabzReader(my_bgzip, mode)

My preference is to always optimize for lowest compute time, which is also why I always prefer to run with threads=0. So I do not really have a stake in what happens at the higher thread counts. Still, I think that in xopen's use case optimizing for the least amount of compute time is the best thing to do, even when more threads are available. The goal is to unpack the gzip so a Python application can actually do something with it. With igzip decompressing at 500MB+ per second, decompression is rarely the bottleneck. The extra threads in crabz help, but only if the application is bottlenecked by the decompression speed. If the application is the actual bottleneck, the extra compute time of crabz will lead to a higher CPU temperature, which leads to a lower clock speed, which in turn affects wall clock time. On the other hand, if decompression is the bottleneck there will be a massive decrease in wall clock time when using crabz with more threads. It is not a clear-cut case, but it is very interesting to think about the optimal thing to do here.
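As a rough sketch of the "check for the bgzip header" step, the function below looks for the `BC` extra subfield that BGZF puts in the gzip header. The names `is_bgzf` and `choose_reader` and the returned labels are hypothetical; the labels merely stand in for the three readers in the snippet above, and nothing here is existing xopen code.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Check whether the first gzip member carries the BGZF 'BC' extra subfield."""
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False  # not gzip/deflate, or no FEXTRA flag
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


def choose_reader(header: bytes, threads: int) -> str:
    """Return a label for the reader to use (labels are placeholders)."""
    if not is_bgzf(header):
        return "default"  # assumption: keep the existing behaviour
    if threads == 0:
        return "igzip-in-process"
    if threads == 1:
        return "igzip-pipe"
    return "crabz-pipe"


# The 28-byte BGZF end-of-file marker block doubles as a tiny BGZF sample.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)
plain = gzip.compress(b"not bgzf")

print(choose_reader(BGZF_EOF, threads=4))
print(choose_reader(plain, threads=4))
```

Only the first few dozen bytes of the file are needed for the check, so it could be done by peeking at the stream before a reader is chosen.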
I think for on-the-fly compression of temporary files in a workflow, level 1 is the most appropriate, and that is where igzip really shines. I do agree though that it is inadequate for storage, and xopen would benefit from having faster applications available at the higher compression levels. Crabz is a really good fit for that purpose, so thanks for bringing it up! |
Hi, I thought I’d chime in as well. I always find these raw measurements hard to read, so I took the opportunity to clean up (very slightly) a Jupyter notebook that I made last year and make the repository public: Here’s just the plot I mean: Choice of compression tool is a multi-objective optimization problem. By plotting CPU time and compression ratio on two axes, I think this becomes obvious visually. Of course, just from the fact that I ran all tools on only one thread, it’s clear that the plot doesn’t give the full picture because there are more criteria that play a role, for example:
I would tend to agree with Ruben that a good default is to optimize for low CPU time. But I am also interested in improving xopen for gzip compression at higher compression levels. |
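The multi-objective view can also be expressed in code: a tool and level combination is only worth considering if no other combination is at least as good on both CPU time and size. Below is a minimal sketch of that Pareto filter. The `Measurement` type and the values are placeholders invented for illustration, not results from the notebook or from the benchmarks in this thread.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    label: str
    cpu_seconds: float
    compressed_size: int


def pareto_front(measurements: list[Measurement]) -> list[Measurement]:
    """Keep measurements that no other one beats on both CPU time and size."""
    front = []
    for m in measurements:
        dominated = any(
            other.cpu_seconds <= m.cpu_seconds
            and other.compressed_size <= m.compressed_size
            and (other.cpu_seconds < m.cpu_seconds
                 or other.compressed_size < m.compressed_size)
            for other in measurements
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m.cpu_seconds)


# Placeholder numbers only, chosen to show one dominated entry.
placeholder = [
    Measurement("tool-a level 1", 1.0, 400),
    Measurement("tool-b level 1", 2.0, 450),  # slower and larger than tool-a
    Measurement("tool-b level 6", 5.0, 300),
]
for m in pareto_front(placeholder):
    print(m.label)
```

Criteria such as thread support or availability in package managers are not captured by two axes and would still need to be weighed separately.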
I was just trying to include crabz but there is a major blocker: it does not support the |
Given that I finally got around to making a good python threaded solution in python-isal, I can port that to python-zlib-ng after a shake-out period. Then using crabz will always be less efficient than using python's own threads. Therefore I will close this issue and refer to #126 . |
Add support for crabz: https://github.com/sstadick/crabz as a pigz/bgzip alternative.