parallelization of bgzf compression #17
hi,

I used noodles to modify and filter BAM files, with the primary focus of learning Rust. Is it possible to parallelize the compression part when writing to a BAM file?

Great package :)

Kristian

Comments
Hah! I was just thinking about this as well. Unfortunately I think it would be rather complicated to implement: you'd want to parallelize the compression step but preserve the original ordering when writing out the records. It would be super cool though!
Ha. Given my current Rust skills I'm not of much help, but some of the multi-threading in htslib can be found here: https://github.com/samtools/htslib/blob/515f6df8f7f7dab6c80d0e7aede6e60826ef5374/bgzf.c#L1722, I believe. Don't know if that is of any help. This is relevant for the BAM/BCF formats (I think), as the threads are only used for compression.
Note: I'm not @zaeleus, so I can't speak for their plans, but I just started implementing some things using the library, and parallel writes were on my mind. I'm a little confused about whether the htslib implementation preserves read order. It looks like they have a thread pool to do the compression and they push the results to a queue, and it seems like that might reorder the reads slightly. Maybe that's accepted behavior; I couldn't find any documentation on this.

I was trying to find a way to preserve ordering, but it's a tricky problem, e.g., see this Rayon issue, which has been open for quite a while. I think the Rayon folks are looking for an extremely general solution, though, while the heap-queue approach proposed in that issue would probably work great for BAM processing (maybe with a little parameter tweaking). I'm not sure I'm gonna try it, but if you want to try implementing it as a way to learn a ton of Rust all at once, I'd love to see it. 😁
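To make that heap-queue idea concrete, here is a minimal sketch (assuming nothing about noodles or Rayon internals): completed jobs arrive tagged with their submission index, and a small buffer holds out-of-order results until the next expected index shows up.

```rust
use std::collections::BTreeMap;

/// Releases results strictly in submission order, however they arrive.
struct ReorderBuffer<T> {
    next: u64,                 // index the consumer expects next
    pending: BTreeMap<u64, T>, // out-of-order results, keyed by index
}

impl<T> ReorderBuffer<T> {
    fn new() -> Self {
        Self { next: 0, pending: BTreeMap::new() }
    }

    /// Accepts a completed job and returns every item that is now in order.
    fn push(&mut self, index: u64, item: T) -> Vec<T> {
        self.pending.insert(index, item);
        let mut ready = Vec::new();
        while let Some(item) = self.pending.remove(&self.next) {
            ready.push(item);
            self.next += 1;
        }
        ready
    }
}
```

With compressed BGZF blocks as `T`, a writer can call `push` for each completed compression job and write whatever comes back, so worker scheduling never leaks into the output order.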
The htslib thread pool is complicated! The pool has the notion of input (things to do) and output (completed jobs) queues. The output queue can be read either as results arrive or in strict order of submission. Clearly with BGZF it is critical to get the ordering correct, so that's how it utilises the thread pool. If we're reading and writing data (e.g., BAM to BAM, BAM to CRAM, etc.), then we have one shared thread pool with multiple input and output queues attached to cover the decode and encode tasks. The tricky bit was preventing deadlocks when a queue fills up while limiting the number of items (decoding is vastly faster than encoding, so you don't want to buffer up most of the uncompressed file because the decoder is running ahead of the encoder tasks). Lots of "fun" to be had here! :-)
My very sparse C skills are definitely not helping. 😁 So the ordering is preserved there, that's cool! I didn't see how that happened.
I was wondering about the use case of "read a BAM, do something to each record, write a new BAM" (because that's what I want to do). Would the ordering be preserved there? Maybe that's an implementation detail that depends on the application.
Yes, for BGZF all ordering is preserved. It needs to be, too, because BGZF makes no guarantee that a gzip block boundary lies on a sequence boundary (and indeed, long sequences may need many blocks to cover a single record!). There isn't even a guarantee that a sequence that doesn't fit in a block will start a new BGZF block. It does in our implementation, but it's not required by the specification. That makes random jumping around a bit of a pain in the neck.
As others have mentioned, I don't think this is trivial. I hope to explore parallel block decompression after #13 and then compression. It will likely be a good opportunity to introduce async interfaces (#9).

One thing that should be noted is that noodles-bgzf uses flate2, which is a frontend to different general-purpose zlib-compatible libraries. The default backend is the Rust implementation miniz_oxide, but you can override it in your own application by enabling a different flate2 backend feature. This improves both decoding and encoding performance but requires building and linking to a C library. Specialized libraries like libdeflate are not supported.
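For reference, a sketch of such an override in a downstream application's Cargo.toml (version numbers are illustrative; feature names are per the flate2 documentation, so double-check them against the flate2 version in use):

```toml
[dependencies]
# Cargo unifies features across the graph, so enabling a C backend here
# changes the flate2 backend that noodles-bgzf uses as well.
flate2 = { version = "1.0", default-features = false, features = ["zlib"] }

# Or, for zlib-ng in zlib-compat mode (needs cmake and a C toolchain):
# flate2 = { version = "1.0", default-features = false, features = ["zlib-ng-compat"] }
```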
I consider multi-threaded read/write for SAM/BAM/CRAM an essential feature for a high-performance htslib-style library. Having written tools using java/htsjdk, I had to port key sections of my program to C/htslib purely because htsjdk doesn't support multithreaded SAM parsing (so a …)
I'm not sure why the consensus in this thread is that adding multithreading capability is complicated. Speaking from my htsjdk PR experience, I can say that it's not too bad if you're willing to accept intermediate buffers and an extra data copy. Essentially, there are two steps in SAM.bgz/BAM/VCF.bgz/BCF I/O: compression/decompression and serialisation/parsing.

in-memory format <-> serialised format <-> compressed format

For async multithreaded support, it's just a matter of adding buffers between each of these steps and kicking off the async jobs when a buffer is full. The high-level design of my htsjdk implementation goes along the lines of:

stream of input records -> convert to serialised format -> [CP] chunk records together until the 64 KiB bgz limit -> compress block -> [CP] write to disk

[CP] are the checkpoints where you need to ensure the records are presented to the next step in order. I implemented this with an incrementing u64 on the input and logic to buffer if recordindex != expectedindex. I used an unbounded thread pool for the actual disk I/O, and a #threads = #cores thread pool for the computational steps (parsing and compression). Since both are purely computational steps, a single thread pool was fine. You can prevent unbounded async queue size through backpressure on the calling thread (for writing) and by limiting the amount of read-ahead (for reading). There's probably a more elegant, Rust-like design that noodles could take, but I'm relatively new to Rust so I won't presume to recommend a design (a sketch of this wiring appears after this comment). Downsides of this approach: …
I've recently taken up Rust for my latest tool, so, if you're open to it, I'm happy to contribute both code and design inputs. I deal mostly in structural variation analysis and contribute to both the SAM and VCF specifications. htsjdk has very poor VCF SV support, so it'd be nice to have a library that properly supports these variants (and handles the upcoming VCFv4.4 SV changes).
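The following is a minimal sketch of the checkpointed pipeline described above, not the htsjdk or noodles implementation: bounded queues provide backpressure, a fixed-size pool compresses chunks, and the writer restores submission order. The chunk size, thread count, queue depths, and the use of raw deflate via flate2 are all illustrative stand-ins.

```rust
use std::{
    collections::BTreeMap,
    io::Write,
    sync::{mpsc, Arc, Mutex},
    thread,
};

use flate2::{write::DeflateEncoder, Compression};

// Stand-in for the per-block compression step (raw deflate via flate2).
fn compress(chunk: &[u8]) -> Vec<u8> {
    let mut encoder = DeflateEncoder::new(Vec::new(), Compression::new(6));
    encoder.write_all(chunk).expect("write to in-memory encoder");
    encoder.finish().expect("finish deflate stream")
}

fn main() {
    // Bounded queues give backpressure: the producer blocks when workers
    // fall behind, and workers block when the writer falls behind.
    let (in_tx, in_rx) = mpsc::sync_channel::<(u64, Vec<u8>)>(16);
    let in_rx = Arc::new(Mutex::new(in_rx));
    let (out_tx, out_rx) = mpsc::sync_channel::<(u64, Vec<u8>)>(16);

    // Fixed-size compression pool (4 threads here for brevity).
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let in_rx = Arc::clone(&in_rx);
            let out_tx = out_tx.clone();
            thread::spawn(move || loop {
                // Take the lock just to receive one job; the guard is a
                // temporary dropped at the end of this statement.
                let job = in_rx.lock().unwrap().recv();
                match job {
                    Ok((index, chunk)) => out_tx.send((index, compress(&chunk))).unwrap(),
                    Err(_) => break, // producer hung up and the queue is drained
                }
            })
        })
        .collect();
    drop(out_tx); // the writer loop below ends once all workers hang up

    // Producer: tag each ~64 KiB chunk with an incrementing submission index.
    let producer = thread::spawn(move || {
        for index in 0..64u64 {
            let chunk = vec![b'A'; 64 * 1024]; // stand-in for serialised records
            in_tx.send((index, chunk)).unwrap();
        }
    });

    // Writer checkpoint ("[CP]"): release blocks strictly in submission order.
    let (mut next, mut pending) = (0u64, BTreeMap::new());
    for (index, block) in out_rx {
        pending.insert(index, block);
        while let Some(block) = pending.remove(&next) {
            // A real implementation would frame and write a BGZF block here.
            eprintln!("block {}: {} compressed bytes", next, block.len());
            next += 1;
        }
    }

    producer.join().unwrap();
    for worker in workers {
        worker.join().unwrap();
    }
}
```

Using `sync_channel` for both queues is what bounds memory: neither the read-ahead nor the compressed backlog can grow past the channel capacities, which is the backpressure behavior the comment above describes.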
Firstly, making SAM MT and speeding up the BAM threading was my work, so thanks for the kind comments about Htslib. :-) Some of my own comments.
All that aside, a naive threading system with X threads for encode and Y for decode is quite simple, so I'd start off with the basics.
I think the main question is not "can this be done" (of course) but rather "who has the time to do it properly." As I understand it, this isn't exactly @zaeleus's day job.
Thanks for the comments, @d-cameron and @jkbonfield. I'll take them into consideration while working on this. I'm focusing on simplicity for the first iteration, i.e., async I/O in the …
I want to mention that both multi-task BGZF block compression (writing) and decompression (reading) are available in the async branch. For discussion, below are some casual benchmarks comparing noodles-bgzf (5885fcc) readers and writers against bgzip 1.13. The input is a fruit fly reference genome. It can be prepared by recompressing the file using bgzip.
The uncompressed file is 139 MiB; the bgzip-compressed file, 44 MiB. Benchmarks are measured using hyperfine 1.11.0 with 3 warmup runs.

Example: Read BGZF
The pure Rust implementation (noodles + miniz_oxide) isn't quite able to keep up with the decompression speeds of zlib, particularly in the sync version. Sync noodles + zlib(-ng) can outperform single-threaded bgzip, but multithreaded bgzip outperforms noodles + zlib(-ng).

Example: Write BGZF (compression level = 6)
The write benchmarks are surprising. So much so that I'll confirm that the output is correct:
noodles + miniz_oxide has better write characteristics than zlib; however, zlib-ng provides significant performance improvements, particularly for a drop-in replacement. Curiously, noodles + zlib and bgzip perform nearly the same in both contexts.
Zlib-ng appears to have come on well, then. Last I tried, it wasn't a huge leap above zlib. Are the sizes comparable? If you're curious, you may also wish to try bgzip with libdeflate. This isn't a drop-in replacement, but htslib has bindings for it. It's a very fast, modern implementation of deflate, and it has faster decompression too, so it's our recommended method for htslib. There are some fun benchmarks at https://quixdb.github.io/squash-benchmark/unstable/
Fantastic benchmarking, Michael, kudos! Also a nice compression resource, James! I recently came across this one; perhaps you knew about it already?: http://www.mattmahoney.net/dc/text.html I suppose there's no urge to implement NN-based compression (looking at the no. 1 entry), quite impressive :-! Other than that, I wanted to comment that, if possible, those C bindings should be entirely optional? One of the main reasons I started using Noodles was to avoid C dependencies (after dealing with rust-htslib).
Yep, I've been a long-time follower of Matt's work. His sibling page at http://mattmahoney.net/dc/dce.html is also a useful resource. I also regularly look at the text compression page (as well as encode.su). I find the really slow tools can still be a useful research guide: for example, if we can get within a few percent of them for a specific type of data (e.g., read names in CRAM), then I know I can stop trying for better compression - close is "good enough" given the CPU time is vastly superior. There are a few serial winners which always perform amazingly for the speed/size tradeoffs - bsc and mcm. I've considered both for CRAM 4, but as yet they're not in there.

Agreed that all bindings ought to be optional, permitting a nice, simple pure-Rust implementation while offering potential gains for those who want the hassle of a more complicated setup. For the same reason, htslib can build against standard zlib without needing libdeflate.
Yes, all the outputs are close: 44.3 MiB ± 0.07 MiB.
The C libraries (zlib, zlib-ng, etc.) are definitely optional. The Rust implementation miniz_oxide is the default used in noodles. I chose to show different backends since encoding/decoding tend to be the bottleneck. They're easy to change via feature flags [1] [2] but do require their specific toolchains (e.g., zlib-ng needs cmake to build).
The async branch was merged into the main branch. It currently includes BAM (R/W), BAI (R/W), BCF (R), BGZF (R/W), tabix (R/W), and VCF (R/W). It is part of the noodles 0.5.0 release. As for the initial use case of this issue, writing filtered BAM files, here are some notable changes:
See …
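To give the original use case a rough shape, here is a hypothetical filtered-copy sketch. The method names below are assumptions about the 0.5-era async API rather than verified signatures; check the noodles-bam documentation for the version in use.

```rust
// Hypothetical sketch of the filtered-BAM use case; the reader/writer method
// names are assumptions about the 0.5-era async API, not verified signatures.
use futures::TryStreamExt; // assumes records() yields a fallible stream
use noodles_bam as bam;
use tokio::fs::File;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut reader = File::open("in.bam").await.map(bam::AsyncReader::new)?;
    let header = reader.read_header().await?;
    let reference_sequences = reader.read_reference_sequences().await?;

    let mut writer = File::create("out.bam").await.map(bam::AsyncWriter::new)?;
    writer.write_header(&header).await?;
    writer.write_reference_sequences(&reference_sequences).await?;

    let mut records = reader.records();
    while let Some(record) = records.try_next().await? {
        // Example filter: keep only mapped records.
        if !record.flags().is_unmapped() {
            writer.write_record(&record).await?;
        }
    }

    writer.shutdown().await?; // assumed to finalize the BGZF stream

    Ok(())
}
```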
This is fantastic, Michael!!! We were discussing the htsget-rs async port (soon to be merged) with @mmalenic and couldn't help noticing that CRAM and CSI are not async (yet?). What roadblocks did you find for those two? Are they significantly more complex to implement, or is it just a matter of time? Please don't take this as criticism; your implementation/fixing speed is admirable :)
This is awesome, thanks for all your work @zaeleus! I am going to try this very soon.
No blockers (that I know of :)). They're on the roadmap and will be in a (near) future release.
Very nice... |