Which tools to choose based on various performance metrics.
For fair comparison between tools it is best to use containerized environments. These eliminate any user specific installs or compilations.
For running the tools Singularity was used. The containers were provided by biocontainers.
Tools were run with singularity exec -eip
docker://quay.io/biocontainers/<tool>:<version-tag> command
. The -e
,
-i
and -p
flags isolate environment, IPC namespace and PID namespace
respectively. By default singularity mounts $PWD
and user HOME
directories, which was convenient when testing.
For benchmarking [hyperfine](https://github.com/sharkdp/hyperfine) was used. This command line tool is useful as it can perform warmup runs to ensure images are downloaded and disk caches are full.
Multithreading is a programming technique that allows a program to split up its execution in multiple independent threads that allow for parallel execution on the CPU.
Advantages:
- More efficient use of resources (less CPU's in idle state)
- Significant decrease in wall clock time if used correctly.
Disadvantages:
- Performance overhead compared to single-threaded algorithms
- More complex code required
When threads share resources some conditions must be satisfied to ensure the threads don't alter the shared resource at the same time (race condition) or wait on each other to change the resource and therefore never changing it (deadlock). Taking care these problems is what makes the code more complex and comes with a performance penalty.
Some algorithms come with a very small performance overhead with regards to multithreading some with a much bigger one.
Given the overhead of using more than one thread, the default should be one thread. Situations in which more threads can be considered are:
- A reduction in wall clock time is required or desirable
- A program is bottlenecking other applications
- Using a program in single-threaded mode will leave resources idle.
Examples:
- When time is no object, it may be cheaper to use one thread per task on cloud backends. This will decrease the amount of CPU time needed.
- On clinical pipelines, getting the diagnose one day earlier may be worth more than computational overhead.
- The program is used in a pipe and other programs spend a lot of time waiting on it.
- mapping/alignment tools require a lot of memory for the index, but not much additional memory per thread. On a node that has 16 GB and 4 threads it is more efficient to use all 4 threads and 16 GB of ram, than to use 12 GB of ram on one thread.
The bioinformatics world seems to have standardised on zlib. Most bioinformatics tools can open .gz
files and
output to them. Therefore tools that implement zlib should be used for
compressing files.
Not all zlib implementations are created equal.
The most well-known zlib implementation is gzip. Gzip
is very old and is written as a single-threaded algorithm. Another
implementation is pigz, which was written to make use
of multiple threads. Pigz is always faster than gzip. Even on a single core
the algorithm is simply more efficient. pigz -p 1
(no multithreading) beats
gzip
in both compression and decompression. [Cutadapt](
https://github.com/marcelm/cutadapt) therefore does its compression and
decompression with pigz.
Why pigz:
$ hyperfine -w 3 -r 10 "gzip -1 -c big.fastq > /dev/null" Benchmark #1: gzip -1 -c big.fastq > /dev/null Time (mean ± σ): 22.335 s ± 0.284 s [User: 22.123 s, System: 0.202 s] Range (min … max): 21.963 s … 22.890 s 10 runs $ hyperfine -w 3 -r 10 "pigz -p 1 -1 -c big.fastq > /dev/null" Benchmark #1: pigz -p 1 -1 -c big.fastq > /dev/null Time (mean ± σ): 19.343 s ± 0.225 s [User: 19.085 s, System: 0.243 s] Range (min … max): 19.093 s … 19.737 s 10 runs $ hyperfine -w 3 -r 10 "gzip -cd big.fastq.gz > /dev/null" Benchmark #1: gzip -cd big.fastq.gz > /dev/null Time (mean ± σ): 10.363 s ± 0.086 s [User: 10.274 s, System: 0.059 s] Range (min … max): 10.217 s … 10.471 s 10 runs $ hyperfine -w 3 -r 10 "pigz -p 1 -cd big.fastq.gz > /dev/null" Benchmark #1: pigz -p 1 -cd big.fastq.gz > /dev/null Time (mean ± σ): 6.196 s ± 0.037 s [User: 6.134 s, System: 0.053 s] Range (min … max): 6.132 s … 6.250 s 10 runs
Pigz is more than 10% faster when compressing and more than 40 % (!!!) faster when decompressing.
There are several zlib compression levels, from 1 to 9. Alternatively a file can not be compressed. Should a file be compressed, and if so, at which level?
Compression was done using pigz -p 1
level | time (seconds) | size | relative time | relative size |
---|---|---|---|---|
uncompressed | 0.914 | 2,3G | 0.05 | 4.44 |
1 | 19.770 | 513M | 1.00 | 1.00 |
2 | 21.713 | 496M | 1.10 | 0.97 |
3 | 30.370 | 467M | 1.54 | 0.91 |
4 | 28.196 | 465M | 1.43 | 0.91 |
5 | 47.708 | 446M | 2.41 | 0.87 |
6 | 108.467 | 402M | 5.49 | 0.78 |
7 | 174.239 | 386M | 8.81 | 0.75 |
8 | 220.316 | 383M | 11.14 | 0.75 |
9 | 223.419 | 383M | 11.30 | 0.75 |
The used data was quite repetitive so might not have been a best benchmark for the highest compression levels. What we see is that anything above compression level 1 uses disproportionally more compute time for the benefits it gives. Compression level 4 might be worth it but takes 43% compute time for a 9% smaller filesize.
The default compression should be level 1. What we see in for example Picard
Markduplicates
is that the execution time is halved when compression level
is set from 5 (default) to 1.
For short read DNA alignment, BWA has been the tool of choice. It is used by many insitutions as the go to standard.
When evaluating the performance of BIOWDL's germline-DNA pipeline, it was found that the alignment step using BWA takes ~50% of the wall-clock time and 70% of the compute time. This was after all the remaing steps in the pipeline had optimized compression levels. Alignment is by far the most expensive step. The reason that the wall clock time is less than the compute time is that the task can be parallelized very well with virtually no loss of efficiency.
To solve the long alignment times a new tool has come along: bwa-mem2. It promises to deliver increased speeds at the cost of an increase in memory. This increase in memory was quite substantial (10x) but since bwa-mem2 2.1 the increase has been reduced to 3x. It is therefore more viable for use generally, and can now be tested locally on a workstation which makes it easier to have apples-to-apples performance comparisons.
To compare bwa and bwa-mem2 the reference genome for pipelines: hg38 including
alt contigs and hs38d1 decoy sequences
was used. It is provided by the NCBI's ftp site
. The BWA index can also be downloaded from that page.
The index for bwa-mem2 was generated from the fasta file, with the .alt
index file from the bwa index (it's compatible).
The input FASTQ file was a 5 million record FASTQ file from the Genome in a bottle sample with read length 150. For a total of 1,6 GB. The file was uncompressed and stored on a NVME SSD drive. The indexes were stored on the same drive. The output was written as uncompressed SAM to a in-memory disk of 4GB.
Bwa-mem2 comes with several executables, each optimized for a particular instruction set. By default it chooses the fastest instruction set allowed by the CPU. These instruction sets allow fasts operations on vectors. Meaning that larger amounts of numbers can be manipulated at the same time for much less cost. The different instruction sets were benchmarked. SSE4.1, SSE4.2, AVX and AVX2 were tested as these were supported by the used CPU for testing (AMD Ryzen 5 3600). AVX512 is only available on modern Intel CPUs and was not tested.
Bwa-mem2 is provided as both source and as binaries on its github page. The binaries are compiled using the Intel C compiler. This compiler has been known to deliberately reduce performance on non-Intel CPU's. It is also proprietary. Therefore the bioconda distribution <http://bioconda.github.io/recipes/bwa-mem2/README.html>_ uses the GNU compiler collection. Both Intel and GNU compiled binaries were tested. Also the source was compiled locally using GCC and AOCC (AMD Optimizing C Compiler) to see how much value was added by local compilation.
Versions used were BWA 0.7.17 and bwa-mem2 2.1
The baseline is BWA. This uses slightly over 6 GB in memory (6.3 GB).
$ hyperfine -w 2 -r 5 'bwa mem -t 8 -o ramdisk/bwa-mem.sam GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa mem -t 8 -o ramdisk/bwa-mem.sam GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 179.499 s ± 0.700 s [User: 1425.645 s, System: 3.179 s] Range (min … max): 179.042 s … 180.719 s 5 runs
bwa-mem2 uses almost 19GB (18.9 GB) maximum which is approximately 3 times the memory used in BWA. Below follow the GNU Compiler compiled binary results. Note how the instruction sets have a big impact on performance. They are ordered from older to newer.
$ hyperfine -w 2 -r 5 'bwa-mem2.sse41 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2.sse41 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 139.950 s ± 4.528 s [User: 974.764 s, System: 9.624 s] Range (min … max): 136.166 s … 145.029 s 5 runs $ hyperfine -w 2 -r 5 'bwa-mem2.sse42 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2.sse42 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 124.358 s ± 0.194 s [User: 881.606 s, System: 9.801 s] Range (min … max): 124.187 s … 124.636 s 5 runs $ hyperfine -w 2 -r 5 'bwa-mem2.avx mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2.avx mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 115.176 s ± 0.594 s [User: 807.904 s, System: 9.872 s] Range (min … max): 114.439 s … 116.068 s 5 runs $ hyperfine -w 2 -r 5 'bwa-mem2.avx2 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2.avx2 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 96.902 s ± 0.184 s [User: 662.478 s, System: 9.813 s] Range (min … max): 96.756 s … 97.118 s 5 runs
Below are the results using the binaries found on the bwa-mem2 github page.
$ hyperfine -w 2 -r 5 'bwa-mem2-2.1_x64-linux/bwa-mem2.sse41 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2-2.1_x64-linux/bwa-mem2.sse41 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 124.173 s ± 0.089 s [User: 874.309 s, System: 9.747 s] Range (min … max): 124.056 s … 124.292 s 5 runs $ hyperfine -w 2 -r 5 'bwa-mem2-2.1_x64-linux/bwa-mem2.sse42 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2-2.1_x64-linux/bwa-mem2.sse42 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 121.848 s ± 0.966 s [User: 856.020 s, System: 9.917 s] Range (min … max): 121.142 s … 123.442 s 5 runs $ hyperfine -w 2 -r 5 'bwa-mem2-2.1_x64-linux/bwa-mem2.avx mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2-2.1_x64-linux/bwa-mem2.avx mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 111.642 s ± 0.145 s [User: 774.780 s, System: 9.681 s] Range (min … max): 111.502 s … 111.861 s 5 runs $ hyperfine -w 2 -r 5 'bwa-mem2-2.1_x64-linux/bwa-mem2.avx2 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2-2.1_x64-linux/bwa-mem2.avx2 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 93.252 s ± 0.477 s [User: 626.916 s, System: 10.021 s] Range (min … max): 92.757 s … 93.881 s 5 runs
Compiled on the test machine using GCC:
$ hyperfine -w 2 -r 5 'bwa-mem2-source/bwa-mem2.avx2 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2-source/bwa-mem2.avx2 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 95.301 s ± 0.446 s [User: 645.498 s, System: 9.800 s] Range (min … max): 94.923 s … 95.942 s 5 runs
Compiled on the test machine using AOCC:
$ hyperfine -w 2 -r 5 'bwa-mem2-source/bwa-mem2.avx2 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq' Benchmark #1: bwa-mem2-source/bwa-mem2.avx2 mem -t 8 -o ramdisk/bwa-mem2.sam bwa-mem2/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna ../big2.fastq Time (mean ± σ): 91.842 s ± 0.105 s [User: 618.004 s, System: 9.729 s] Range (min … max): 91.685 s … 91.961 s 5 runs
Relative time to completion of executables distributed on bioconda. (Compiled using GCC). Lower is better
program | relative time |
---|---|
BWA | 100.00% |
bwa-mem2.sse41 | 77.97% |
bwa-mem2.sse42 | 69.22% |
bwa-mem2.avx | 64.16% |
bwa-mem2.avx2 | 53.98% |
Relative time to completion of bwa-mem2 given compilation methods. This only compares bwa-mem2.avx2 as it was the fastest on the test system.
method | relative time |
---|---|
Conda distributed (GCC-compiled) | 100.00% |
Github distributed (ICPC-compiled) | 96.23% |
Compiled locally (GCC) | 98.35% |
Compiled locally (AOCC) | 94.78% |
Bwa-mem2 2.1 delivers a substantial decrease in time to completion for a three-fold increase in memory use. The three-fold increase in memory seems big but for the human genome this means going from 6.3 to 19 GB of memory usage. Given the circumstances in which bwa-mem2 is most practical (8 threads or higher for WGS samples) the core-to-thread ratio of 2.5G or lower per thread is very reasonable.
Bwa-mem2 successfully utilizes SIMD instructions which means that it performs much better on more modern architectures. However this should also be in reach for most institutions as AVX2 was introduced with Intel's Haswell architecture in 2013. AVX512 could not be benchmarked on the test machine. It was introduced in Skylake architecture server processors in 2017.
The bwa-mem2 README's claim that the Intel-compiled binaries perform better is verified even on this AMD test machine. However, the improvement is about 4%. Compiling locally using an open-source compiler gives a smaller improvement of less than 2%. Using a proprietary compiler that was optimized by the CPU vendor (AMD) led to a performance increase of 5%. These percentages are very small.
The bioconda distribution comes with numerous advantages such as an open toolchain and automatic update of a biocontainer for reproducibility purposes. It is debatable whether small performance increases due to using better compilers are worth losing these perks.
These conclusions apply mostly to the tested platform (AMD Ryzen 5 3600) and other processors may yield different results.
There are three contenders: samtools sort
, sambamba sort
and
picard SortSam
. Testing was performed on a 760 MB unsorted BAM file. Memory
settings and MAX_RECORDS_IN_RAM were set to low levels to simulate a
WGS scenario where the resulting files are so big they cannot possible kept in
memory and need to be split in multiple files.
Samtools sort and samtools index were run in the same run because sambamba and picard can index on the fly. So indexing needs to be done as well for fair comparison.
$ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools sort -@0 -m 128M -l 1 -o test.bam unsorted.bam && samtools index test.bam test.bai'" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools sort -@0 -m 128M -l 1 -o test.bam unsorted.bam && samtools index test.bam test.bai' Time (mean ± σ): 41.780 s ± 0.339 s [User: 37.899 s, System: 1.833 s] Range (min … max): 41.460 s … 42.287 s 5 runs
For picard we use VALIDATION_STRINGENCY=SILENT
. We can assume the aligner
outputs correct BAM formatted records. The tool only needs coordinates or names
depending on the sort order.
$ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx3G -XX:ParallelGCThreads=1 SortSam INPUT=unsorted.bam OUTPUT=test.bam CREATE_INDEX=true MAX_RECORDS_IN_RAM=300000 VALIDATION_STRINGENCY=SILENT SORT_ORDER=coordinate COMPRESSION_LEVEL=1' Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx3G -XX:ParallelGCThreads=1 SortSam INPUT=unsorted.bam OUTPUT=test.bam CREATE_INDEX=true MAX_RECORDS_IN_RAM=300000 VALIDATION_STRINGENCY=SILENT SORT_ORDER=coordinate COMPRESSION_LEVEL=1 Time (mean ± σ): 33.420 s ± 0.369 s [User: 46.486 s, System: 1.750 s] Range (min … max): 33.152 s … 34.067 s 5 runs
For comparison here is picard SortSam
with the default validation:
$ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx3G -XX:ParallelGCThreads=1 SortSam INPUT=unsorted.bam OUTPUT=test.bam CREATE_INDEX=true MAX_RECORDS_IN_RAM=300000 SORT_ORDER=coordinate COMPRESSION_LEVEL=1' Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx3G -XX:ParallelGCThreads=1 SortSam INPUT=unsorted.bam OUTPUT=test.bam CREATE_INDEX=true MAX_RECORDS_IN_RAM=300000 SORT_ORDER=coordinate COMPRESSION_LEVEL=1 Time (mean ± σ): 39.163 s ± 0.372 s [User: 55.183 s, System: 1.956 s] Range (min … max): 38.688 s … 39.728 s 5 runs
Sambamba has an interface similar to samtools and was run with the same settings.
hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba sort -t0 -m 128M -l 1 -o test.bam unsorted.bam' Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba sort -t0 -m 128M -l 1 -o test.bam unsorted.bam Time (mean ± σ): 89.088 s ± 0.770 s [User: 85.938 s, System: 1.622 s] Range (min … max): 87.847 s … 89.940 s 5 runs
picard SortSam
seems to be the fastest but a quick look at user time shows
that it is slower than samtools sort
. picard
seems to implement some
multithreading. Also the resulting file size for
samtools
(744 MB) is smaller than that of picard
(960 MB).
Since the sorting tool should be used in a pipe behind an aligner, a sort
tool that uses the least CPU time is preferred, as more time can go towards the
alignment. Also the output filesize is preferably small. samtools
is the
best tool to use here.
Sambamba is not in the same ball park as the other two tools with regards to sorting and should therefore not be considered.
1
$ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools sort -@0 -m 128M -l 1 -o test.bam unsorted.bam && samtools index test.bam test.bai'" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools sort -@0 -m 128M -l 1 -o test.bam unsorted.bam && samtools index test.bam test.bai' Time (mean ± σ): 42.022 s ± 0.319 s [User: 38.012 s, System: 1.860 s] Range (min … max): 41.720 s … 42.539 s 5 runs
2
$ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools sort -@2 -m 128M -l 1 -o test.bam unsorted.bam && samtools index test.bam test.bai'" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools sort -@2 -m 128M -l 1 -o test.bam unsorted.bam && samtools index test.bam test.bai' Time (mean ± σ): 23.124 s ± 0.244 s [User: 41.531 s, System: 3.432 s] Range (min … max): 22.764 s … 23.355 s 5 runs
3
$ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools sort -@3 -m 128M -l 1 -o test.bam unsorted.bam && samtools index test.bam test.bai'" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools sort -@3 -m 128M -l 1 -o test.bam unsorted.bam && samtools index test.bam test.bai' Time (mean ± σ): 19.528 s ± 1.319 s [User: 43.423 s, System: 3.657 s] Range (min … max): 18.475 s … 21.207 s 5 runs
Using additional threads decreases the wall clock time but increases the total CPU time.
All sorting tools work in the following way:
- A file is read in. The reads are sorted in a in-memory buffer.
- Once the buffer is full, it is written to disk to a tmp file.
- Once the entire file is read all tmp files and the memory buffer are merged.
If the sorting tool can hold the entire BAM into memory then no disk I/O is needed, giving significantly better performance.
When the BAM file is bigger than the in-memory buffer, part of it will be written to disk. In WGS the BAM file to be sorted is usually very big. 160GB for a level 1 compressed BAM file is not uncommon. Sorting this file with a 4GB in-memory buffer will create ~150 ~1GB temporary files (these are compressed 4GB bam files).
Increasing memory does only affect the number of temporary files written. The number of temporary files does not have a significant impact on the time as most of the time is spent sorting. The number of temporary files written is important, is the maximum number of open file handles may be reached. Using a default in the 2-4GB range seems reasonable for sorting WGS BAM data.
The BAM should be sorted directly after alignment using a unix pipe. Writing the BAM to a file and then using sort afterwards is a waste. The sort algorithm will chunk up the bam file in sorted small bam files before merging these in the resulting bam file. Therefore a sort algorithm will write the entire bam file to disk twice. To not use a pipe from the aligner will increase that to three times. Also additional time will be needed to compress and decompress the file from disk.
Samtools, picard and sambamba can all mark duplicate reads. Samtools
requires a more complex pipeline. GATK may have some requirements on how the
duplicates are marked. Therefore picard MarkDuplicates
is a good candidate
as well as sambamba markdup
which promises to work according to the
picard criteria.
For testing the samtools sorted test.bam was used.
$ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 METRICS_FILE=markdup.metrics' Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 METRICS_FILE=markdup.metrics Time (mean ± σ): 51.616 s ± 0.325 s [User: 59.094 s, System: 1.549 s] Range (min … max): 51.168 s … 51.959 s 5 runs
hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 1 -l 1 test.bam markdup.bam" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 1 -l 1 test.bam markdup.bam Time (mean ± σ): 86.467 s ± 0.567 s [User: 83.899 s, System: 1.494 s] Range (min … max): 86.023 s … 87.431 s 5 runs
Sambamba requires more CPU seconds 84 vs 59 for picard. But, the Picard file is significantly bigger 960M vs 766M. That's a big difference, especially when handling big WGS files. This can be multiple gigabytes.
Picard uses the intel deflater by default which gives very large files for compression level 1. We can also use the jdk deflater which should yield the same filesize as sambamba.
$ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 METRICS_FILE=markdup.metrics USE_JDK_INFLATER=true USE_JDK_DEFLATER=true' && du -h markdup.bam Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 METRICS_FILE=markdup.metrics USE_JDK_INFLATER=true USE_JDK_DEFLATER=true Time (mean ± σ): 65.913 s ± 0.449 s [User: 73.844 s, System: 1.458 s] Range (min … max): 65.480 s … 66.503 s 5 runs
This generates files of 765M which is virtually the same as the 766M by sambamba. But compute time for picard (74 vs 89) is better.
If you feel better using the intel deflater and inflater compression level 3 yields similar results.
$ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=2 METRICS_FILE=markdup.metrics' && du -h markdup.bam Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=2 METRICS_FILE=markdup.metrics Time (mean ± σ): 47.503 s ± 0.661 s [User: 54.885 s, System: 1.463 s] Range (min … max): 46.647 s … 48.283 s 5 runs 960M markdup.bam $ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=3 METRICS_FILE=markdup.metrics' && du -h markdup.bam Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=3 METRICS_FILE=markdup.metrics Time (mean ± σ): 76.495 s ± 0.268 s [User: 85.189 s, System: 1.381 s] Range (min … max): 76.171 s … 76.848 s 5 runs 742M markdup.bam
It is slightly slower 77 vs 74 seconds with a slightly smaller bam file 753M vs 765M.
For comparison here is picard's execution time with default settings.
$ hyperfine -r 2 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=5 METRICS_FILE=markdup.metrics' Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=5 METRICS_FILE=markdup.metrics Time (mean ± σ): 103.899 s ± 0.180 s [User: 111.700 s, System: 1.445 s] Range (min … max): 103.772 s … 104.026 s 2 runs
Which generates a bam file of 711M.
But sambamba has a multithreaded advantage. How does it scale?
$ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 1 -l 1 test.bam markdup.bam" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 1 -l 1 test.bam markdup.bam Time (mean ± σ): 87.504 s ± 0.489 s [User: 84.816 s, System: 1.547 s] Range (min … max): 86.847 s … 88.011 s 5 runs $ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 2 -l 1 test.bam markdup.bam" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 2 -l 1 test.bam markdup.bam Time (mean ± σ): 51.343 s ± 0.312 s [User: 87.543 s, System: 1.984 s] Range (min … max): 50.995 s … 51.812 s 5 runs $ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 3 -l 1 test.bam markdup.bam" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 3 -l 1 test.bam markdup.bam Time (mean ± σ): 42.527 s ± 0.287 s [User: 95.170 s, System: 2.402 s] Range (min … max): 42.220 s … 42.943 s 5 runs $ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 4 -l 1 test.bam markdup.bam" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 4 -l 1 test.bam markdup.bam Time (mean ± σ): 37.447 s ± 0.494 s [User: 99.267 s, System: 2.478 s] Range (min … max): 37.000 s … 38.188 s 5 runs
Sambamba uses some threads effectively to its advantage, however diminishing returns hit quickly. At 4 threads CPU time/ wall clock time is less than 3. Meaning that 1 thread is not sufficiently utilized.
Maybe the scaling gets better when using better compression?
$ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 4 -l 4 test.bam markdup.bam" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 4 -l 4 test.bam markdup.bam Time (mean ± σ): 43.028 s ± 0.428 s [User: 118.324 s, System: 2.617 s] Range (min … max): 42.612 s … 43.727 s 5 runs
CPU time/ wall clock time is still less than 3. So it is not possible to fully utilize the 4 threads. Requiring heavier compression does not increase thread utilization.
Since x threads do not utilize x cores, we can see what the effect would be if we run sambamba with x threads on a 2-core machine by using taskset.
$ hyperfine -w 2 -r 5 "taskset -c 0,1 singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 1 -l 1 test.bam markdup.bam" Benchmark #1: taskset -c 0,1 singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 1 -l 1 test.bam markdup.bam Time (mean ± σ): 81.394 s ± 0.252 s [User: 78.575 s, System: 1.279 s] Range (min … max): 81.172 s … 81.828 s 5 runs $ hyperfine -w 2 -r 5 "taskset -c 0,1 singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 2 -l 1 test.bam markdup.bam" Benchmark #1: taskset -c 0,1 singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 2 -l 1 test.bam markdup.bam Time (mean ± σ): 48.222 s ± 0.125 s [User: 80.145 s, System: 1.734 s] Range (min … max): 48.091 s … 48.380 s 5 runs $ hyperfine -w 2 -r 5 "taskset -c 0,1 singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 3 -l 1 test.bam markdup.bam" Benchmark #1: taskset -c 0,1 singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 3 -l 1 test.bam markdup.bam Time (mean ± σ): 45.894 s ± 0.230 s [User: 79.484 s, System: 1.940 s] Range (min … max): 45.651 s … 46.204 s 5 runs $ hyperfine -w 2 -r 5 "taskset -c 0,1 singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 4 -l 1 test.bam markdup.bam" Benchmark #1: taskset -c 0,1 singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 4 -l 1 test.bam markdup.bam Time (mean ± σ): 47.583 s ± 0.085 s [User: 80.978 s, System: 2.030 s] Range (min … max): 47.455 s … 47.666 s 5 runs
We can see that using 3 threads on 2 cores leads to te lowest wall clock time. Being able to better utilize the CPU resources.
Picard .. code-block:
Intel compression level 5 $ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=5 METRICS_FILE=markdup.metrics' && stat -c %s markdup.bam | numfmt --to=iec-i Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=5 METRICS_FILE=markdup.metrics Time (mean ± σ): 202.168 s ± 6.038 s [User: 241.791 s, System: 8.639 s] Range (min … max): 191.935 s … 207.536 s 5 runs 711Mi Intel compression level 1 $ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 METRICS_FILE=markdup.metrics' && stat -c %s markdup.bam | numfmt --to=iec-i Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 METRICS_FILE=markdup.metrics Time (mean ± σ): 112.341 s ± 4.293 s [User: 159.056 s, System: 7.862 s] Range (min … max): 106.777 s … 117.376 s 5 runs 960Mi Intel compression level 2 $ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=2 METRICS_FILE=markdup.metrics' && stat -c %s markdup.bam | numfmt --to=iec-i Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=2 METRICS_FILE=markdup.metrics Time (mean ± σ): 124.673 s ± 6.990 s [User: 167.482 s, System: 11.114 s] Range (min … max): 112.443 s … 129.351 s 5 runs 960Mi Intel compression level 3 $ hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=3 METRICS_FILE=markdup.metrics' && stat -c %s markdup.bam | numfmt --to=iec-i Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=3 METRICS_FILE=markdup.metrics Time (mean ± σ): 163.478 s ± 5.353 s [User: 204.590 s, System: 8.764 s] Range (min … max): 160.655 s … 173.039 s 5 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. 742Mi JDK level 1 hyperfine -w 2 -r 5 'singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 METRICS_FILE=markdup.metrics USE_JDK_INFLATER=true USE_JDK_DEFLATER=true' && stat -c %s markdup.bam | numfmt --to=iec-i Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/picard:2.23.1--h37ae868_0 picard -Xmx4G -XX:ParallelGCThreads=1 MarkDuplicates INPUT=test.bam OUTPUT=markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 METRICS_FILE=markdup.metrics USE_JDK_INFLATER=true USE_JDK_DEFLATER=true Time (mean ± σ): 147.671 s ± 5.456 s [User: 194.152 s, System: 8.623 s] Range (min … max): 141.741 s … 156.001 s 5 runs 765Mi
sambamba
hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 1 -l 1 test.bam markdup.bam" && stat -c %s markdup.bam | numfmt --to=iec-i Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba markdup -t 1 -l 1 test.bam markdup.bam Time (mean ± σ): 143.724 s ± 1.610 s [User: 136.320 s, System: 4.303 s] Range (min … max): 141.083 s … 145.420 s 5 runs 766Mi
Using picard with compression level 1 and defaults will yield a very big file which is undesirable. When the jdk deflater is utilized the file size is similar to sambamba and we can make an apples-to-apples comparison.
When utilizing a single core picard is most efficient with 74 seconds vs 84 for sambamba. Sambamba can use more cores however and finish in a wall clock time of 51 seconds when utilizing 2 cores. Beyond that it does not scale well.
A thing to consider is the memory usage. Sambamba will not use more than 3 gigabyte of memory on its default settings. Broad institute runs Picard MarkDuplicates on VMs with 7 GB.
Another thing to consider is that Picard outputs a metrics file stating how many reads were marked as duplicate.
It is not a decisive win for any application.
. | Picard | sambamba 1 core | sambamba 2 cores |
---|---|---|---|
cores | 1 | 1 | 2 |
memory | 7G | 3G | 3G |
time | 74s | 84s | 51s |
metrics | yes | no | no |
Samtools, sambamba and picard all provide tools for merging several sorted bam files together.
Two test BAMs with each 5 million reads were generated. The test BAMs were distinct from each other. Both BAM files were sorted using samtools.
$ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools merge -f -l1 -@0 merged.bam test.bam test2.bam && samtools index merged.bam'" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/samtools:1.10--h9402c20_2 bash -c 'samtools merge -f -l1 -@0 merged.bam test.bam test2.bam && samtools index merged.bam' Time (mean ± σ): 45.746 s ± 0.223 s [User: 42.968 s, System: 1.143 s] Range (min … max): 45.547 s … 46.012 s 5 runs
$ hyperfine -w 2 -r 5 "singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba merge -l1 -t0 merged.bam test.bam test2.bam" Benchmark #1: singularity exec -eip docker://quay.io/biocontainers/sambamba:0.7.1--h148d290_2 sambamba merge -l1 -t0 merged.bam test.bam test2.bam Time (mean ± σ): 91.850 s ± 1.192 s [User: 87.480 s, System: 2.778 s] Range (min … max): 90.252 s … 93.187 s 5 runs
Pipes can be used to couple BWA with the postalt.js script and samtools sort in order to let the alignment process output a sorted BAM file. This massively limits the amount of files written to disk which is a great advantage.
A disadvantage of pipes is that they are blocking when read while empty or while written and full. The default pipe size on Linux is 64K. It is possible for a pipe to get clogged up repeatedly and thus lengthen computation time.
These scripts were used:
- 8 cores BWA, postalt script, samtools sort 3 cores. Sequentially.
- 8 cores BWA, postalt script, samtools sort 3 cores. Pipes.
- 8 cores BWA, postalt script, samtools sort 3 cores. Named pipes.
- 8 cores BWA, postalt script, samtools sort 3 cores. Named pipes altered with [mkbigfifo](https://github.com/biowdl/mkbigfifo).
- 12 cores BWA, postalt script, samtools sort 4 cores. Sequentially.
- 12 cores BWA, postalt script, samtools sort 4 cores. Pipes.
- 12 cores BWA, postalt script, samtools sort 4 cores. Named pipes.
- 12 cores BWA, postalt script, samtools sort 4 cores. Named pipes altered with [mkbigfifo](https://github.com/biowdl/mkbigfifo).
$ hyperfine -w 2 -r 10 'bash bwa_no_pipes_8cores.sh' Benchmark #1: bash bwa_no_pipes.sh Time (mean ± σ): 205.159 s ± 1.112 s [User: 1186.185 s, System: 8.004 s] Range (min … max): 203.323 s … 206.748 s 10 runs $ du -h ramdisk/* 435M ramdisk/no_pipes.aln.bam 2,4G ramdisk/no_pipes.postalt.sam 2,2G ramdisk/no_pipes.sam
$ hyperfine -w 2 -r 10 'bash bwa_with_pipes_8cores.sh' Benchmark #1: bash bwa_with_pipes.sh Time (mean ± σ): 171.717 s ± 0.520 s [User: 1240.633 s, System: 9.746 s] Range (min … max): 170.844 s … 172.695 s 10 runs $ du -h ramdisk/with_pipes.aln.bam 435M ramdisk/with_pipes.aln.bam
$ hyperfine -w 2 -r 10 'bash bwa_fifo_8cores.sh' Benchmark #1: bash bwa_fifo_8cores.sh Time (mean ± σ): 171.638 s ± 1.279 s [User: 1247.522 s, System: 10.922 s] Range (min … max): 170.079 s … 174.891 s 10 runs
$ hyperfine -w 2 -r 10 'bash bwa_mkbigfifo_8cores.sh' Benchmark #1: bash bwa_mkbigfifo_8cores.sh Time (mean ± σ): 170.907 s ± 2.347 s [User: 1240.188 s, System: 6.786 s] Range (min … max): 168.868 s … 175.535 s 10 runs
$ hyperfine -w 2 -r 10 'bash bwa_no_pipes_12cores.sh' Benchmark #1: bash bwa_no_pipes_12cores.sh Time (mean ± σ): 155.212 s ± 0.614 s [User: 1271.680 s, System: 8.780 s] Range (min … max): 154.406 s … 156.669 s 10 runs $ du -h ramdisk/* 435M ramdisk/no_pipes.aln.bam 2,4G ramdisk/no_pipes.postalt.sam 2,2G ramdisk/no_pipes.sam
$ hyperfine -w 2 -r 10 'bash bwa_with_pipes_12cores.sh' Benchmark #1: bash bwa_with_pipes_12cores.sh Time (mean ± σ): 123.853 s ± 0.928 s [User: 1295.435 s, System: 10.535 s] Range (min … max): 122.943 s … 125.636 s 10 runs $ du -h ramdisk/with_pipes.aln.bam 435M ramdisk/with_pipes.aln.bam
$ hyperfine -w 2 -r 10 'bash bwa_fifo_12cores.sh' Benchmark #1: bash bwa_fifo_12cores.sh Time (mean ± σ): 122.921 s ± 0.541 s [User: 1288.907 s, System: 11.465 s] Range (min … max): 122.244 s … 123.700 s 10 runs
$ hyperfine -w 2 -r 10 'bash bwa_mkbigfifo_12cores.sh' Benchmark #1: bash bwa_mkbigfifo_12cores.sh Time (mean ± σ): 125.581 s ± 0.955 s [User: 1303.329 s, System: 8.499 s] Range (min … max): 123.360 s … 126.660 s 10 runs