Author: Nicola Prezza. Joint work with Giuseppe Italiano, Blerina Sinaimeri, Rossano Venturini
Modified by Debasish Pattanayak to add external sorting
Warning: experimental code. dBg construction uses external sorting. See results.md for estimated RAM usage.
This library builds a compressed representation of the weighted de Bruijn graph. The underlying graph topology is stored using the BOSS representation, while the weights are differentially encoded and sampled on a spanning tree of the graph chosen to minimize the total bit-size of the structure. Results show that on a 20x-covered dataset with 27M distinct kmers (700Mbases in total), the whole structure takes 5.44 bits per kmer (just 18 MB in total).
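The idea of differentially encoding weights along a tree and sampling some of them can be illustrated with a small C++ sketch. This is not the library's implementation: the struct `DeltaTree` and the functions `encode` and `weight_of` are hypothetical names, the tree is taken as given rather than chosen to minimize space, sampling by depth is an assumption, and the values are stored as plain integers rather than as compressed codes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container used only for this illustration.
struct DeltaTree {
    std::vector<int> parent;           // parent[v] in the tree, -1 for the root
    std::vector<bool> sampled;         // true if the weight of v is stored explicitly
    std::vector<std::int64_t> value;   // weight[v] if sampled, else weight[v] - weight[parent[v]]
};

// Encode weights over a tree given by parent pointers.
// Assumption: nodes are numbered so that parent[v] < v (node 0 is the root).
// A node is sampled when its depth is a multiple of srate, so at most
// srate - 1 deltas separate any node from an explicitly stored weight.
DeltaTree encode(const std::vector<int>& parent,
                 const std::vector<std::int64_t>& weight, int srate) {
    assert(srate >= 1 && parent.size() == weight.size());
    const std::size_t n = parent.size();
    DeltaTree t{parent, std::vector<bool>(n), std::vector<std::int64_t>(n)};
    std::vector<int> depth(n, 0);
    for (std::size_t v = 0; v < n; ++v) {
        if (parent[v] >= 0) {
            assert(static_cast<std::size_t>(parent[v]) < v);
            depth[v] = depth[parent[v]] + 1;
        }
        t.sampled[v] = (depth[v] % srate == 0);
        t.value[v] = t.sampled[v] ? weight[v] : weight[v] - weight[parent[v]];
    }
    return t;
}

// Recover the weight of v by summing deltas up to the nearest sampled ancestor.
std::int64_t weight_of(const DeltaTree& t, int v) {
    std::int64_t w = 0;
    while (!t.sampled[v]) {
        w += t.value[v];
        v = t.parent[v];
    }
    return w + t.value[v];
}
```

Because neighboring k-mers in a read set tend to have similar abundances, the deltas are typically small numbers, which is what makes them cheap to store. A larger sampling rate stores fewer explicit weights but lengthens the walk needed to recover one.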
To clone the repository, run:
cd cw-dBg
The library has been tested under Linux using gcc 9.2.1. It requires the SDSL library (https://github.com/simongog/sdsl-lite) to be installed on your system.
We use cmake to generate the Makefile. Create a build folder in the main cw-dBg folder:
mkdir build
run cmake:
cd build; cmake ..
and compile:
make
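External sorting relies on stxxl, which reads its disk configuration from a file named .stxxl in the build directory. In that file, specify the path of the disk file and its desired size, for example: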
disk=/var/tmp/stxxl,500GiB,syscall unlink
After compiling, run
cw-dBg-build [-l nlines] [-a] [-s srate] input k
to build the compressed weighted de Bruijn graph of order k on the file input (a fastq file by default, or a fasta file if option -a is specified). If option -l nlines is specified, the graph is built using only the first nlines sequences from the input file. If option -s srate is specified, one out of every srate weights is sampled (default: srate=64).
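To make concrete what the build step computes, the following C++ sketch counts k-mers with an in-memory hash map. It assumes that the weight of a k-mer is its number of occurrences in the input, ignores reverse complements and non-ACGT characters, and is not how the tool works internally (the tool uses external sorting). The function name `count_kmers` and its parameters are hypothetical; `nlines` only mimics the meaning of the -l option.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Count every k-mer occurring in the first `nlines` reads
// (all reads if nlines == 0 or nlines exceeds the number of reads).
std::unordered_map<std::string, std::uint32_t>
count_kmers(const std::vector<std::string>& reads, std::size_t k,
            std::size_t nlines = 0) {
    assert(k >= 1);
    std::unordered_map<std::string, std::uint32_t> counts;
    const std::size_t limit =
        (nlines == 0 || nlines > reads.size()) ? reads.size() : nlines;
    for (std::size_t r = 0; r < limit; ++r) {
        const std::string& s = reads[r];
        // Slide a window of length k over the read; reads shorter than k
        // contribute nothing.
        for (std::size_t i = 0; i + k <= s.size(); ++i) {
            ++counts[s.substr(i, k)];
        }
    }
    return counts;
}
```

A table like this stores every k-mer explicitly and is therefore far larger than the compressed structure, but it is a convenient reference for small inputs.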
The tool cw-dBg-check benchmarks a previously built data structure. It is run as follows:
cw-dBg-check [options] <input_index> <input_fastx>
Options:
-q <n> Extract and test the structure on at most the first <n> k-mers in the dataset. Default: 1000000
-a The input file is fasta. If not specified, it is assumed that the input file is fastq.
-c Check correctness of the structure against a classic hash (space-consuming!!). Default: false.
<input_index> Input index built with cw-dBg-build. Mandatory.
<input_fastx> Fasta/fastq file from which test k-mers will be extracted. Must be the same file on which the index was built.
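The difference between the two input formats selected by -a can be sketched as follows. This is a minimal illustration, not the tool's own parser: the function name `read_sequences` is hypothetical, and the fastq branch assumes the common four-line record layout (header, sequence, separator, qualities) with no line wrapping.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read all sequences from a fasta (is_fasta = true) or fastq stream.
std::vector<std::string> read_sequences(std::istream& in, bool is_fasta) {
    std::vector<std::string> seqs;
    std::string line;
    if (is_fasta) {
        while (std::getline(in, line)) {
            if (line.empty()) continue;
            if (line[0] == '>') {
                seqs.emplace_back();        // a header line starts a new record
            } else if (!seqs.empty()) {
                seqs.back() += line;        // a sequence may span several lines
            }
        }
    } else {
        // Assumed layout: every record is exactly 4 lines; the sequence is
        // the second line of each record.
        std::size_t i = 0;
        while (std::getline(in, line)) {
            if (i % 4 == 1) seqs.push_back(line);
            ++i;
        }
    }
    return seqs;
}
```

The function accepts any `std::istream`, so it can be tried on a `std::istringstream` holding a few records before being pointed at a `std::ifstream` opened on a real file.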