Skip to content

nicolaprezza/cw-dBg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cw-dBg: the compressed weighted de Bruijn graph

Author: Nicola Prezza. Joint work with Giuseppe Italiano, Blerina Sinaimeri, Rossano Venturini

Description

Warning: experimental code. Runs only on small files. dBg construction is not yet optimized and requires 22 Bytes per input base (in RAM) at the moment

This library builds a compressed representation of the weighted de Bruijn graph. The underlying graph topology is stored using the BOSS representation, while the weights are differentially encoded and sampled on a spanning tree of the graph chosen to minimize the total bit-size of the structure. Results show that on a 20x-covered dataset with 27M distinct kmers (700Mbases in total), the whole structure takes 5.44 bits per kmer (just 18 MB in total).

Download

To clone the repository, run:

git clone http://github.com/nicolaprezza/cw-dBg

Compile

The library has been tested under linux using gcc 9.2.1. You need the SDSL library installed on your system (https://github.com/simongog/sdsl-lite).

We use cmake to generate the Makefile. Create a build folder in the main cw-dBg folder:

mkdir build

run cmake:

cd build; cmake ..

and compile:

make

Run

After compiling, run

cw-dBg-build [-l nlines] [-a] [-s srate] input k

to build the compressed weighted de Bruijn graph of order k on the file input (a fastq file by default, or a fasta file if option -a is specified). if option -l nlines is specified, build the graph using only the first nlines sequences from the input file. If option -s srate is specified, sample one out of srate weights (default: srate=64).

The tool cw-dBg-check allows to benchmark the data structure previously built as follows:

cw-dBg-check [options] <input_index> <input_fastx>

Options:
    -q Extract and test the structure on the first maximum k-mers in the dataset. Default: 1000000
    -a The input file is fasta. If not specified, it is assumed that the input file is fastq.
    -c Check correctness of the structure against a classic hash (space-consuming!!). Default: false.
    <input_index> Input index built with cw-dbg-build. Mandatory.
    <input_fastx> Fasta/fastq file from which test kmers will be extracted. Must be the same on which the index was built.

About

Compressed weighted de Bruijn graph

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published