Skip to content

BWT Text Indexing Library: a set of tools to work with BWT-based text indexes

Notifications You must be signed in to change notification settings

nicolaprezza/BWTIL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BWTIL : the BWT Text-indexing Library

Authors: Nicola Prezza, Nicola Gigante, and Alberto Policriti. mail: [email protected]

Brief introduction

The BWTIL library offers a set of tools, classes and functions designed to facilitate operations that involve manipulating the Burrows-Wheeler text transform (example: computing/inverting the BWT or checking its correctness) and indexes based on the BWT. The library is specifically designed to deal with large files (which means up to 2^64 characters) and implements efficient algorithms and data structures to offer good time-and-space performances. BWTIL comes with some sample programs and classes, among which you can find static and dynamic bitvectors, Huffman-compressed dynamic strings, partial sums, Huffman trees, static wavelet trees, the dB-hash data structure (ISAAC2014 paper: http://link.springer.com/chapter/10.1007%2F978-3-319-13075-0_13), a succinct wavelet-tree based FM index, and a novel BWT construction algorithm (LATA2015 paper: http://link.springer.com/chapter/10.1007/978-3-319-15579-1_46). The available executables (to date) are:

Download

Since BWTIL includes extern git repositories as submodules, clone it using the --recursive option:

git clone --recursive https://github.com/nicolaprezza/BWTIL

Compile

The library has been extensively tested under linux using gcc 4.8.2 and clang++ 3.5. We use cmake to generate the Makefile:

To compile, firstly create and enter the bin/ directory

mkdir bin; cd bin

Then, launch cmake as (default build type is release):

cmake ..

Finally, build the executables:

make

The above command creates the executables in the bin directory.

Input formats

  • plain text files: ASCII-coded. However, the byte 0x0 must NOT appear inside the text since the algorithms use 0x0 as text terminator.
  • bwt files: ASCII-coded, with a UNIQUE 0x0 byte (terminator character) appearing somewhere inside the text. Be aware that, if the input bwt file is malformed, the programs will fail with a error message.
  • suffix array files: each SA address is stored as a 64-bit (8 byte) integer. Despite the fact that this coding may not be optimal for small texts, you can compress the SA files to reduce its size.

About

BWT Text Indexing Library: a set of tools to work with BWT-based text indexes

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages