-
Notifications
You must be signed in to change notification settings - Fork 13
Data Model
Graph BWT (GBWT) is a substring index for collections of similar paths in a graph. It is implemented as a multi-string BWT of the node sequences. Hence GBWT is a variant of the graph extension (gPBWT) of the positional BWT (PBWT).
We build BWT for the reverse sequences, or equivalently sort the reverse prefixes of the paths in lexicographic order. As a result, LF-mapping moves forward on the paths.
- We store arbitrary paths in a general graph.
- The collection of paths is highly repetitive, making the BWT compressible with run-length encoding.
- There are less than 2^32 nodes in the graph and less than 2^32 occurrences of each node in the collection.
- The set of node identifiers is a dense subset of unsigned 32-bit integers in some range [a,b].
- Each path is terminated by endmarker node 0, which is not a real node of the graph.
- Path identifiers are integers starting from 0 denoting the order of insertion.
typedef sdsl::int_vector<0> text_type;
typedef sdsl::int_vector_buffer<0> text_buffer_type;
The input is a concatenation of 0-terminated integer sequences. It is stored as an SDSL integer vector: either in memory as int_vector<0>
or on disk as int_vector_buffer<0>
. In-memory construction also works with std::vector<node_type>
.
GBWT construction is incremental. We insert a batch of sequences into a dynamic FM-index, extending each sequence in the batch simultaneously by one node.
As an FM-index, GBWT supports many kinds of queries. The following have been implemented so far:
- Extract a path, given its identifier.
- Find the lexicographic range of subpaths matching the pattern string.
- Determine the identifier of a path, given the lexicographic rank of a reverse prefix.
Richard Durbin: Efficient haplotype matching and storage using the Positional Burrows-Wheeler Transform (PBWT). Bioinformatics 30(9):1266-1272, 2014. DOI: 10.1093/bioinformatics/btu014
Adam M. Novak, Erik Garrison, and Benedict Paten: A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms for Molecular Biology 12:18, 2017. DOI: 10.1186/s13015-017-0109-9