-
Notifications
You must be signed in to change notification settings - Fork 13
Fast Locate
The r-index is a fast locate()
structure that can be used to augment the compressed GBWT. It is defined as class FastLocate
in fast_locate.h.
The implementation is based on the original r-index described in:
Travis Gagie, Gonzalo Navarro, and Nicola Prezza: Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. Journal of the ACM, 2020. DOI: 10.1145/3375890
The r-index stores one suffix array sample at the start and end of each logical run in the BWT. (A logical run is either a maximal run of actual nodes or a single occurrence of the endmarker node in the BWT.) In contrast, the default locate()
structure stores the path identifiers (samples the document array) at fixed intervals. Because the r-index is based on global properties of the sequences, it must be rebuilt whenever the GBWT changes.
In a typical case, the r-index uses 3-4 times more space than the BWT itself and much more than the default locate()
structure with sample rate 1024. In a 1000GP GBWT, it can report roughly 1.5 million occurrences/second. In contrast, the default structure can report ~10 thousand individual occurrences/second or ~100 thousand occurrences/second when reporting ranges of occurrences.
R-index version | GBWT version | Changes |
---|---|---|
1 | v1.1 | First version. |
explicit FastLocate(const GBWT& source)
The construction is very straightforward. It extracts all sequences from the GBWT using multiple OMP threads and stores the suffix array samples at the start/end of each logical run. After the extraction has finished, the construction sorts and compresses the samples.
By default, the construction writes some statistics to stderr
. This can be adjusted with the verbosity settings.
Like other GBWT structures, the r-index uses the SDSL interface for serialization and loading.
size_type serialize(std::ostream& out, sdsl::structure_tree_node* v = nullptr, std::string name = "") const;
void load(std::istream& in);
The r-index cannot be used without the GBWT. After the index has been loaded, the GBWT must be set explicitly.
void setGBWT(const GBWT& source)
build_ri
is a simple tool for building the r-index.
build_ri [options] base_name
The tool reads base_name.gbwt
, builds the r-index, and writes it to base_name.ri
.
-
-t N
: UseN
parallel extraction threads. -
-v
: Verify the r-index. This requires the original sequences in filebase_name
.
The r-index is a structure that enumerates the suffixes of the sequences in lexicographic order. Because the GBWT is an FM-index of the reverse sequences, the implementation actually enumerates the reverse prefixes.
For best results, use the r-index version of find()
that returns the first occurrence in the range in addition to the range itself. The occurrence can then be given as a seed to a locate()
query. If no seed is given, the locate()
query uses the start of the first run overlapping with the range as a seed. This may cause significant slowdown if the runs are long but the range is short. In such cases, it may be better to repeat the search for the final pattern using the r-index find()
.
The r-index uses the following variants of the find()
/ extend()
queries (see Search Interface).
SearchState find(node_type node, size_type& first) const;
template<class Iterator> SearchState find(Iterator begin, Iterator end, size_type& first) const;
SearchState extend(SearchState state, node_type node, size_type& first) const;
template<class Iterator> SearchState extend(SearchState state, Iterator begin, Iterator end, size_type& first) const;
If the search is successful (the returned state is non-empty), first
will contain the first occurrence in the range.
The r-index version locate()
can optionally take the first occurrence as a seed. If no seed is given, the query will be slower. See also Search Interface.
std::vector<size_type> locate(SearchState state, size_type first = NO_POSITION) const;
std::vector<size_type> locate(node_type node, range_type range, size_type first = NO_POSITION) const;
Example:
size_type first;
SearchState state = r_index.find(pattern.begin(), pattern.end(), first);
std::vector<size_type> result = r_index.locate(state, first);
This searches for vector_type pattern
in FastLocate r_index
and stores the identifiers of the paths containing it in result
.
The r-index can also decompress the part of the suffix array or the document array corresponding to a node. This is almost the same as locate()
, but the duplicates are not filtered out and the results are returned in order.
std::vector<size_type> decompressSA(node_type node) const;
std::vector<size_type> decompressDA(node_type node) const;
Example:
std::vector<size_type> result = r_index.decompressSA(node);
for(size_type i = 0; i < result.size(); i++)
{
result[i] = r_index.seqId(result[i]);
}
This is equivalent to decompressDA(node)
.
Example:
std::vector<size_type> result = r_index.decompressDA(node);
removeDuplicates(result, false);
This is equivalent to r_index.locate(r_index.index->find(node))
.
A suffix array value (offset
) encodes both a path identifier (seq_id
) and a sequence offset (seq_offset
) in size_type
. As a result, the product of the number of sequences and the length of the longest sequence must fit in 64 bits.
size_type pack(size_type seq_id, size_type seq_offset) const;
size_type seqId(size_type offset) const;
size_type seqOffset(size_type offset) const;
std::pair<size_type, size_type> unpack(size_type offset) const;
-
pack()
: Encode path identifier and sequence offset as a suffix array value. -
seqId()
: Extract the path identifier. -
seqOffset()
: Extract the sequence offset. -
unpack()
: Decode the path identifier and the sequence offset (in this order).
The following function is the backbone of the r-index:
size_type locateNext(size_type prev) const;
Given the value prev
at unknown suffix array offset i
, return the next value at suffix array offset i + 1
.
size_type locateFirst(node_type node) const;
This returns the first suffix array value in the given node. It can be used as a starting point for iterating over the node, as in decompressSA()
and decompressDA()
.
See also File Formats.
class FastLocate
{
Header header;
sdsl::int_vector<0> samples;
sdsl::sd_vector<> last;
sdsl::int_vector<0> last_to_run;
sdsl::int_vector<0> comp_to_run;
};
-
header
: R-index header (see below). -
samples
: The suffix array value at the start of each logical run in the BWT. In lexicographic order. -
last
: The suffix array value at the end of each logical run in the BWT. In text order, with the values marked with 1-bits. -
last_to_run
: The run identifier for each marked position inlast
. -
comp_to_run
: The cumulative number of logical runs before each record in the BWT.
struct FastLocate::Header
{
std::uint32_t tag;
std::uint32_t version;
std::uint64_t max_length;
std::uint64_t flags;
};
-
tag
: Must be0x6B3741D8
. -
version
: R-index version. Must be1
. -
max_length
: Length of the longest sequence in the index. Used forpack()
/unpack()
. -
flags
: Must be0
.