Skip to content

Query Interface

Jouni Siren edited this page Nov 10, 2021 · 28 revisions

General

All queries use a single thread. Unless otherwise noted, they are member functions of GBWT and DynamicGBWT. The structures are defined in gbwt.h and dynamic_gbwt.h, respectively.

The low-level query interface assumes that the parameter values are valid. The validity can be checked with contains() (see below).

Search state

A search state combines a node identifier and a range of offsets. The state is defined in algorithms.h.

struct SearchState
{
  node_type  node;
  range_type range;

  size_type size() const;
  bool empty() const;
};
  • node: Node identifier.
  • range: Range of record offsets.
  • size_type size() const: The same as Range::length(range).
  • bool empty() const: The same as Range::empty(range).

Bidirectional search state

A bidirectional state combines the search states for a pattern and its reverse. (The reverse path corresponds to the reverse complement of the DNA sequence.) It is defined in algorithms.h.

struct BidirectionalState
{
  SearchState forward, backward;

  size_type size() const;
  bool empty() const;

  void flip();
};
  • forward: The search state for the pattern.
  • backward: The search state for the reverse of the pattern.
  • size_type size() const: The same as forward.size(), which is equal to reverse.size().
  • bool empty() const: The same as forward.empty(), which is equal to reverse.empty().
  • void flip(): Reverses the orientation of the pattern; swaps the forward and reverse states.

Index statistics

These are fast inline functions, unless otherwise noted.

  • size_type size() const: Total length of the paths.
  • bool empty() const: Is the index empty?
  • size_type sequences() const: Number of paths (sequences) in the index.
  • size_type sigma() const: Alphabet size; largest node identifier + 1.
  • size_type effective() const: Effective alphabet size; largest record identifier + 1.
  • std::pair<size_type, size_type> runs() const: Number of runs in the BWT. The first value is the number of concrete runs in the encoding, while the second value is the number of logical runs. (Each endmarker is a separate logical run.) Expensive.
  • size_type samples() const: Number of sampled path identifiers in the index. Expensive in DynamicGBWT.
  • bool bidirectional() const: Does the index support bidirectional search?

Nodes

  • bool contains(node_type node) const: Is there a record for node node; is node in the effective alphabet?
  • bool contains(edge_type position) const: As above, but also checks that the record offset is valid.
  • bool contains(SearchState state) const: As above, but also checks that the range of record offsets is non-empty and valid.
  • bool hasEdge(node_type from, node_type to) const: Is there an edge from node from to node to.
  • std::vector<edge_type> edges(node_type from) const: Returns the edges from node from.
  • node_type firstNode() const: Node identifier for the first non-endmarker node.
  • comp_type toComp(node_type node) const: Record identifier for node node.
  • node_type toNode(comp_type comp) const: Node identifier for record comp.
  • size_type nodeSize(node_type node) const: Record size for node node; the number of occurrences of node in the BWT. Relatively expensive in GBWT.
  • bool empty(node_type node) const: Is the record for node node empty? Relatively expensive in GBWT.

Navigation and searching

These operations are based on positions expressed as a combination of a node identifier and a record offset. They return either positions or offsets. Errors are expressed with invalid_edge(), invalid_offset(), or Range::empty_range().

  • edge_type LF(node_type from, size_type i) const: Move from node from offset i to the next node on the path.
  • edge_type LF(edge_type position) const: As above.
  • size_type LF(node_type from, size_type i, node_type to) const: Start from node from offset i and get the next offset, assuming that the next node is to.
  • size_type LF(edge_type position, node_type to) const: As above.
  • range_type LF(node_type from, range_type range, node_type to) const: Start from range range of offsets in node from and get the range of successors in node to.
  • range_type LF(SearchState state, node_type to) const: As above.

DynamicGBWT also supports full LF-mapping that works regardless of whether the implied edge exists.

  • size_type fullLF(node_type from, size_type i, node_type to) const: As LF() with the same parameters. If there is no edge from from to to, the return value is the number of paths from any node v to to, where v is smaller than from.
  • size_type fullLF(edge_type position, node_type to) const: As above.

Bidirectional LF-mapping

These functions require a bidirectional index.

edge_type inverseLF(node_type from, size_type i) const`
edge_type inverseLF(edge_type position) const

Inverse LF-mapping from node from offset i moves to the previous node on the path. It is several times slower than going forward.

range_type bdLF(SearchState state, node_type to, size_type& reverse_offset) const

This function returns LF(state, to). It also stores the number of occurrences of nodes smaller than to in the query range in reverse_offset. This is the starting offset of the new reverse range relative to the start of the old reverse range.

Sequences

These queries convert between positions and path identifiers. Errors are expressed with invalid_edge() and invalid_sequence().

  • edge_type start(size_type sequence) const: The position for the start of path sequence.
  • size_type tryLocate(node_type node, size_type i) const: The sampled path identifier at node node, record offset i or invalid_sequence() if there is no sample.
  • size_type tryLocate(edge_type position) const: As above.
Clone this wiki locally