
Removing Sequences

Jouni Siren edited this page May 8, 2021 · 6 revisions

General

Removing sequences from a GBWT index is similar to inserting them. The implemented algorithm is an in-memory variant of the parallel merging algorithm. Multiple search threads search for the sequences to be removed, building the rank array in memory. The positions specified by the rank array are then removed from the index. Because the uncompressed rank array is stored in memory, the algorithm temporarily requires up to tens of bytes for each position in the sequences being removed. It is therefore best suited for removing a small number of sequences.

Various cases:

  • If the index is bidirectional, any request to remove sequence N will actually remove sequences Path::encode(N, false) and Path::encode(N, true). Otherwise sequence N will be removed instead.
  • The set of sequence identifiers may contain duplicates, as they are removed during preprocessing.
  • If at least one of the specified sequence identifiers is invalid, no sequences are removed.
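The cases above can be illustrated with a minimal sketch of the preprocessing step. This is not the library's implementation. The helper names `encode_path` and `prepare_removal` are hypothetical, and the sketch assumes that Path::encode(N, false) and Path::encode(N, true) map path N to sequences 2N and 2N + 1, and that `sequences` is the total number of sequences in the index (both orientations in a bidirectional index).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using size_type = std::uint64_t;

// Hypothetical stand-in for Path::encode(): assumes that the two orientations
// of path N are stored as sequences 2N (forward) and 2N + 1 (reverse).
inline size_type encode_path(size_type path_id, bool reversed)
{
  return 2 * path_id + (reversed ? 1 : 0);
}

// Returns the sorted, duplicate-free list of sequences to remove, or an empty
// vector if any identifier is invalid, so that nothing is removed in that case.
std::vector<size_type>
prepare_removal(std::vector<size_type> ids, size_type sequences, bool bidirectional)
{
  // Duplicates are harmless: sort the identifiers and drop the repeats.
  std::sort(ids.begin(), ids.end());
  ids.erase(std::unique(ids.begin(), ids.end()), ids.end());

  // In a bidirectional index, each path occupies two sequence slots.
  size_type limit = (bidirectional ? sequences / 2 : sequences);

  std::vector<size_type> result;
  for(size_type id : ids)
  {
    if(id >= limit) { return {}; } // One invalid identifier rejects the request.
    if(bidirectional)
    {
      result.push_back(encode_path(id, false));
      result.push_back(encode_path(id, true));
    }
    else { result.push_back(id); }
  }
  return result;
}
```

Under these assumptions, `prepare_removal({1, 1}, 10, true)` yields sequences 2 and 3, while `prepare_removal({1, 10}, 10, false)` yields an empty list because identifier 10 is out of range.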

Remove tool

Sequences can be removed with remove_seq.

remove_seq [options] base_name seq1 [seq2 ...]

The program reads base_name.gbwt and removes the sequences with identifiers seq1, seq2, and so on. The output is written back to base_name.gbwt unless another output is specified with -o. If no sequences were removed, no output is written.

  • -c N: Use chunks of N sequences per search thread.
  • -o X: Write the output to X.gbwt.
  • -O: Output SDSL format instead of simple-sds format.
  • -r: Remove the range of sequences seq1 to seq2 (inclusive). Requires exactly two sequence arguments.
  • -S: Remove all sequences for the sample name seq1. Requires metadata with sample and path names.
  • -C: Remove all sequences for the contig name seq1. Requires metadata with contig and path names.

If the index contains metadata with path names, sequences can only be removed by sample or contig name (options -S and -C).

Example:

remove_seq -r -o output input 11 20

This reads input.gbwt, removes sequences 11 to 20, and writes the result to output.gbwt.

Interface

The following member functions of DynamicGBWT remove sequences from the index. The return value is the total length of the removed sequences, or 0 if no sequences were removed.

size_type remove(size_type seq_id, size_type chunk_size = REMOVE_CHUNK_SIZE);
size_type remove(const std::vector<size_type>& seq_ids, size_type chunk_size = REMOVE_CHUNK_SIZE);
  • seq_id: Identifier of the sequence.
  • seq_ids: Set of sequence identifiers to be removed.
  • chunk_size: Use chunks of this many sequences per search thread.

Example:

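// Load the index from file input_name.
DynamicGBWT index;
sdsl::load_from_file(index, input_name);
// Collect the identifiers of sequences 11 to 20 (inclusive) and remove them.
std::vector<size_type> to_remove;
for(size_type i = 11; i <= 20; i++) { to_remove.push_back(i); }
index.remove(to_remove);
// Write the modified index to file output_name.
sdsl::store_to_file(index, output_name);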

This reads the index from file input_name, removes sequences 11 to 20, and writes the resulting index to file output_name.
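Because remove() returns 0 when nothing was removed, a caller can skip the write in that case, much as remove_seq does. The following is a minimal sketch of that pattern, not part of the GBWT interface. The helper `remove_and_store` is hypothetical, and it only assumes that the index type provides a remove() member that accepts the identifier container and returns the total length of the removed sequences.

```cpp
#include <vector>

// Hypothetical helper: removes the sequences in `ids` from `index` and calls
// `store(index)` only if something was actually removed.
// Returns true if the index changed and was stored.
template<class Index, class Ids, class Store>
bool remove_and_store(Index& index, const Ids& ids, Store store)
{
  auto removed = index.remove(ids); // Total length of the removed sequences.
  if(removed == 0) { return false; } // Nothing removed: skip the write.
  store(index);
  return true;
}
```

With the example above, this could be called as `remove_and_store(index, to_remove, [&](const DynamicGBWT& g) { sdsl::store_to_file(g, output_name); });`.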
