Removing Sequences
Removing sequences from GBWT is similar to inserting them. The implemented algorithm is an in-memory variant of the parallel merging algorithm. Multiple search threads search for the sequences to be removed, building the rank array in memory. The positions specified by the rank array are then removed from the index. The uncompressed rank array is held in memory and temporarily requires up to tens of bytes times the total length of the sequences being removed, so the algorithm is best suited for removing a small number of sequences.
Various cases:
- If the index is bidirectional, any request to remove sequence N will actually remove sequences Path::encode(N, false) and Path::encode(N, true). Otherwise sequence N will be removed instead.
- The set of sequence identifiers may contain duplicates, as they are removed during preprocessing.
- If at least one of the specified sequence identifiers is invalid, no sequences are removed.
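The cases above can be sketched as a preprocessing step. This is a minimal illustration, not the library's code: the function prepare_removal and the helper encode are hypothetical, and encode assumes that Path::encode(N, reverse) maps a path to 2 * N + reverse. The sketch also assumes that a bidirectional index with sequence_count sequences accepts identifiers below sequence_count / 2.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

typedef std::size_t size_type;

// Hypothetical stand-in for Path::encode(); assumes the 2 * N + reverse layout.
size_type encode(size_type path, bool reverse) { return 2 * path + reverse; }

// Returns the sorted, duplicate-free list of sequences to remove,
// or an empty list if at least one identifier is invalid.
std::vector<size_type>
prepare_removal(std::vector<size_type> ids, size_type sequence_count, bool bidirectional)
{
  // Duplicates are allowed in the input and dropped here.
  std::sort(ids.begin(), ids.end());
  ids.erase(std::unique(ids.begin(), ids.end()), ids.end());

  // One invalid identifier cancels the entire request (assumed validity bound).
  size_type limit = (bidirectional ? sequence_count / 2 : sequence_count);
  for(size_type id : ids) { if(id >= limit) { return {}; } }

  if(!bidirectional) { return ids; }

  // In a bidirectional index, remove both orientations of each sequence.
  std::vector<size_type> result;
  result.reserve(2 * ids.size());
  for(size_type id : ids)
  {
    result.push_back(encode(id, false));
    result.push_back(encode(id, true));
  }
  return result;
}
```

Under these assumptions, a request for sequences 3, 1, 3 in a bidirectional index with 10 sequences would expand to sequences 2, 3, 6, and 7.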
Sequences can be removed with remove_seq.
remove_seq [options] base_name seq1 [seq2 ...]
The program reads base_name.gbwt and removes the sequences with identifiers seq1, seq2, ... The output is written back to base_name.gbwt, unless specified otherwise. If no sequences were removed, the output will not be written.
- -c N: Use chunks of N sequences per search thread.
- -o X: Write the output to X.gbwt.
- -O: Output SDSL format instead of simple-sds format.
- -r: Remove the range of sequences seq1 to seq2 (inclusive). Requires exactly two sequence arguments.
- -S: Remove all sequences for the sample name seq1. Requires metadata with sample and path names.
- -C: Remove all sequences for the contig name seq1. Requires metadata with contig and path names.
If the index contains metadata with path names, sequences can only be removed by sample/contig names.
Example: remove_seq -r -o output input 11 20
Reads input.gbwt, removes sequences 11 to 20, and writes the result to output.gbwt.
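To illustrate how an inclusive range such as 11 20 expands into individual identifiers, here is a small sketch. The function expand_range is hypothetical and not part of remove_seq. Returning an empty list for a reversed range is an assumption of this sketch, not documented behavior.

```cpp
#include <cstddef>
#include <vector>

typedef std::size_t size_type;

// Expands the inclusive range [first, last] into a list of identifiers.
// A reversed range yields an empty list (an assumption of this sketch).
std::vector<size_type> expand_range(size_type first, size_type last)
{
  std::vector<size_type> result;
  if(first > last) { return result; }
  result.reserve(last - first + 1);
  for(size_type i = first; i <= last; i++) { result.push_back(i); }
  return result;
}
```

Calling expand_range(11, 20) yields the ten identifiers 11 through 20, matching the range in the example above.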
The following member functions of DynamicGBWT remove sequences from the index. The return value is the total length of the removed sequences, or 0 if no sequences were removed.
size_type remove(size_type seq_id, size_type chunk_size = REMOVE_CHUNK_SIZE);
size_type remove(const std::vector<size_type>& seq_ids, size_type chunk_size = REMOVE_CHUNK_SIZE);
- seq_id: Identifier of the sequence.
- seq_ids: Set of sequence identifiers to be removed.
- chunk_size: Use chunks of this many sequences per search thread.
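As a rough illustration of the chunk_size parameter, the sketch below splits a list of sequence identifiers into consecutive chunks that could each be handed to a search thread. The function make_chunks is hypothetical. How DynamicGBWT actually assigns chunks to threads is not shown here, and treating a chunk size of 0 as 1 is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::size_t size_type;

// Splits `total` sequences into half-open [begin, end) offset ranges,
// each covering at most `chunk_size` sequences.
std::vector<std::pair<size_type, size_type>>
make_chunks(size_type total, size_type chunk_size)
{
  std::vector<std::pair<size_type, size_type>> chunks;
  if(chunk_size == 0) { chunk_size = 1; } // Assumed guard against empty chunks.
  for(size_type begin = 0; begin < total; begin += chunk_size)
  {
    chunks.emplace_back(begin, std::min(begin + chunk_size, total));
  }
  return chunks;
}
```

With 10 sequences and a chunk size of 4, this produces three chunks covering offsets 0 to 4, 4 to 8, and 8 to 10. Smaller chunks balance the load across threads more evenly, while larger chunks reduce scheduling overhead.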
Example:
DynamicGBWT index;
sdsl::load_from_file(index, input_name);
std::vector<size_type> to_remove;
for(size_type i = 11; i <= 20; i++) { to_remove.push_back(i); }
index.remove(to_remove);
sdsl::store_to_file(index, output_name);
This reads the index from file input_name, removes sequences 11 to 20, and writes the resulting index to file output_name.