Terminology and conventions

Ted Verhey edited this page Nov 9, 2017 · 11 revisions

Annotated variable regions

Annotated variable regions are a set of intervals along the reference sequence that correspond to canonically defined variable regions. They can be specified for each reference in a database using vast annotate_vr.

Left-justified alignments

Smith-Waterman alignment of two sequences often results in multiple equivalently scoring alignments. In VAST, alignments are represented as mappings, which are lists of the insertions, deletions, and substitutions required to transform a reference sequence into the read being mapped to it. A sequence can therefore have multiple mappings. Often, these mappings differ because of repetitive sequence, where a gap or insertion of a repeat can be located in multiple positions. Left-justification is the process of selecting the mapping that is left-justified, i.e., the mapping with the most sequence density towards the left end of the read. In VAST, left-justification is achieved by selecting the mapping that has the insertions closest to the beginning of the read and the deletions closest to the end of the read.
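The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not VAST's implementation: the tuple representation (operation, position, information), the function names, and the way insertion and deletion positions are combined into a single score are all assumptions made for the example.

```python
# Hypothetical representation (not VAST's data model): a mapping is a list of
# (operation, position, information) tuples with 0-based reference positions.

def justification_key(mapping):
    """Score a mapping; lower means more left-justified.

    Assumption: early insertions and late deletions are both rewarded by
    combining their positions into one number.
    """
    insertion_total = sum(pos for op, pos, _ in mapping if op == "insertion")
    deletion_total = sum(pos for op, pos, _ in mapping if op == "deletion")
    return insertion_total - deletion_total


def left_justify(equivalent_mappings):
    """Pick one mapping from a list of equivalently scoring mappings."""
    return min(equivalent_mappings, key=justification_key)


# Reference ACTTTG, read ACTTG: any one of the three T bases could be deleted,
# giving three equivalent mappings.
candidates = [
    [("deletion", 2, 1)],
    [("deletion", 3, 1)],
    [("deletion", 4, 1)],
]
chosen = left_justify(candidates)  # the deletion closest to the end of the read
```

With this scoring, the deletion at position 4 is chosen, which pushes the gap to the right and keeps the aligned bases packed towards the left end of the read.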

Map distance

The map distance is a number quantifying how different two mappings are. For any two mappings, it is the size of the symmetric difference of the bases inserted, deleted, or substituted in the two mappings, that is, the number of single-base edits present in one mapping but not in the other.

The map distance therefore has several well-defined properties: the map distance is equal for any equivalent mappings of a single read compared to the reference, and the map distance between identical mappings is zero.
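As a rough illustration of the definition, the sketch below expands each mapping into a set of single-base edits and counts the edits found in only one of the two sets. The tuple representation, the function names, and the way individual edits are keyed are assumptions for this example and may differ from how VAST computes the distance.

```python
# Hypothetical representation (not VAST's data model): a mapping is a list of
# (operation, position, information) tuples with 0-based reference positions.

def edited_bases(mapping):
    """Expand a mapping into a set of single-base edits."""
    edits = set()
    for op, pos, info in mapping:
        if op == "insertion":
            # info is the inserted sequence: one edit per inserted base
            edits.update(("insertion", pos, i, base) for i, base in enumerate(info))
        elif op == "deletion":
            # info is the number of deleted reference bases
            edits.update(("deletion", pos + i) for i in range(info))
        elif op == "substitution":
            # info is the replacement base
            edits.add(("substitution", pos, info))
        else:
            raise ValueError(f"unknown operation: {op}")
    return edits


def map_distance(mapping_a, mapping_b):
    """Size of the symmetric difference of the two sets of edited bases."""
    return len(edited_bases(mapping_a) ^ edited_bases(mapping_b))


unedited = []                        # the reference mapped to itself
gap_at_2 = [("deletion", 2, 1)]      # two equivalent mappings of the same read
gap_at_4 = [("deletion", 4, 1)]

same = map_distance(gap_at_2, gap_at_2)
to_reference = (map_distance(gap_at_2, unedited), map_distance(gap_at_4, unedited))
```

Under these assumptions, identical mappings are at distance zero, and both equivalent mappings are the same distance from the unedited reference.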

Mapping

A mapping is a particular alignment of a sequence to the reference sequence, and can be represented as a list of insertions, deletions, and substitutions required to transform the reference into that sequence. Therefore, a mapping represents both the alignment and the sequence of the read being mapped. Mappings are only meaningful relative to a reference sequence.

Example

With a 45 bp reference sequence (positions 0 to 44):

0         10        20        30        40
|123456789|123456789|123456789|123456789|1234
ACTGCCACTCTTTTTGCGAATCAGTTTAACCTAGGTTCAACCTTT

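A mapping to the reference could be as follows. Positions are 0-based. For an insertion, the information is the sequence inserted after the given position; for a substitution, it is the replacement base; for a deletion, it is the number of bases deleted starting at the given position.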

Operation     Position  Information
Insertion     3         ACC
Substitution  8         G
Substitution  9         G
Substitution  10        C
Deletion      27        9

This would lead to the following edits (shown at the *):

0            10        20        30        40
|123---456789|123456789|123456789|123456789|1234
ACTGACCCCACGGCTTTTGCGAATCAGTTT---------TCAACCTTT
    ***    ***                *********

By applying all of the operations, we can reconstruct the read sequence from its mapping:

0         10        20        30       
|123456789|123456789|123456789|12345678
ACTGACCCCACGGCTTTTGCGAATCAGTTTTCAACCTTT
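
The reconstruction in this example can be reproduced with a short Python sketch. It assumes the conventions the example appears to use (0-based positions, insertions placed after the given position, deletions given as a start position and a length); the function name and tuple layout are hypothetical and are not taken from VAST.

```python
# Hypothetical helper (not part of VAST): rebuild a read from a reference
# sequence and a mapping given as (operation, position, information) tuples.

def apply_mapping(reference, mapping):
    read = list(reference)
    # Apply edits from right to left so that earlier positions stay valid.
    for op, pos, info in sorted(mapping, key=lambda edit: edit[1], reverse=True):
        if op == "insertion":
            read[pos + 1:pos + 1] = info      # insert after reference position pos
        elif op == "substitution":
            read[pos] = info                  # replace a single base
        elif op == "deletion":
            del read[pos:pos + info]          # remove info bases starting at pos
        else:
            raise ValueError(f"unknown operation: {op}")
    return "".join(read)


reference = "ACTGCCACTCTTTTTGCGAATCAGTTTAACCTAGGTTCAACCTTT"
mapping = [
    ("insertion", 3, "ACC"),
    ("substitution", 8, "G"),
    ("substitution", 9, "G"),
    ("substitution", 10, "C"),
    ("deletion", 27, 9),
]
read = apply_mapping(reference, mapping)
print(read)
```

Applying the edits from the highest position to the lowest avoids having to track how insertions and deletions shift the coordinates of later edits.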