Skip to content

Autocycler trim

Ryan Wick edited this page Dec 18, 2024 · 28 revisions

Basics

Long-read contigs sometimes contain excess sequence which needs to be trimmed off. This often occurs with circular sequences (typical for bacteria) where there is overlap between the start and end of a contig. For small plasmids, this can sometimes involve the entire sequence being duplicated (or more) in a single contig. Linear sequences with hairpin ends can also create excess sequence which needs to be trimmed.

This step is performed on a per-cluster basis. For each contig, Autocycler looks for two types of excess sequence: start-end circular overlap and hairpin overlap. Contigs are trimmed as necessary. After trimming, Autocycler discards any contigs with outlier lengths, so the remaining contigs in the cluster will all be in close agreement with each other.

Example command

for c in autocycler_out/clustering/qc_pass/cluster_*; do
    autocycler trim -c "$c"
done

Autocycler trim is typically run on each of the QC-pass clusters, so the command above is in a Bash loop.

This command takes a cluster directory (represented by "$c" in the example) which must contain a 1_untrimmed.gfa file (created by the previous step, Autocycler cluster). It will create a 2_trimmed.gfa file in the cluster directory.

Full usage

Usage: autocycler trim [OPTIONS] --cluster_dir <CLUSTER_DIR>

Options:
  -c, --cluster_dir <CLUSTER_DIR>    Autocycler cluster directory containing 1_untrimmed.gfa file
                                     (required)
      --min_identity <MIN_IDENTITY>  Minimum alignment identity for trimming alignments [default: 0.75]
      --max_unitigs <MAX_UNITIGS>    Maximum unitigs used for overlap alignment, set to 0 to disable
                                     trimming [default: 5000]
      --mad <MAD>                    Allowed variability in cluster length, measured in median absolute
                                     deviations, set to 0 to disable exclusion of length outliers
                                     [default: 5.0]
  -t, --threads <THREADS>            Number of CPU threads [default: 8]
  -h, --help                         Print help
  -V, --version                      Print version

Notes

  • --max_unitigs 0 will turn off trimming. This is appropriate if you manually trimmed the sequences before running Autocycler.
  • Using --mad 0 will turn off outlier exclusion, so all contigs will be included in the output 2_trimmed.gfa file.
  • Autocycler trim cannot distinguish between artifactual and genuine duplications. E.g. if a plasmid really is doubled, Autocycler trim will still cut it down to a single copy.
  • The 2_trimmed.gfa graph made by Autocycler trim contains GFA path lines for the trimmed input contigs, so it can be used with Autocycler decompress.
  • The median absolute deviation is a good measure of how well trimming went. Ideally, this will be low (<50 bp or so). If it's high (e.g. 1000 bp), that suggests a lot of inconsistency in the contigs, which is a red flag.
  • Since identity is calculated using unitigs not bases (see implementation below), the default value for --min_identity of 0.75 represents a relatively strong alignment. That parameter can be reduced to much lower levels (e.g. --min_identity 0.25) to find weaker alignments.

Implementation details

Autocycler trim uses a dynamic-programming alignment algorithm, similar to Needleman-Wunsch and Smith-Watermanbut. It essentially performs a self-vs-self overlap alignment, with the following modifications:

  • Instead of aligning bases directly, it aligns the unitig path, which greatly improves performance. Scores are weighted by the length of the unitig (matches are positive scores, mismatches and indels are negative scores).
  • The matrix diagonal cells are always set to zero, preventing the algorithm from choosing the whole-vs-whole alignment.
  • The final chosen alignment can only extend from the top edge of the matrix to the right edge of the matrix.
  • The alignment matrix is limited in size (using --max_unitigs) to improve performance on sequences with a large number of unitigs.
  • Alignment identity is defined as the total match score divided by the alignment length. Since the alignment is based on unitigs, this identity will be lower than a traditional nucleotide alignment.

Optional manual intervention

  • You can run Autocycler dotplot before/after trimming to inspect:
    • How the sequences relate to each other.
    • Whether trimming was successful.
    • Whether linear sequences were sufficiently long to trim (see Linear sequences).

Toy example

(The toy example is introduced on the Autocycler compress page.)

Of all the input contigs in the toy example, only b1 and b2 contain overlap. These are their paths through the unitig graph, with the overlap highlighted and the overlap-free trimmed paths below:

Autocycler trimming paths

Note that the alignment does not need to be exact. Contig b2 has a variant in its overlap (unitig 38 vs 34), but the alignment is still sufficiently high identity for trimming to occur.

After trimming is complete, the cluster graph is simplified and saved as 2_trimmed.gfa. The result is similar to the untrimmed graph but slightly simpler due to the removed pieces:

trimmed cluster graphs