Ryan Wick edited this page May 4, 2020 · 35 revisions

Trycycler

The problem

Long-read assembly has come a long way in the last few years, and there are many good assemblers available, including Canu, Flye, Raven and Redbean. Since bacterial genomes are relatively simple (not too large, not too many repeats), a completed assembly (one contig per replicon) is often possible from a long-read set.

But even the best assemblers are not perfect! They often fail to circularise sequences, either duplicating or omitting sequence at the start/end of a contig. They sometimes produce spurious contigs, e.g. assembling a repetitive part of the chromosome into a separate contig. They sometimes omit entire replicons, e.g. failing to include a plasmid. They sometimes create medium-scale indel errors, e.g. deleting 50 bp from the genome. And they occasionally create large-scale misassemblies, e.g. a major structural rearrangement. Check out our paper comparing long-read assemblers for a more in-depth look at how assemblers perform.

So imagine that you've done long-read sequencing of a bacterial isolate and used a good assembler to produce an assembly. The result looks like a nice completed assembly (e.g. a big circular contig for the chromosome and a couple of smaller circular contigs for plasmids), but how can you be sure that it's free from the kinds of problems listed above?

The solution

Trycycler is a tool that takes as input multiple separate long-read assemblies of the same genome (e.g. from different assemblers and/or different subsets of the input reads) and produces a consensus long-read assembly.

In brief, Trycycler does the following:

  • Clusters the contig sequences, so the user can distinguish 'real' contigs (i.e. those that correspond to an entire replicon) from spurious and/or incomplete contigs.
  • Aligns the alternative contig sequences to each other and repairs circularisation issues.
  • Performs a multiple sequence alignment (MSA) of the alternative sequences.
  • Constructs a consensus sequence from the MSA by choosing between alternative options where the sequences differ.
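To make the last step more concrete, here is a minimal Python sketch of one simple way to derive a consensus from an MSA: a per-column majority vote. This is an illustrative assumption, not a description of how Trycycler chooses between alternatives. The function name `majority_consensus` and the toy alignment are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_seqs, gap="-"):
    """Return a consensus built by per-column majority vote.

    Hypothetical helper for illustration only; it is not Trycycler's
    actual consensus method. Ties go to whichever symbol appears first
    in the column, because Counter.most_common keeps insertion order.
    """
    if not aligned_seqs:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")

    consensus = []
    for column in zip(*aligned_seqs):
        symbol, _count = Counter(column).most_common(1)[0]
        # A winning gap means most sequences lack a base at this position,
        # so nothing is added to the consensus.
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Toy alignment: four alternative contigs, two of which have a 1 bp deletion.
msa = [
    "ACGTTAGC-A",
    "ACGTTAGCTA",
    "ACG-TAGCTA",
    "ACGTTAGCTA",
]
print(majority_consensus(msa))
```

In this toy example each deletion is outvoted by the other three sequences, which is the intuition behind combining multiple independent assemblies: an error made by one assembly is unlikely to appear at the same position in the others.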

The end result is a long-read assembly which you can trust!

One important caveat: Trycycler does not guarantee a perfect assembly of the underlying genome. This is because systematic basecalling errors can create consistent small-scale errors that appear in every input assembly. Homopolymers are a common source of this kind of error, e.g. AAAAAAAA becoming AAAAAAA (read more about this here). But if all goes well when running Trycycler, small-scale errors will be the only type of error in your long-read assembly. Short-read polishing (e.g. with Pilon) can be used to repair such small-scale errors. Therefore, a Trycycler-followed-by-Pilon approach to assembly can yield the best possible bacterial genome assembly: no medium-to-large-scale errors because of Trycycler and no small-scale errors because of Pilon.
