FAQ and miscellaneous tips
- Human judgement and manual intervention
- How can I tell if my read set is good enough for Trycycler?
- How do I run Trycycler on linear replicons?
- Can Trycycler help me to assemble my difficult genome?
- Will Trycycler work on metagenomes?
- Will Trycycler work on eukaryote genomes?
- Will Trycycler help to assemble a part of a eukaryote genome?
- What's with the name?
- Should I use Unicycler or Trycycler to assemble my bacterial genome?
- Do I need to use multiple different assemblers for Trycycler's input?
- Flye variation via thread count
- Manually fixing overlap
- Identifying contamination
- Should I polish my sequences with Racon before using Medaka?
- Can I run Medaka before Trycycler?
- What about PacBio reads?
- Can I assign weights to input assemblies?
- A note on MUSCLE versions
- Building an MSA with MAFFT
- How do I cite Trycycler?
Some steps of Trycycler are not fully automatic, i.e. they require the user to think about the results and possibly change things manually. Specifically, there are two main points in the Trycycler pipeline where judgement/intervention is needed. The first is after the Clustering contigs step, where users must decide which clusters are "real" (correspond to a replicon in the genome) and which are not. The second is during the Reconciling contigs step, where users may have to exclude certain contigs, manually repair circularisation issues or adjust settings.
To make effective judgement calls, it's important that users understand what Trycycler is doing in these steps. To that end, Trycycler's output contains explanatory text and recommendations for how to proceed.
While a fully automatic Trycycler would certainly be nice, I don't know if it's possible. Trycycler, at least in its current state, is not a tool for high-throughput assembly of lots of bacterial genomes. Rather, it's a tool for meticulously getting an assembly just right.
Trycycler does best with very nice input reads: deep (ideally 100× or more) and long (reliably longer than the longest repeat in your genome). If your read set isn't great, you might wonder whether or not you should attempt using Trycycler.
If your read set is very shallow (25× or less), then Trycycler is probably not for you. If you also have short reads for your genome, Unicycler is probably a better choice (see the Unicycler vs Trycycler question). If you don't have short reads, then I'd recommend Flye which seems better than other long-read assemblers at coping with shallow read sets (see our benchmarking paper for more info).
If you're in a grey area (reads aren't great but not too bad either), you should give Trycycler a shot. The results of the contig clustering step will then be a good indicator as to whether you should continue. I.e. if your contigs form nice clear clusters, then your reads are good enough. But if your contigs form messy ambiguous clusters, then you'll need better reads before using Trycycler.
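As a rough illustration of the depth thresholds above, the following sketch estimates mean depth as total read bases divided by genome size and maps it onto the three situations described. The function names and the example numbers are hypothetical, and depth alone says nothing about read length, which matters just as much.

```python
def estimate_depth(read_lengths, genome_size):
    """Mean read depth = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def classify_read_set(depth, shallow=25.0, deep=100.0):
    """Map a depth onto the three situations described in the text.

    The 25x and 100x defaults come from the text; treat them as rough
    guides rather than hard limits.
    """
    if depth <= shallow:
        return "shallow: Trycycler is probably not suitable"
    if depth >= deep:
        return "deep: well suited to Trycycler"
    return "grey area: try Trycycler and inspect the contig clusters"


# Hypothetical example: 30,000 reads of 10 kbp each for a 5 Mbp genome.
depth = estimate_depth([10_000] * 30_000, 5_000_000)
print(f"{depth:.0f}x -> {classify_read_set(depth)}")
```

For a grey-area result, the clustering step remains the real test: clear clusters mean the reads are good enough.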
While Trycycler was mainly built to work with circular replicons, it can work with linear replicons too! Just use the --linear option when running trycycler reconcile and trycycler consensus.
The main problem you might encounter is having to know that your replicon is linear before you begin. For some species (e.g. Borrelia burgdorferi), the replicon's linear nature might be known in advance, which makes things easier. But if your genome has a linear plasmid that you weren't aware of in advance, it could be harder. Your first clue will likely be that the corresponding contig cluster will fail to circularise when running Trycycler reconcile. If this happens, you should dig deeper (e.g. BLASTing to known sequences or examining a short-read assembly graph in Bandage, see this paper) to decide whether or not the replicon is linear. Another clue might come from running Trycycler dotplot – if all the contigs in a cluster start/end at the same position, that suggests that the sequence is linear.
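A minimal sketch of what the two affected commands might look like for one cluster, built here as argument lists without running anything. The cluster directory and read file names are hypothetical, and the multiple sequence alignment and read partitioning steps that sit between reconcile and consensus are omitted.

```python
import shlex


def linear_commands(cluster_dir, reads):
    """Build the reconcile and consensus commands for a linear replicon."""
    reconcile = ["trycycler", "reconcile", "--linear",
                 "--reads", reads, "--cluster_dir", cluster_dir]
    consensus = ["trycycler", "consensus", "--linear",
                 "--cluster_dir", cluster_dir]
    return [reconcile, consensus]


# Hypothetical paths, for illustration only.
for command in linear_commands("trycycler/cluster_002", "reads.fastq"):
    print(shlex.join(command))
```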
You should think of Trycycler as a tool that helps you take a genome which is reasonably easy to assemble and make it as perfect as possible. It's not a tool that helps you take a very difficult genome and get it assembled.
So if your read set is particularly challenging to assemble (e.g. it has lots of big repeats) and you're getting poor assemblies (e.g. inconsistent or fragmented contigs), then Trycycler will probably not help you. If you're in this situation, you probably need a better long-read set: longer, deeper and more accurate.
In principle, yes, it could! However, remember that Trycycler is all about cleaning up and optimising completed bacterial sequences, where the entire chromosome/plasmid has been assembled into one contig. In a long-read metagenome assembly, completely assembled sequences are only likely to occur for high-abundance genomes, i.e. low-abundance genomes will likely result in fragmented contigs. And genomes can fragment for other reasons as well – metagenome assembly is hard!
For Trycycler, this means that you'll probably want to filter your input assemblies to only contain completed contigs, which will make the clustering step much easier. Regarding how to determine completeness, the circularity and depth of a contig would be good indicators. E.g. if a contig is circular and of decent read depth (>20×), it's probably complete. However, this approach will obviously not work for linear chromosomes/plasmids.
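The filter described above could be sketched like this, assuming you already have each contig's circularity and depth (for example from an assembler's per-contig summary). The Contig class, the contig names and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int
    circular: bool
    depth: float


def likely_complete(contig, min_depth=20.0):
    """Circular and of decent depth, as suggested in the text.

    Linear replicons will always fail this test, so they need to be
    judged some other way.
    """
    return contig.circular and contig.depth > min_depth


# Made-up contigs from a metagenome assembly.
contigs = [
    Contig("contig_1", 4_800_000, True, 85.0),
    Contig("contig_2", 310_000, False, 12.0),
    Contig("contig_3", 52_000, True, 140.0),
    Contig("contig_4", 95_000, True, 8.0),
]
kept = [c.name for c in contigs if likely_complete(c)]
print(kept)
```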
If all goes well, you should be able to use Trycycler to get improved contigs for some of the high abundance genomes in your metagenome. But there are a lot of potential complications in metagenomics, so your mileage may vary!
The short answer is: probably not. This is related to the previous question – Trycycler assumes that long-read assemblers can at least somewhat reliably assemble your genome to completion, i.e. one contig per replicon without any major misassemblies. In my limited experience with eukaryote genomes, this is usually not the case, due to genome complexity and/or diploidy.
However, if you have very long, very deep read sets and you can reliably get chromosome-scale contigs, then by all means, give Trycycler a try!
This is related to the previous question, but narrower in scope. For example, let's say you were trying to get a nice assembly of the MHC region of a human genome.
Trycycler could be a useful tool here, as long as you prepared the input contigs. Specifically, you should ensure that each of your input contigs starts and ends in the same place – i.e. they are alternative assemblies of the exact same region. Then you could indeed give them to Trycycler!
Diploidy could be a problem, however. Trycycler attempts to make its consensus best match the true sequence, but in a diploid genome there is not one but two true sequences! I would therefore recommend first phasing your long reads into haplotypes, then running assembly/Trycycler on each haplotype separately.
Trycycler is something of a sequel to Unicycler, a tool I made for hybrid bacterial genome assembly. Before I started development, Torsten Seemann and I were discussing the possibility of combining multiple separate long-read assemblies into a more confident consensus, and he suggested the name "Trycycler" for such a tool. Thanks, Torsten!
If you only have short reads for your genome, then the answer is clear: Unicycler (Trycycler does not do short-read assembly). If you only have long reads, then you should use Trycycler.
Assuming you have both long and short reads, then Unicycler and Trycycler+polishing are both viable options for assembly. So which should you use?
If you have lots of long reads (~100× depth or more), use Trycycler+polishing. If you have sparse long reads (~25× or less), use Unicycler. If your long-read depth falls between those values, it might be worth trying both approaches.
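The rules of thumb above can be written down as a small decision function. This is only a sketch: the function name is hypothetical, a depth of zero is used here to mean "no long reads", and the thresholds are the approximate values from the text.

```python
def suggest_approach(has_short_reads, long_read_depth,
                     shallow=25.0, deep=100.0):
    """Suggest an assembly approach from the available read sets."""
    has_long_reads = long_read_depth > 0
    if not has_short_reads and not has_long_reads:
        raise ValueError("no reads available")
    if not has_long_reads:
        return "Unicycler"            # short reads only
    if not has_short_reads:
        return "Trycycler"            # long reads only
    if long_read_depth >= deep:
        return "Trycycler+polishing"  # plenty of long reads
    if long_read_depth <= shallow:
        return "Unicycler"            # sparse long reads
    return "try both"                 # in-between depth


print(suggest_approach(has_short_reads=True, long_read_depth=150))
print(suggest_approach(has_short_reads=True, long_read_depth=60))
```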
Unicycler works best when the short-read set is very good (deep and complete coverage) which yields a nice short-read assembly graph for scaffolding. Conversely, when Unicycler fails, it's usually due to problems with the short-read assembly graph. The Trycycler+polishing approach is much less dependent on the quality of the short-read set. However, Trycycler requires a deep long-read set while Unicycler does not.
I have also seen occasions where small misassemblies occur within short-read contigs in Unicycler (made by SPAdes). This usually happens in repetitive regions of the genome. Since Unicycler builds its final assembly by scaffolding the short-read contigs, any misassemblies they contain will persist in the final assembly. Trycycler seems to do much better in such regions.
Unicycler was built in a different time (2016) when Oxford Nanopore read sets could be quite shallow, so it was necessary to rely more on short-read sets. Since then, improvements in Oxford Nanopore yield have largely fixed that problem. So as I write this in 2020, I view Trycycler+polishing as the best way to do a hybrid bacterial genome assembly, with Unicycler as a fall-back option for cases where your short-read set is good but your long-read set is weak.
In the Generating assemblies instructions, I use multiple different assemblers to generate the input assemblies for Trycycler. You might wonder if instead you could just use one assembler, e.g. a favourite like Flye.
This might work, so you are welcome to try if you want. However, I have noticed cases where one assembler tends to make the same mistake, even on different read subsets. Using multiple assemblers helps to guard against this possibility.
Using multiple assemblers may also help to ensure that Trycycler succeeds with its circularisation repair. I have seen cases where all of the input contigs generated by Flye have the same start/end position. If these were the only assemblies given to Trycycler, it would not have enough information to do circularisation repair. I.e. Trycycler requires that input contigs for circular replicons have a variety of start/end positions.
Flye assemblies are non-deterministic, i.e. the same read set and command can yield different results across multiple runs. @gbouras13 noticed that this is especially true if you use different thread counts (e.g. 4, 8, 16), providing a possible source of input variation for Trycycler. This could be useful if you have lower read depth which makes it more difficult to get input variation. For example, you could generate additional input assemblies for Trycycler by assembling the same read subsets using Flye with different thread counts.
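One way to set this up is to pair every read subset with every thread count and build one Flye command per pair. The sketch below only constructs the commands. The file names and output directory layout are hypothetical, and the --nano-hq read-type flag is an assumption that should be swapped for whichever Flye read-type option suits your data.

```python
import shlex
from itertools import product


def flye_commands(read_subsets, thread_counts, read_flag="--nano-hq"):
    """Build one Flye command per (read subset, thread count) pair."""
    commands = []
    for reads, threads in product(read_subsets, thread_counts):
        # e.g. "read_subsets/sample_01.fastq" -> "sample_01"
        stem = reads.rsplit("/", 1)[-1].split(".")[0]
        out_dir = f"assemblies/flye_{stem}_t{threads}"
        commands.append(["flye", read_flag, reads,
                         "--threads", str(threads),
                         "--out-dir", out_dir])
    return commands


# Hypothetical subset files and thread counts.
commands = flye_commands(
    ["read_subsets/sample_01.fastq", "read_subsets/sample_02.fastq"],
    [4, 8, 16],
)
for command in commands:
    print(shlex.join(command))
```

Two subsets and three thread counts give six Flye runs, each of which may or may not differ from the others.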
Replicons sometimes suffer from too much start-end overlap for Trycycler reconcile to continue. E.g. an 18 kbp plasmid which has assembled into a 29 kbp contig (~11 kbp of duplicated sequence). For smaller plasmids, a complete doubling of the plasmid is possible. This is most common in Canu and NextDenovo/NextPolish assemblies, but it can happen in other assemblers as well.
This can result in Trycycler reconcile refusing to continue like this:
Input contigs:
trycycler/cluster_006/1_contigs/A_tig00000004.fasta (29,356 bp)
trycycler/cluster_006/1_contigs/B_contig_2.fasta (18,017 bp)
trycycler/cluster_006/1_contigs/C_utg000002c.fasta (18,103 bp)
trycycler/cluster_006/1_contigs/D_bctg00000001.fasta (18,186 bp)
trycycler/cluster_006/1_contigs/E_Utg1.fasta (17,838 bp)
...
Relative sequence lengths:
A_tig00000004: 1.000 1.629 1.622 1.614 1.646
B_contig_2: 0.614 1.000 0.995 0.991 1.010
C_utg000002c: 0.617 1.005 1.000 0.995 1.015
D_bctg00000001: 0.619 1.009 1.005 1.000 1.020
E_Utg1: 0.608 0.990 0.985 0.981 1.000
Error: there is too much length difference between contigs. You must either
exclude or repair the offending contig sequences and then try running trycycler
reconcile again. If one of the sequences is too long, it could be due to excessive
circularisation overlap, and trimming that overlap may allow trycycler reconcile
to continue.
In this case, it is clear that contig A_tig00000004 has an abnormal length relative to the other contigs. I usually solve cases like this by manually finding and fixing the overlap:
- Open the A_tig00000004.fasta file in a text editor (like Atom).
- Select a small (~50 bp) sample of sequence from near the start of the contig.
- Search for that sequence in the whole contig.
- Ideally, I will find two instances of that sequence: the one I originally selected and another much further on.
- If the spacing between the two instances corresponds to the apparent true length of the contig (~18 kbp in this case), then I assume they represent a single and complete copy of the plasmid sequence.
- I then use the text editor to delete everything before the first instance and everything after (and including) the second instance. This will give a cleanly circularised contig that should be accepted by Trycycler reconcile:
Input contigs:
trycycler/cluster_006/1_contigs/A_tig00000004.fasta (17,940 bp)
trycycler/cluster_006/1_contigs/B_contig_2.fasta (18,017 bp)
trycycler/cluster_006/1_contigs/C_utg000002c.fasta (18,103 bp)
trycycler/cluster_006/1_contigs/D_bctg00000001.fasta (18,186 bp)
trycycler/cluster_006/1_contigs/E_Utg1.fasta (17,838 bp)
...
Relative sequence lengths:
A_tig00000004: 1.000 0.996 0.991 0.986 1.006
B_contig_2: 1.004 1.000 0.995 0.991 1.010
C_utg000002c: 1.009 1.005 1.000 0.995 1.015
D_bctg00000001: 1.014 1.009 1.005 1.000 1.020
E_Utg1: 0.994 0.990 0.985 0.981 1.000
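The manual text-editor procedure above could also be scripted. The following is a minimal sketch of the same idea working on a plain sequence string (FASTA reading and writing are left out). The function name, its parameters and the 10% length tolerance are hypothetical, and the test data is random toy sequence rather than a real plasmid.

```python
import random


def trim_circular_overlap(seq, seed_start=0, seed_len=50,
                          expected_len=None, tolerance=0.1):
    """Return one copy of a circular sequence from an over-long contig.

    A seed is taken near the start of the contig and its second occurrence
    is located. Everything before the first occurrence, and everything from
    the second occurrence onward, is discarded.
    """
    seed = seq[seed_start:seed_start + seed_len]
    if len(seed) < seed_len:
        raise ValueError("contig is too short for the requested seed")

    second = seq.find(seed, seed_start + 1)
    if second == -1:
        raise ValueError("seed occurs only once: no overlap found")
    if seq.find(seed, second + 1) != -1:
        raise ValueError("seed occurs more than twice: pick another seed")

    # The spacing should match the apparent true length of the replicon.
    spacing = second - seed_start
    if expected_len is not None:
        if abs(spacing - expected_len) > tolerance * expected_len:
            raise ValueError(f"spacing of {spacing} bp does not match "
                             f"the expected length of {expected_len} bp")
    return seq[seed_start:second]


# Toy data: a random 1,000 bp "plasmid" assembled with 600 bp of
# duplicated sequence at the end.
rng = random.Random(0)
plasmid = "".join(rng.choice("ACGT") for _ in range(1000))
contig = plasmid + plasmid[:600]

trimmed = trim_circular_overlap(contig, expected_len=1000)
print(len(contig), "->", len(trimmed))
```

Because long reads contain errors, the two copies in a real contig may not match exactly, so an exact-match search can fail. Trying a different seed position is the scripted equivalent of selecting a different sample in the text editor.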
Unfortunately, contamination can happen, and this means that your input assemblies might contain contigs which originated from a different genome. I most often see this occur with cross-barcode contamination, where a contig in one assembly belongs to a different genome from the same multiplexed sequencing run.
There is no quick-and-easy way to decide whether a cluster is real or contamination, but I can offer some tips:
- Look at the read depth: lower-than-expected depth is a red flag. For example, if the chromosome in your genome has a depth of 200× and you also have a cluster of short sequences that are 40× (i.e. only one-fifth the chromosomal depth), then I would be suspicious of that shorter cluster.
- Use BLAST or a tool like Kraken2 to classify the contigs in your clusters. If you see an odd mismatch, that could be a red flag. But keep in mind that some plasmids have a broad host range and therefore can't be classified to the species level.
- Compare suspicious contigs to other genomes that were multiplexed on the same sequencing run. If there is strong cross-barcode similarity and you didn't expect the two genomes to be closely related, that's a red flag.
Here's an example to put it all together. Imagine that you have a cluster in a Salmonella genome that is made up of small contigs (~10 kbp). These contigs are low depth (lower than the Salmonella chromosome), and when you BLAST them, they look like a Staphylococcus plasmid. It turns out that a Staphylococcus genome was also part of the same sequencing run, and your cluster is a very close match to one of its plasmids. All of this evidence adds up to tell you that the contig is cross-barcode contamination and should be deleted.
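The depth check from the first tip is easy to automate as a first pass. This sketch flags clusters whose depth is well below the chromosome's. The cluster names, depths and the 0.5 cut-off are hypothetical, and a flagged cluster still needs the BLAST and cross-barcode checks before you decide to delete it.

```python
def low_depth_clusters(cluster_depths, chromosome_depth, min_ratio=0.5):
    """Return {cluster: depth ratio} for clusters below min_ratio.

    The ratio is cluster depth divided by chromosomal depth. Only low
    ratios are flagged, since the text treats lower-than-expected depth
    as the red flag.
    """
    flagged = {}
    for name, depth in cluster_depths.items():
        ratio = depth / chromosome_depth
        if ratio < min_ratio:
            flagged[name] = ratio
    return flagged


# Made-up depths mirroring the 200x chromosome / 40x cluster example.
depths = {"cluster_001": 200.0, "cluster_002": 310.0, "cluster_003": 40.0}
suspicious = low_depth_clusters(depths, chromosome_depth=200.0)
print(suspicious)
```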
In the Polishing after Trycycler section, I suggested running Medaka on each contig. I did not suggest running Racon first, which contrasts with an old recommendation in the Medaka documentation. So you might wonder: to Racon or not to Racon?
I would recommend against running Racon before Medaka. This is because in my experience, Racon is more likely to introduce errors than fix errors in a Trycycler contig. So I suggest skipping Racon and going right to Medaka.
In this wiki, I describe an assembly process where you run Trycycler first and then use Medaka to polish the result. You might wonder if you could instead run Medaka before Trycycler. I.e. generate your input assemblies, run Medaka on each of them, then use Trycycler to make a consensus.
Yes, I believe this would be a perfectly valid approach! I opted for the Medaka-last method because it's simpler: you only have to run Medaka a single time instead of multiple times (once on each of your input assemblies). But either way should work. I haven't done an investigation to see whether one approach can reliably give better results than the other, but I would hypothesise that they are equivalent.
While I have more experience with Oxford Nanopore, Trycycler is appropriate for PacBio read sets as well!
If you're using older PacBio CLR reads, then instead of doing post-Trycycler polishing with Medaka (which is ONT-specific), you should use a PacBio-based polishing tool like GCpp.
If you're using modern PacBio HiFi reads, then no post-assembly polishing should be necessary, as the HiFi reads themselves are polished. Note that because of the low error rate of HiFi reads, some assemblers have HiFi presets, so you might need to change your commands in the Generating assemblies step. E.g. using --pacbio-hifi for Flye.
If for some reason you trust some of your input assemblies more than others, you might wonder if you could give extra weight to those assemblies in Trycycler's consensus.
While this feature is not built into Trycycler, you could hack it by manually duplicating the contigs you trust more. It would make sense to do this in the 2_all_seqs.fasta file created by the Reconciling contigs step. E.g. after reconciling contigs, open the 2_all_seqs.fasta file in a text editor and make extra copies of the contigs you trust most. These will then get extra weight in the consensus step.
It's worth noting that I haven't personally encountered a situation where it was necessary to do this, so it should be an unusual course of action.
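If you would rather script the duplication than use a text editor, a sketch along these lines might do. It works on FASTA text held in a string, and the helper names and contig names are hypothetical. Giving each copy a suffixed name is an assumption made here to keep headers unique. The text above only says to make extra copies, so check how Trycycler handles the names you choose.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def add_weight(records, extra_copies):
    """Append extra copies of trusted contigs right after the originals."""
    weighted = []
    for name, seq in records:
        weighted.append((name, seq))
        for i in range(extra_copies.get(name, 0)):
            # Assumption: suffix the copy so every header stays unique.
            weighted.append((f"{name}_copy{i + 1}", seq))
    return weighted


def to_fasta(records):
    """Serialise (name, sequence) tuples back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Toy input standing in for the contents of 2_all_seqs.fasta.
text = ">A_contig_1\nACGT\nACGT\n>B_contig_2\nTTTT\n"
weighted = add_weight(read_fasta(text), {"A_contig_1": 2})
print(to_fasta(weighted), end="")
```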
MUSCLE is a multiple-sequence aligner that's been around for a long time. When I first wrote Trycycler, MUSCLE v3.8 was the latest version, but since then MUSCLE v5 has come out (they seem to have skipped v4). The new version of MUSCLE has a different command line interface, so any version of Trycycler before v0.5.2 will not work with MUSCLE v5. Trycycler v0.5.2 and later should work with both MUSCLE v3 and MUSCLE v5.
In my initial tests, v3 and v5 produced very similar results in Trycycler, but v5 is slower. However, users have reported memory issues and missing contigs with MUSCLE v5. For these reasons, I recommend sticking with the old version: MUSCLE v3.
Also, some users have reported that Trycycler MSA can crash, and it seems that MUSCLE versions may be behind the problem: v3.8.1551 worked but v3.8.31 did not. So make sure to use the most recent version of MUSCLE v3. Thanks to @marade for figuring that out.
Some users have reported cases where Trycycler's Multiple sequence alignment fails because some MUSCLE jobs didn't finish for unclear reasons. Egon Ozer shared a workaround in this issue: using MAFFT to generate the 3_msa.fasta file instead. This works because Trycycler only needs a global MSA in FASTA format – it doesn't care what program made the MSA.
You can run MAFFT online (here or here) or on the command line. Note that this may be quite slow, so be patient. Give the 2_all_seqs.fasta file as input and rename the output to 3_msa.fasta, and subsequent Trycycler steps should work.
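For the command-line route, a minimal sketch might look like the following. MAFFT writes its alignment to standard output, so the wrapper redirects that into 3_msa.fasta. The function names and the cluster path are hypothetical, and the --auto and --thread options are assumptions about reasonable settings, not the exact command from the linked issue.

```python
import shlex
import subprocess
from pathlib import Path


def mafft_command(input_fasta, threads=1):
    """Build a MAFFT command; the alignment goes to standard output."""
    return ["mafft", "--auto", "--thread", str(threads), str(input_fasta)]


def build_msa_with_mafft(cluster_dir, threads=1):
    """Align 2_all_seqs.fasta and write the result to 3_msa.fasta."""
    cluster_dir = Path(cluster_dir)
    command = mafft_command(cluster_dir / "2_all_seqs.fasta", threads)
    with open(cluster_dir / "3_msa.fasta", "w") as msa_file:
        subprocess.run(command, stdout=msa_file, check=True)


# Show the command for a hypothetical cluster without running MAFFT.
print(shlex.join(mafft_command("trycycler/cluster_001/2_all_seqs.fasta",
                               threads=8)))
```

Calling build_msa_with_mafft("trycycler/cluster_001") would run the alignment itself, provided MAFFT is installed and on your PATH.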