Skip to content

Mediocre dataset analysis

Ryan Wick edited this page Jul 13, 2020 · 6 revisions

Trycycler cluster

To begin, I ran Trycycler cluster on the assemblies:

trycycler cluster --reads reads.fastq.gz --assemblies assemblies/*.fasta --out_dir trycycler

Which produced this output.

There were nine clusters, but everything but the first two only contain a single contig. Here's the tree:

Mediocre dataset tree

I therefore conclude that cluster 1 is the chromosome and cluster 2 is the plasmid. All other clusters look like junk so I get rid of them:

mv trycycler/cluster_003 trycycler/bad_cluster_003
mv trycycler/cluster_004 trycycler/bad_cluster_004
mv trycycler/cluster_005 trycycler/bad_cluster_005
mv trycycler/cluster_006 trycycler/bad_cluster_006
mv trycycler/cluster_007 trycycler/bad_cluster_007
mv trycycler/cluster_008 trycycler/bad_cluster_008
mv trycycler/cluster_009 trycycler/bad_cluster_009

Prefixing the clusters with bad_ will allow me to later glob for all good clusters with trycycler/cluster_*. Alternatively, you could just delete the bad cluster directories.

Trycycler reconcile – cluster 1

I then ran Trycycler reconcile on the first (chromosomal) cluster:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001

Which produced this output. One of the sequences (F_utg000001c) failed to circularise due to multiple start/end hits, so I removed it and ran Trycycler reconcile again:

mv trycycler/cluster_001/1_contigs/F_utg000001c.fasta trycycler/cluster_001/1_contigs/F_utg000001c.fasta.bad
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001

Which produced this output. Now a different sequence (I_utg000001l) failed to circularise, this time due to a too-large start/end gap (consistent with its shorter length). I removed it and ran Trycycler reconcile again:

mv trycycler/cluster_001/1_contigs/I_utg000001l.fasta trycycler/cluster_001/1_contigs/I_utg000001l.fasta.bad
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001

Which produced this output. Nearly there! The circularisation completed, but one of the sequences (C_contig_3) looked terrible in the final alignment check. So I removed it and ran Trycycler reconcile again:

mv trycycler/cluster_001/1_contigs/C_contig_3.fasta trycycler/cluster_001/1_contigs/C_contig_3.fasta.bad
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001

Which produced this output. Finally, cluster 1 is reconciled!

Trycycler reconcile – cluster 2

I then ran Trycycler reconcile on the second (plasmid) cluster:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002

Which produced this output. We encountered the same problem as in the Good dataset analysis, where there was a doubled-version of the plasmid in one of the contigs. I manually repaired that contig and ran Trycycler reconcile again:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002

Which produced this output. Everything looks good this time!

Trycycler MSA, partition and consensus

Since these final three steps usually don't require human intervention, I'll run them all at once here. I'll also use Bash loops to run the cluster-specific commands on each cluster. While this may look a bit clunky, it's nice and reusable as it will work for Trycycler genomes with any number of clusters:

for c in trycycler/cluster_*; do
    trycycler msa --cluster_dir "$c"
done
trycycler partition --reads reads.fastq.gz --cluster_dirs trycycler/cluster_*
for c in trycycler/cluster_*; do
    trycycler consensus --cluster_dir "$c"
done
cat trycycler/cluster_*/7_final_consensus.fasta > assembly.fasta

There are the outputs: cluster 1 msa, cluster 2 msa, partition, cluster 1 consensus and cluster 2 consensus.

Final thoughts

This genome had a few snags, but we were still able to get a nice assembly. The final sequences contain no significant errors: just some small-scale mistakes and wrong homopolymer lengths.

Clone this wiki locally