Linear sequences

Ryan Wick edited this page Nov 5, 2024 · 16 revisions

Linear sequences can be challenging to assemble correctly, and assemblers often produce poor contigs for them. Autocycler assumes that at least some (but ideally most) of its input assemblies are mostly correct. If most or all of the input assemblies have major problems (e.g. truncating the ends of a linear sequence), then Autocycler won't be able to home in on a correct consensus sequence. This means that Autocycler may not be able to automatically complete linear sequences, in which case they will require manual intervention.

Hairpin ends

Some linear sequences have hairpin ends, where one strand of DNA loops back to become its complement strand. This means that long reads can continue past the hairpin onto the other strand, so the reads do not end at the sequence end. This can confuse long-read assemblers, and their contigs often extend past the hairpin, ending with a stretch that is the reverse complement of the sequence just before it. Autocycler trim looks for this type of overlap and will trim it when possible.
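To illustrate what a hairpin overshoot looks like in a contig, here is a minimal Python sketch. It is not Autocycler's implementation: the function names `revcomp` and `hairpin_overshoot` and the `min_overlap` parameter are hypothetical, and it assumes an exact match with no unpaired loop bases at the hairpin tip, which real data will rarely satisfy.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def hairpin_overshoot(contig, min_overlap=4):
    """Return how many bases at the contig's end run past a hairpin.

    A contig that extends past a hairpin ends with k bases that are the
    reverse complement of the k bases immediately before them. Returns 0
    if no such overlap of at least min_overlap bases is found.
    Hypothetical helper for illustration only (exact matching, no loop).
    """
    best = 0
    for k in range(min_overlap, len(contig) // 2 + 1):
        if contig[-k:] == revcomp(contig[-2 * k:-k]):
            best = k
    return best


# Toy example: the true sequence plus 5 bases read back along the other strand.
true_seq = "GATTACAGGC"
too_long = true_seq + revcomp(true_seq)[:5]

extra = hairpin_overshoot(too_long)
trimmed = too_long[:-extra] if extra else too_long
print(extra, trimmed)
```

A contig that terminates before the hairpin contains no such overlap, so there is nothing to locate the true end by. This is why contigs that are too long are more useful than contigs that are too short.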

Some input contigs may be too short (terminating before the hairpin) and some may be too long (extending past the hairpin). The latter is better for Autocycler, as it will allow Autocycler trim to trim the sequence at the right place. You may therefore benefit from curating your input assemblies to include contigs which extend past the hairpin. Autocycler dotplot can be a useful tool to reveal which contigs do or don't extend past the hairpin.

Also, if your linear sequence has hairpin ends, you should be careful with quality-based read filtering. The part of the read which extends past the hairpin may be lower quality, dragging down the average read quality. So if you aggressively filter with Filtlong (which prefers reads with a higher average quality), you may deplete reads which span the hairpin, i.e. reads which reach the end of the sequence. Thanks to Nemanja Kuzmanovic for figuring this one out!
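The effect on average quality can be shown with simple arithmetic. The sketch below uses a plain arithmetic mean of Phred scores, which is an assumption for illustration and not necessarily how Filtlong computes its quality score. The read lengths and quality values are made-up numbers, and `mean_quality` is a hypothetical helper.

```python
def mean_quality(phred_scores):
    """Arithmetic mean of per-base Phred scores (illustrative assumption)."""
    return sum(phred_scores) / len(phred_scores)


# Hypothetical 10 kbp read that stays on one strand: uniform Q20.
normal_read = [20] * 10000

# Hypothetical 10 kbp read that reaches the sequence end and continues
# 2 kbp past the hairpin, where the quality drops to Q8.
hairpin_read = [20] * 8000 + [8] * 2000

print(mean_quality(normal_read))   # 20.0
print(mean_quality(hairpin_read))  # 17.6
```

Under these assumed numbers, a filter that ranks reads by mean quality would place the hairpin-spanning read below the ordinary read, even though it is the one that reaches the end of the sequence.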

Blunt ends

Terminal inverted repeats

Some linear bacterial sequences have terminal inverted repeats (TIRs), where one end of the sequence is the reverse complement of the other end. If these are longer than the read length, the two copies of the TIR may assemble into a single collapsed contig, leading to a fragmented, incomplete assembly. If this happens consistently in the input assemblies, the genome may not be suitable for Autocycler (which requires mostly complete input assemblies).
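If you have a candidate complete sequence, one rough way to gauge whether TIRs could be a problem is to measure how far the start of the sequence matches the reverse complement of its end, then compare that length to your read lengths. The following Python sketch does this with exact matching. The function `tir_length` is hypothetical and not part of Autocycler, and real TIRs may contain mismatches that this simple check would stop at.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def tir_length(seq):
    """Length of the exact terminal inverted repeat of a linear sequence.

    Walks inward from both ends, counting positions where the base at the
    start is the complement of the mirrored base at the end.
    Hypothetical helper for illustration only (no mismatches tolerated).
    """
    n = 0
    while n < len(seq) // 2 and seq[n] == COMPLEMENT[seq[-1 - n]]:
        n += 1
    return n


# Toy sequence: a 7 bp TIR, a unique middle, then the TIR's reverse complement.
toy = "ACCGTTA" + "GGGCATT" + "TAACGGT"
print(tir_length(toy))
```

If the measured TIR is longer than most of your reads, few or no reads will span from the unique part of the genome through the whole repeat, which is the situation in which a collapsed TIR contig becomes likely.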