
Linear sequences

Ryan Wick edited this page Jan 22, 2025 · 16 revisions

Linear sequences can be challenging to assemble correctly, and assemblers often produce poor contigs for them. Autocycler assumes that at least some (but ideally most) of its input assemblies are mostly correct. If most or all of the input assemblies have major problems (e.g. truncating the ends of a linear sequence), then Autocycler won't be able to home in on a correct consensus sequence. This means that Autocycler may not be able to automatically complete linear sequences, in which case they will require manual intervention. See the Autocycler clean page for examples.

Hairpin ends

Some linear sequences have hairpin ends, where one strand of DNA loops back to become its complement strand. This means that long reads can continue past the hairpin onto the other strand, so the reads do not end at the sequence end. This can confuse long-read assemblers, and their contigs often extend past the hairpin. Autocycler trim looks for this type of overlap and will trim it when possible.
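To illustrate what a hairpin overrun looks like in a contig, here is a minimal Python sketch. It is not Autocycler trim's actual algorithm: the function names (`reverse_complement`, `hairpin_overrun`) and the `min_len` parameter are hypothetical, and it only handles exact matches, whereas real contigs contain errors and real hairpins may have an unpaired loop. The idea is that a contig which runs k bases past the hairpin ends with k bases that are the reverse complement of the k bases before them.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def hairpin_overrun(contig, min_len=4):
    """Return the largest k (>= min_len) such that the last k bases of the
    contig are the exact reverse complement of the k bases before them.
    Returns 0 if no such k exists. Exact matching is an assumption made
    for illustration only."""
    contig = contig.upper()
    for k in range(len(contig) // 2, min_len - 1, -1):
        tail = contig[-k:]
        before = contig[-2 * k:-k]
        if tail == reverse_complement(before):
            return k
    return 0


# Toy example: the true sequence ends at the hairpin, and the contig
# continues 6 bases onto the other strand.
true_seq = "GATTACAGGCTTAACG"
contig = true_seq + reverse_complement(true_seq[-6:])
overrun = hairpin_overrun(contig)
trimmed = contig[:len(contig) - overrun]
print(overrun, trimmed == true_seq)
```

In this toy case, removing the detected overrun recovers the sequence that ends at the hairpin. A contig that terminates before the hairpin contains no such overlap, so there is nothing to anchor the trim on.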

Some input contigs may be too short (terminating before the hairpin) and some may be too long (extending past the hairpin). The latter is better for Autocycler, as it will allow Autocycler trim to trim the sequence at the right place. You may therefore benefit from curating your input assemblies to include contigs which extend past the hairpin. Autocycler dotplot can be a useful tool to reveal which contigs do/don't extend past the hairpin.

Also, if your linear sequence has hairpin ends, you should be careful with quality-based read filtering. The part of the read which extends past the hairpin may be lower quality, dragging down the average read quality. So if you aggressively filter with Filtlong (which prefers reads with a higher average quality), you may deplete reads which span the hairpin, i.e. reads which reach the end of the sequence. Thanks to Nemanja Kuzmanovic for figuring this one out!
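The effect of a lower-quality tail on average read quality can be shown with some simple arithmetic. The sketch below assumes a plain mean of per-base Phred scores and made-up quality values; it is not Filtlong's actual scoring, and the function name `mean_quality` is hypothetical.

```python
def mean_quality(quals):
    """Plain arithmetic mean of per-base Phred scores (an illustrative
    simplification, not how any particular tool scores reads)."""
    return sum(quals) / len(quals)


# Hypothetical read: 8000 bases at Q20 before the hairpin, followed by
# 2000 lower-quality bases (Q8) after the read passes the hairpin.
before_hairpin = [20] * 8000
after_hairpin = [8] * 2000

print(mean_quality(before_hairpin))                  # part before the hairpin only
print(mean_quality(before_hairpin + after_hairpin))  # whole read, with the tail
```

Under these assumed values, the whole read scores lower than its pre-hairpin portion, so a strict average-quality threshold could discard exactly the reads that reach the sequence end.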

Open ends

I will refer to non-hairpin ends of a linear sequence as 'open ends'. When a linear sequence has an open end, contigs are less likely to be overly long. Insufficiently long contigs, i.e. contigs which terminate before the sequence end, are more of a problem. Some linear sequences have terminal proteins covalently attached to the DNA ends, which may affect long-read sequencing.

When manually curating an open-end linear sequence, you may want to extend the final consensus sequence as far as possible, using the longest of the input sequences. Then a post-assembly read-mapping step can inform whether any trimming of the consensus sequence is warranted.
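As a rough sketch of how read mapping could inform trimming, the Python below computes per-base depth from alignment intervals and reports the span of the consensus that has at least a chosen depth. The function name `supported_span`, the `min_depth` threshold and the interval format (0-based, end-exclusive `(start, end)` tuples, as might be extracted from any read aligner's output) are all assumptions for illustration. This is not a feature of Autocycler.

```python
def supported_span(contig_len, alignments, min_depth=3):
    """Return (start, end) of the region between the first and last
    positions whose read depth is >= min_depth, or None if no position
    qualifies. Alignments are 0-based, end-exclusive (start, end) tuples."""
    # Difference array: +1 where an alignment starts, -1 where it ends.
    diff = [0] * (contig_len + 1)
    for start, end in alignments:
        start = max(0, min(start, contig_len))
        end = max(0, min(end, contig_len))
        if end > start:
            diff[start] += 1
            diff[end] -= 1

    depth = 0
    first = last = None
    for pos in range(contig_len):
        depth += diff[pos]  # running sum gives depth at this position
        if depth >= min_depth:
            if first is None:
                first = pos
            last = pos
    if first is None:
        return None
    return first, last + 1


# Toy example: a 100 bp consensus where only one read reaches the last 10 bp.
alignments = [(0, 80), (0, 85), (0, 90), (10, 100)]
print(supported_span(100, alignments, min_depth=2))
```

With these toy intervals, the final 10 bases are covered by a single read, so they fall outside the supported span and would be candidates for trimming. In practice the threshold is a judgement call, since depth naturally tapers towards the ends of a linear sequence.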

Terminal inverted repeats

Some linear bacterial sequences have terminal inverted repeats (TIRs). If these are longer than the read length, the TIR may assemble into a single collapsed contig, leading to a fragmented incomplete assembly. If this happens consistently in the input assemblies, the genome may not be suitable for Autocycler (which requires mostly complete input assemblies).
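To make the definition concrete, a TIR means that the end of the sequence is the reverse complement of its start. The sketch below measures the length of an exact TIR in a finished sequence. The function names are hypothetical, and exact matching is a simplification: real TIRs may contain mismatches, which would require an alignment-based approach.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def tir_length(seq):
    """Return the length of the longest prefix of seq that exactly equals
    the reverse complement of the same-length suffix (capped at half the
    sequence length)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    k = 0
    limit = len(seq) // 2
    # The first k bases of rc are the reverse complement of the last k bases
    # of seq, so matching prefixes of seq and rc indicate an inverted repeat.
    while k < limit and seq[k] == rc[k]:
        k += 1
    return k


# Toy example: an 8 bp TIR flanking a unique middle region.
tir = "ACCGTTAG"
seq = tir + "TTTTTCCCCC" + reverse_complement(tir)
print(tir_length(seq))
```

Comparing a TIR length estimated this way (or from a dot plot) against your read length distribution can indicate whether reads are likely to span the repeat.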

Even if the TIR is resolved in input assemblies, it may collapse in Autocycler's unitig graph. If this happens, the 5_final.gfa file made by Autocycler resolve will contain multiple contigs, and the consensus assembly will therefore be incomplete and require manual resolution of the TIR with Autocycler clean.
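One quick way to check for this is to count the segments in the final graph. The sketch below parses GFA segment (`S`) lines and returns each segment's name and length. It relies only on the general GFA format (tab-separated fields, optional `LN:i:` length tag when the sequence is `*`) and makes no assumptions about Autocycler-specific tags. The function name `gfa_segments` is hypothetical.

```python
def gfa_segments(lines):
    """Return {segment_name: length} for every S line in a GFA file.
    Length comes from the sequence field, or from an LN:i: tag when the
    sequence is given as '*'."""
    segments = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue
        name, sequence = fields[1], fields[2]
        if sequence != "*":
            segments[name] = len(sequence)
        else:
            length = None
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):
                    length = int(tag[5:])
            segments[name] = length
    return segments


# Toy GFA content with two segments joined by a link.
example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGTACGT",
    "S\t2\t*\tLN:i:500",
    "L\t1\t+\t2\t+\t0M",
]
print(gfa_segments(example))
```

To use this on a real file, pass an open file handle, e.g. `with open("5_final.gfa") as f: print(gfa_segments(f))`, adjusting the path to wherever the file sits in your output directory. More segments than expected for a replicon suggests the graph was not fully resolved.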
