Skip to content

Generating input assemblies

Ryan Wick edited this page Jan 6, 2025 · 27 revisions

Basics

Each of the read subsets made by Autocycler subsample can now be assembled, producing multiple alternative assemblies of the same genome. It is beneficial to use different assemblers, as different tools can make different errors. While these assemblies are not produced by Autocycler, I have produced some helper scripts for common long-read assemblers to make them easier to run.

Importantly, the input assemblies should be complete, i.e. each sequence in the genome should be assembled into a single contig. Autocycler can tolerate a small number of fragmented input assemblies, but if most or all input assemblies are fragmented, Autocycler will not work. In order to achieve this, you will likely need reads which are at least as long as the longest repeat in the genome. For most bacterial genomes, 10kbp reads will be sufficient, but some genomes with long repeats will require longer reads.

Example commands

threads=16  # set as appropriate for your system
genome_size=$(genome_size_raven.sh "$reads" "$threads")  # can set this manually if you know the value

mkdir assemblies
for assembler in canu flye miniasm necat nextdenovo raven; do
    for i in 01 02 03 04; do
        "$assembler".sh subsampled_reads/sample_"$i".fastq assemblies/"$assembler"_"$i" "$threads" "$genome_size"
    done
done

These commands make use of the helper scripts (see below) and assume that Autocycler subsample was run with the default value of --count 4.

Assembly helper scripts

The helper scripts are written in Bash and can be found in the scripts directory of the Autocycler repository.

There are nine helper scripts available: canu.sh, flye.sh, lja.sh, metamdbg.sh, miniasm.sh, necat.sh, nextdenovo.sh, raven.sh and redbean.sh.

Each takes the same arguments in the same order:

  1. Input reads (FASTQ format)
  2. Location/prefix of output file(s) (excluding file extension)
  3. Thread count
  4. Genome size (only required by some scripts but can be safely included for all)

All scripts will produce a FASTA file as their final result with the name prefix.fasta. For tools that also produce a GFA file, this will be saved as prefix.gfa. The GFA file is not used by Autocycler, but it can be helpful for manual curation of assembly quality (see below). The assemblies are performed in temporary directories, so all other files (intermediate files, logs, etc) will be deleted.

The software requirements for each script are described at the top of the file. Some scripts (e.g. flye.sh) only require the assembler itself. Others (e.g. miniasm.sh) require other tools as well. If you use conda to manage your software dependencies, I have created a conda environment file which includes all tools needed for all helper scripts (see Software requirements and installation).

Notes

  • Autocycler itself is relatively quick to run, so this step (generating input assemblies) will be the slowest step of a full Autocycler assembly. See Parallelising input assemblies for some suggestions on speeding this up.
  • While the above commands are written to run in serial (one after another), the input assemblies can be generated in parallel to save time, if you have the computational resources to do so.
  • If something goes wrong when running a helper script (e.g. the assembler was not installed properly), the resulting file will be empty, and Autocycler compress will not be able to run. At least for your first time using an Autocycler installation, it's good to check that all input assemblies have an appropriate size before continuing.
  • Canu assemblies can be very accurate but it often takes much longer to run than other assemblers. I recommend using Canu if you have the time and computational resources, but you may wish to exclude it for performance reasons.
  • In my experience, NECAT can sometimes crash and fail to produce an assembly. Re-running with a different thread count often allows it to complete successfully.
  • I did not include LJA's helper script in the example commands because it was designed specifically for PacBio HiFi reads. However, I have tried it on modern ONT reads (R10.4.1 with sup basecalling) and it worked well.
  • I did not include metaMDBG's helper script in the example commands because it was designed specifically for metagenomes. However, it seems to work well on isolates, and the helper script includes a depth filter to remove low-depth contigs from the assembly.
  • I did not include Redbean's helper script in the example commands because in my experience it produces lower quality assemblies. It is fast to run, however, so may be appropriate when computational resources are limited.

Optional manual intervention

Before continuing to the next step, consider inspecting each input assembly to identify and remove incomplete or problematic assemblies. This optional step helps ensure the rest of Autocycler's pipeline proceeds smoothly.

Consider the following:

  • Check chromosome length: For most bacterial genomes, a single chromosome constitutes the majority of the genome. If an assembly lacks a contig of the expected chromosomal length, it is likely fragmented. For example, an E. coli chromosome is ~5 Mbp in length, so a complete E. coli assembly should contain a ~5 Mbp contig. If an E. coli assembly instead contains a 3 Mbp and 2 Mbp contig, it is almost certainly incomplete.
  • Visual inspection of GFA: For assemblers that produce a GFA graph file, a visual inspection in Bandage (or the currently maintained fork BandageNG) can be useful. Most bacterial sequences are circular, and a contig with a single circularising link (connecting its start to its end) in the graph is likely complete.
  • Input read sufficiency: If one or two or the input assemblies contain incomplete sequences, that's generally not a concern. But if many or most input assemblies are fragmented, that suggests that your input read set may be insufficient. For example, your genome may contain a long repeat exceeding your read length.
  • What to remove: If an assembly has a fragmented sequence, you can delete the fragmented contigs but leave the rest of the assembly. For example, if an assembly had an incomplete chromosome but all the plasmids were complete, then you can just delete the chromosomal fragments and leave the complete plasmids.
Clone this wiki locally