Skip to content

Autocycler subsample

Ryan Wick edited this page Dec 18, 2024 · 17 revisions

Basics

The autocycler subsample command creates multiple read subsets from a single long-read dataset, minimising overlap to ensure each subset is as independent as possible. Each subset is then used in the next step to generate a separate assembly. Creating diverse input assemblies increases the likelihood of capturing all sequences in the genome while reducing the risk of shared assembly errors.

Example command

autocycler subsample --reads ont.fastq.gz --out_dir subsampled_reads --genome_size 5.5m

This command takes an input read set named ont.fastq.gz and will create the subsampled_reads directory which will the subsampled read files.

Full usage

Usage: autocycler subsample [OPTIONS] --reads <READS> --out_dir <OUT_DIR> --genome_size <GENOME_SIZE>

Options:
  -r, --reads <READS>                    Input long reads in FASTQ format (required)
  -o, --out_dir <OUT_DIR>                Output directory (required)
  -g, --genome_size <GENOME_SIZE>        Estimated genome size (required)
  -c, --count <COUNT>                    Number of subsampled read sets to output [default: 4]
  -d, --min_read_depth <MIN_READ_DEPTH>  Minimum allowed read depth [default: 25.0]
  -s, --seed <SEED>                      Seed for random number generator [default: 0]
  -h, --help                             Print help
  -V, --version                          Print version

Notes

  • A genome size estimate is required for this step and for some assemblers in the next step. If you don't know the size of the genome you're assembling, see this page: Genome size estimation.
  • The genome size estimate does not need to be exact, ±10% is fine.
  • Suffixes can be used in the genome size estimate, e.g. 4m, 4000k and 4000000 are equivalent.
  • Autocycler subsample reads the input FASTQ twice: first to get read statistics, then again to write the subsampled files. Because of this, the input must be a literal file – process substitutions like <(cat *.fastq.gz) are not supported.

Implementation details

Autocycler subsample is based on Trycycler subsample, sharing much of its logic but with a faster implementation.

Step 1: shuffling

The first thing Autocycler subsample does is randomly shuffle the reads. This is to account for any trends in the read set. For example, some ONT sequencing runs suffer from a drop in quality as the run progresses, so if your reads are in temporal order, the first n reads could be better than the last n reads. Randomly shuffling protects against this.

Step 2: determining the subset depth

Autocycler subsample then decides what read depth each subset will be. This involves balancing two opposing goals:

  • Maximising the independence of the read subsets.
  • Ensuring sufficient read depth in each subset for a good assembly.

These goals can be contradictory because higher subset read depths can help with assembly but decrease independence (due to more overlap between subsets). Conversely, lower subset read depths can make assembly harder but increase independence.

Autocycler subsample uses this function to find a read subset depth: y = m log2(4x/m)/2

  • x is the full read set depth, y is the subsampled read depth and m is the minimum subset read depth
  • The deeper your input reads, the deeper your read subsets will be (with logarithmic growth).
  • Explore this function interactively on Desmos

In plain language, this function says that an input read set of 25× depth will result in subsets of 25× depth (i.e. they'll be the same as the input set). It then requires a quadrupling of the input depth to add another 25× to the subsets. So 100× input depth gives 50× subset depth, 400x input depth gives 75× subset depth, 1600× input depth gives 100× subset depth, etc. The --min_read_depth option lets you change the value of m.

I chose these values because 25× is a good lower bound (read sets below this depth often assemble poorly), 50× is a good target (all assemblers seem to work well in this range) and anything over 100× is unlikely to significantly improve assemblies.

Step 3: assigning reads to subsets

Autocycler subsample uses the mean read length to calculate how many reads should go in each subset to achieve the target subset read depth. This means the different subsets won't have the exact same number of bases, but they will be close.

To maximise the independence of the subsets, the starting read index is offset by r/c, where r is the read count and c is the subset count. The value of c is controlled by the --count option.

Examples

In these examples, the shuffled input reads are shown as a bar at the top. The subsets are coloured bars, with their position indicating how much they overlap with each other. I used --min_read_depth 35 and --count 12 for each.



read subset example 1

A 150× depth read set represents a typical case for Autocycler: not spectacular but not bad either. Each resulting subsampled set is completely independent from a few others but has some shared reads with most other sets. For example, read set 1 has no overlap with sets 6, 7, and 8, but it does overlap somewhat with the other sets.



read subset example 2

Since we have lower input depth in this example, Autocycler has made the subset depth lower as well. The subsets also overlap more, and no two subsets are completely independent.



read subset example 3

This example has a very deep input set. The subsets are therefore deeper than in the the previous two examples, which might help them to assemble better, and they are quite independent, with each subset only sharing some overlap with two others.

Clone this wiki locally