-
Notifications
You must be signed in to change notification settings - Fork 6
Autocycler subsample
The autocycler subsample
command creates multiple read subsets from a single long-read dataset, minimising overlap to ensure each subset is as independent as possible. Each subset is then used in the next step to generate a separate assembly. Creating diverse input assemblies increases the likelihood of capturing all sequences in the genome while reducing the risk of shared assembly errors.
autocycler subsample --reads ont.fastq.gz --out_dir subsampled_reads --genome_size 5.5m
This command takes an input read set named ont.fastq.gz
and will create the subsampled_reads
directory which will the subsampled read files.
Usage: autocycler subsample [OPTIONS] --reads <READS> --out_dir <OUT_DIR> --genome_size <GENOME_SIZE>
Options:
-r, --reads <READS> Input long reads in FASTQ format (required)
-o, --out_dir <OUT_DIR> Output directory (required)
-g, --genome_size <GENOME_SIZE> Estimated genome size (required)
-c, --count <COUNT> Number of subsampled read sets to output [default: 4]
-d, --min_read_depth <MIN_READ_DEPTH> Minimum allowed read depth [default: 25.0]
-s, --seed <SEED> Seed for random number generator [default: 0]
-h, --help Print help
-V, --version Print version
- A genome size estimate is required for this step and for some assemblers in the next step. If you don't know the size of the genome you're assembling, see this page: Genome size estimation.
- The genome size estimate does not need to be exact, ±10% is fine.
- Suffixes can be used in the genome size estimate, e.g. 4m, 4000k and 4000000 are equivalent.
- Autocycler subsample reads the input FASTQ twice: first to get read statistics, then again to write the subsampled files. Because of this, the input must be a literal file – process substitutions like
<(cat *.fastq.gz)
are not supported.
Autocycler subsample is based on Trycycler subsample, sharing much of its logic but with a faster implementation.
The first thing Autocycler subsample does is randomly shuffle the reads. This is to account for any trends in the read set. For example, some ONT sequencing runs suffer from a drop in quality as the run progresses, so if your reads are in temporal order, the first n reads could be better than the last n reads. Randomly shuffling protects against this.
Autocycler subsample then decides what read depth each subset will be. This involves balancing two opposing goals:
- Maximising the independence of the read subsets.
- Ensuring sufficient read depth in each subset for a good assembly.
These goals can be contradictory because higher subset read depths can help with assembly but decrease independence (due to more overlap between subsets). Conversely, lower subset read depths can make assembly harder but increase independence.
Autocycler subsample uses this function to find a read subset depth: y = m log2(4x/m)/2
- x is the full read set depth, y is the subsampled read depth and m is the minimum subset read depth
- The deeper your input reads, the deeper your read subsets will be (with logarithmic growth).
- Explore this function interactively on Desmos
In plain language, this function says that an input read set of 25× depth will result in subsets of 25× depth (i.e. they'll be the same as the input set). It then requires a quadrupling of the input depth to add another 25× to the subsets. So 100× input depth gives 50× subset depth, 400x input depth gives 75× subset depth, 1600× input depth gives 100× subset depth, etc. The --min_read_depth
option lets you change the value of m.
I chose these values because 25× is a good lower bound (read sets below this depth often assemble poorly), 50× is a good target (all assemblers seem to work well in this range) and anything over 100× is unlikely to significantly improve assemblies.
Autocycler subsample uses the mean read length to calculate how many reads should go in each subset to achieve the target subset read depth. This means the different subsets won't have the exact same number of bases, but they will be close.
To maximise the independence of the subsets, the starting read index is offset by r/c, where r is the read count and c is the subset count. The value of c is controlled by the --count
option.
In these examples, the shuffled input reads are shown as a bar at the top. The subsets are coloured bars, with their position indicating how much they overlap with each other. I used --min_read_depth 35
and --count 12
for each.
A 150× depth read set represents a typical case for Autocycler: not spectacular but not bad either. Each resulting subsampled set is completely independent from a few others but has some shared reads with most other sets. For example, read set 1 has no overlap with sets 6, 7, and 8, but it does overlap somewhat with the other sets.
Since we have lower input depth in this example, Autocycler has made the subset depth lower as well. The subsets also overlap more, and no two subsets are completely independent.
This example has a very deep input set. The subsets are therefore deeper than in the the previous two examples, which might help them to assemble better, and they are quite independent, with each subset only sharing some overlap with two others.
- Step 1: Autocycler subsample
- Step 2: Generating input assemblies
- Step 3: Autocycler compress
- Step 4: Autocycler cluster
- Step 5: Autocycler trim
- Step 6: Autocycler resolve
- Step 7: Autocycler combine