
Parallelising input assemblies

Ryan Wick edited this page Jan 6, 2025 · 7 revisions

The Generating input assemblies step is by far the slowest part of an Autocycler assembly, so running assemblies in parallel can save a lot of time. The best way to do this will depend on your system, but this page offers a few general suggestions to get you started.

GNU Parallel

If you are running your assemblies on a server that is not managed by a queuing system, GNU parallel can manage the parallelisation:

threads=16  # set as appropriate for your system
jobs=4      # set as appropriate for your system

genome_size=$(genome_size_raven.sh "$reads" "$threads")  # can set this manually if you know the value

mkdir -p assemblies
rm -f assemblies/jobs.txt
for assembler in canu flye miniasm necat nextdenovo raven; do
    for i in 01 02 03 04; do
        echo "nice -n 19 $assembler.sh subsampled_reads/sample_$i.fastq assemblies/${assembler}_$i $threads $genome_size" >> assemblies/jobs.txt
    done
done
parallel --jobs "$jobs" --joblog assemblies/joblog.txt --results assemblies/logs < assemblies/jobs.txt

Some notes on the above commands:

  • The jobs variable stores the number of concurrent assemblies, so when set to 4 (as above), 4 assemblies will run at once. Running too many jobs at once can overwhelm your system, especially if you run out of RAM. I recommend conservatively setting jobs to a small value, at least initially.
  • The threads variable sets how many threads each job uses, so the total thread count can reach threads×jobs (64 in the above example).
  • Tools such as top and htop can be used to monitor the resource usage of in-progress assemblies.
  • To ensure that the assemblies run at a lower CPU priority, I included nice -n 19. This is optional but can help reduce the risk of the assemblies slowing down other processes.
  • The assemblies/logs directory will contain a subdirectory for each assembly containing stdout and stderr files, which can be useful in troubleshooting any unsuccessful assemblies.
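If any assemblies fail, the joblog can show which ones. As a sketch (assuming the joblog format of recent GNU Parallel versions, where column 7 holds the exit value and the command is the final field):

```shell
# Print the command of each job that exited with a non-zero status.
# Column 7 of the joblog is the exit value; the command is the last field.
awk -F'\t' 'NR > 1 && $7 != 0 {print $NF}' assemblies/joblog.txt
```

GNU Parallel can also rerun only the failed jobs: re-invoke the original parallel command with --resume-failed added, and it will use the joblog to skip jobs that already completed successfully.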

SLURM

If you are running your assemblies on a cluster that is managed by SLURM, these commands can serve as a template:

threads=16  # should match --cpus-per-task in the sbatch commands below

genome_size=$(genome_size_raven.sh "$reads" "$threads")  # can set this manually if you know the value

mkdir -p assemblies
for i in 01 02 03 04; do
    sbatch --job-name=canu_"$i" --time=12:00:00 --mem=128000 --ntasks=1 --cpus-per-task=16 --wrap "canu.sh subsampled_reads/sample_$i.fastq assemblies/canu_$i $threads $genome_size"
    sbatch --job-name=flye_"$i" --time=2:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "flye.sh subsampled_reads/sample_$i.fastq assemblies/flye_$i $threads $genome_size"
    sbatch --job-name=miniasm_"$i" --time=1:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "miniasm.sh subsampled_reads/sample_$i.fastq assemblies/miniasm_$i $threads $genome_size"
    sbatch --job-name=necat_"$i" --time=2:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "necat.sh subsampled_reads/sample_$i.fastq assemblies/necat_$i $threads $genome_size"
    sbatch --job-name=nextdenovo_"$i" --time=2:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "nextdenovo.sh subsampled_reads/sample_$i.fastq assemblies/nextdenovo_$i $threads $genome_size"
    sbatch --job-name=raven_"$i" --time=1:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "raven.sh subsampled_reads/sample_$i.fastq assemblies/raven_$i $threads $genome_size"
done

Some notes on the above commands:

  • Additional sbatch flags (e.g. --partition, --qos, etc.) may be necessary, depending on the configuration of your cluster.
  • The above commands separate each assembler to allow for customising the resource requirements. For example, Canu is more resource-intensive, so it can be granted a longer wall time and more memory than the other assemblers.
  • The above commands assume that the assemblers and helper scripts are available in the PATH variable for the running jobs. Depending on the system setup, it may be necessary to activate the environment somehow (e.g. conda activate autocycler or module load autocycler) at the start of the wrapped command.
    • For example: sbatch --job-name=canu_"$i" --time=12:00:00 --ntasks=1 --mem=128000 --cpus-per-task=16 --wrap "conda activate autocycler && canu.sh subsampled_reads/sample_$i.fastq assemblies/canu_$i $threads $genome_size"
  • Tools such as squeue, sacct and scancel can be used to monitor and control SLURM jobs.
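The six near-identical sbatch lines above can also be driven from a pair of associative arrays, which keeps the per-assembler wall times and memory limits in one place. A minimal bash sketch (the resource values below simply mirror the example above; adjust them for your data):

```shell
#!/usr/bin/env bash
# Per-assembler wall time and memory (MB); values mirror the example above.
declare -A walltime=( [canu]=12:00:00 [flye]=2:00:00 [miniasm]=1:00:00
                      [necat]=2:00:00 [nextdenovo]=2:00:00 [raven]=1:00:00 )
declare -A memory=(   [canu]=128000   [flye]=64000   [miniasm]=64000
                      [necat]=64000   [nextdenovo]=64000   [raven]=64000 )

for assembler in canu flye miniasm necat nextdenovo raven; do
    for i in 01 02 03 04; do
        sbatch --job-name="${assembler}_$i" \
               --time="${walltime[$assembler]}" --mem="${memory[$assembler]}" \
               --ntasks=1 --cpus-per-task=16 \
               --wrap "$assembler.sh subsampled_reads/sample_$i.fastq assemblies/${assembler}_$i $threads $genome_size"
    done
done
```

This submits the same 24 jobs as the explicit version, and adding an assembler or changing a resource limit only requires editing one place.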