Parallelising input assemblies
The Generating input assemblies step is by far the slowest part of an Autocycler assembly, so running assemblies in parallel can save a lot of time. The best way to do this will depend on your system, but this page offers a few general suggestions to get you started.
If you are running your assemblies on a server that is not managed by a queuing system, GNU parallel can manage the parallelisation:
```bash
threads=16  # set as appropriate for your system
jobs=4      # set as appropriate for your system
genome_size=$(genome_size_raven.sh "$reads" "$threads")  # can set this manually if you know the value

mkdir -p assemblies
rm -f assemblies/jobs.txt
for assembler in canu flye miniasm necat nextdenovo raven; do
    for i in 01 02 03 04; do
        echo "nice -n 19 $assembler.sh subsampled_reads/sample_$i.fastq assemblies/${assembler}_$i $threads $genome_size" >> assemblies/jobs.txt
    done
done
parallel --jobs "$jobs" --joblog assemblies/joblog.txt --results assemblies/logs < assemblies/jobs.txt
```
Some notes on the above commands:
- The `jobs` variable stores the number of concurrent assemblies, so when set to 4 (as above), 4 assemblies will run at once. Running too many jobs at once can overwhelm your system, especially if you run out of RAM, so I recommend conservatively setting `jobs` to a small value, at least initially.
- The `threads` variable is how many threads are used per job, so the total number of threads used will be `threads` × `jobs` (64 in the above example).
- Tools such as `top` and `htop` can be used to monitor the resource usage of in-progress assemblies.
- To ensure that the assemblies run at a lower CPU priority, I included `nice -n 19`. This is optional but can help reduce the risk of the assemblies slowing down other processes.
- The `assemblies/logs` directory will contain a subdirectory for each assembly containing `stdout` and `stderr` files, which can be useful when troubleshooting unsuccessful assemblies (see the sketch below for one way to find failed jobs).
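If some assemblies fail, GNU parallel's joblog can help to identify and rerun them. A minimal sketch, assuming the standard tab-separated joblog format (exit value in the seventh column, command in the last):

```bash
# Print the exit value and command of every job that did not exit cleanly:
awk -F'\t' 'NR > 1 && $7 != 0 {print "exit " $7 ": " $NF}' assemblies/joblog.txt

# After addressing the cause, rerun only the failed jobs:
parallel --jobs "$jobs" --joblog assemblies/joblog.txt --resume-failed \
    --results assemblies/logs < assemblies/jobs.txt
```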
If you are running your assemblies on a cluster that is managed by SLURM, these commands can serve as a template:
```bash
threads=16  # should match the --cpus-per-task value below
genome_size=$(genome_size_raven.sh "$reads" "$threads")  # can set this manually if you know the value

mkdir -p assemblies
for i in 01 02 03 04; do
    sbatch --job-name=canu_"$i" --time=12:00:00 --mem=128000 --ntasks=1 --cpus-per-task=16 --wrap "canu.sh subsampled_reads/sample_$i.fastq assemblies/canu_$i $threads $genome_size"
    sbatch --job-name=flye_"$i" --time=2:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "flye.sh subsampled_reads/sample_$i.fastq assemblies/flye_$i $threads $genome_size"
    sbatch --job-name=miniasm_"$i" --time=1:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "miniasm.sh subsampled_reads/sample_$i.fastq assemblies/miniasm_$i $threads $genome_size"
    sbatch --job-name=necat_"$i" --time=2:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "necat.sh subsampled_reads/sample_$i.fastq assemblies/necat_$i $threads $genome_size"
    sbatch --job-name=nextdenovo_"$i" --time=2:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "nextdenovo.sh subsampled_reads/sample_$i.fastq assemblies/nextdenovo_$i $threads $genome_size"
    sbatch --job-name=raven_"$i" --time=1:00:00 --mem=64000 --ntasks=1 --cpus-per-task=16 --wrap "raven.sh subsampled_reads/sample_$i.fastq assemblies/raven_$i $threads $genome_size"
done
```
Some notes on the above commands:
- Additional `sbatch` flags (e.g. `--partition`, `--qos`, etc.) may be necessary, depending on the configuration of your cluster.
- The above commands give each assembler its own `sbatch` call to allow for customising the resource requirements. For example, Canu is more resource-intensive, so it is granted a longer wall time and more memory than the other assemblers.
- The above commands assume that the assemblers and helper scripts are available on the `PATH` for the running jobs. Depending on the system setup, it may be necessary to activate the environment somehow (e.g. `conda activate autocycler` or `module load autocycler`) at the start of the wrapped command. For example:
  ```bash
  sbatch --job-name=canu_"$i" --time=12:00:00 --ntasks=1 --mem=128000 --cpus-per-task=16 --wrap "conda activate autocycler && canu.sh subsampled_reads/sample_$i.fastq assemblies/canu_$i $threads $genome_size"
  ```
- Tools such as `squeue`, `sacct` and `scancel` can be used to monitor and control SLURM jobs (see the sketch below).
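A minimal sketch of monitoring and controlling these jobs (the job ID and the `sacct` fields available will depend on your cluster's configuration):

```bash
# List your queued and running jobs:
squeue -u "$USER"

# Summarise a finished job's state, wall time and peak memory usage:
sacct --name=canu_01 --format=JobID,JobName,State,Elapsed,MaxRSS

# Cancel a job that is misbehaving (replace 12345 with the actual job ID):
scancel 12345
```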