This repository contains a Makefile to process paired-end Illumina data with bowtie2, samtools, picard, and GATK. Much of the analysis is based on @knausb's bam processing workflow, tweaked for a haploid plant pathogen with an available genome (here we use it for *Sclerotinia sclerotiorum*). This is designed to run on the HCC SLURM cluster; there are no guarantees that this will work anywhere else.
To build your analysis, add your data to directories called `reads/` and `genome/`. In your shell, type:

```
make
```

This will generate the genome index, sam files, bam files, g.vcf files, and a final `GATK/res.vcf.gz`.
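For example, a minimal starting layout for one sample might look like the sketch below (the `_1.fq.gz`/`_2.fq.gz` read-pair suffixes are an assumption here; use whatever naming your copy of the makefile expects):

```bash
# Hypothetical setup for a single sample named SS.11.01:
mkdir -p reads genome
cp /path/to/SS.11.01_1.fq.gz /path/to/SS.11.01_2.fq.gz reads/
cp /path/to/GCA_001857865.1_ASM185786v1_genomic.fna.gz genome/
make   # runs the whole pipeline through to GATK/res.vcf.gz
```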
You can find all of the makefile options by typing `make help`:
```
$ make help

COMMANDS
============

all       makes everything: res.vcf.gz
index     generate the bowtie2 index
trim      trim reads with trimmomatic
map       map reads with bowtie2
bam       convert sam to bam, filter, and remove duplicates
validate  run validation stats on the bam and sam files
gvcf      create g.vcf files from HaplotypeCaller
vcf       create vcf files (this is the longest step)
help      show this message
clean     remove all *.jid files
cleanall  REMOVE ALL GENERATED FILES

PARAMETERS
============

ROOT DIR  : /work/everhartlab/kamvarz/test/read-processing
TEMP DIR  : $TMPDIR
INDEX DIR : bt2-index
PREFIX    : Ssc
GENOME    : genome/GCA_001857865.1_ASM185786v1_genomic.fna.gz genome/sclerotinia_sclerotiorum_mitochondria_2_contigs.fasta.gz
READS     : reads/SS.11.01 reads/SS.11.02

Modules
============

BOWTIE   : bowtie/2.2
TRIMMOD  : trimmomatic/0.36
SAMTOOLS : samtools/1.3
VCFTOOLS : vcftools/0.1
PICARD   : picard/2.9
GATK     : gatk/3.4
```
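Since these parameters are ordinary make variables, they can typically be overridden on the command line without editing the makefile. For example (the value below is illustrative):

```bash
# Hypothetical override of the PREFIX parameter shown above:
make index PREFIX=MyOrg
```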
A workflow with two samples and two genome files (full and mitochondria) would look like this:
- Blue nodes: user-supplied files
- Grey nodes: makefile-generated folders
- Pink nodes: derived files per sample
- Purple/periwinkle nodes: indexed genomic data
- Brown nodes: scripts in the scripts directory

You can find the pdf version of this graph here. The color palette used for the graph is from the GrandBudapest2 palette of the Wes Anderson Palettes by Karthik Ram.
If you want to add steps/rules to the workflow, you should first be comfortable with Makefiles. Here are some helpful guides:
Many of the recipes in the makefile take the form of:
```make
# Converting In to Out --------------------------------------------------------
$(OUT_DIR)/%.out : $(FROM_DIR)/%.in scripts/in-to-out.sh | $(OUT_DIR) $(OUT_RUN)
	sbatch \
	   -D $(ROOT_DIR) \
	   -J IN-TO-OUT \
	   --dependency=afterok:$$(bash scripts/get-job.sh $<.jid) \
	   -o $(OUT_RUN)/$*.out \
	   -e $(OUT_RUN)/$*.err \
	   scripts/in-to-out.sh $< | cut -c 21- > $@.jid
```
Here, `sbatch` is being used on the HCC to submit a custom SLURM script, `scripts/in-to-out.sh`, which will convert `.in` files to `.out` files. One of the first things to note are the trailing backslashes. These are continuation lines for BASH, so when the recipe runs it will run as a single command:

```
sbatch -D $(ROOT_DIR) -J IN-TO-OUT --dependency=afterok:$$(bash scripts/get-job.sh $<.jid) -o $(OUT_RUN)/$*.out -e $(OUT_RUN)/$*.err scripts/in-to-out.sh $< | cut -c 21- > $@.jid
```
All jobs run from `sbatch` will return the following:

```
Submitted batch job XXXXX
```

where XXXXX is the JOBID. Since make does not know anything about jobs running on SLURM, we need to tell SLURM to hold off on running a given job until the jobs it depends on are done. We take advantage of this by piping the sbatch output to `cut -c 21- > $@.jid`, which creates a file containing just the JOBID that we can use in a downstream process.
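The reason `cut -c 21-` works is that the prefix "Submitted batch job " is exactly 20 characters long, so the JOBID starts at column 21:

```bash
# A quick sanity check of the column arithmetic:
echo "Submitted batch job 123456" | cut -c 21-
# prints: 123456
```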
The arguments to `sbatch` are:

- `-D` the working directory for the job (here, the root directory of the project)
- `-J` job name (we try to make them match the script names)
- `--dependency` stipulates that this job can only run once the dependency from the `.jid` file has been fulfilled. Note that multiple jid files can be specified, and that because `$` is special in make, we have to use a second `$` to escape it (a sketch of a possible `get-job.sh` helper follows this list).
- `-o`, `-e` the stdout and stderr files, respectively
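The contents of `scripts/get-job.sh` are not shown here, but a minimal sketch of such a helper, assuming each `.jid` file holds one JOBID per line, might look like this (the real script may differ):

```bash
#!/usr/bin/env bash
# Hypothetical get-job.sh: print the JOBIDs stored in one or more .jid
# files, joined with ':' so the output can be dropped directly into
# --dependency=afterok:<id>[:<id>...]
cat "$@" | paste -sd: -
```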
You may have noticed that this uses automatic makefile variables such as `$<` and `$@`, which stand for the first prerequisite and the target, respectively. Depending on the needs of various scripts, we need different variables, so it is a good idea to keep the GNU make documentation on automatic variables open as a reference when creating new steps.
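As a quick, throwaway illustration of how these variables expand (not part of the repository):

```bash
# $< expands to the first prerequisite and $@ to the target, so the
# recipe below runs `cp in.txt out.txt` when out.txt is built:
printf 'out.txt: in.txt\n\tcp $< $@\n' > demo.mk
touch in.txt
make -f demo.mk out.txt
```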
Here's an example from the makefile:

```make
# Sorting and Converting to BAM files -----------------------------------------
$(BAM_DIR)/%_nsort.bam : $(SAM_DIR)/%.sam scripts/sam-to-bam.sh | $(BAM_DIR) $(BAM_RUN)
	sbatch \
	   -D $(ROOT_DIR) \
	   -J SAM-TO-BAM \
	   --dependency=afterok:$$(bash scripts/get-job.sh $<.jid) \
	   -o $(BAM_RUN)/$*_nsort.out \
	   -e $(BAM_RUN)/$*_nsort.err \
	   scripts/sam-to-bam.sh $(<D) $(@D) $* $(SAMTOOLS) | \
	   cut -c 21- > $@.jid
```
In this example, we are using the `sam-to-bam.sh` script to convert sam files to bam files. If you run the script by itself, it will tell you the requirements:

```
Usage: sam-to-bam.sh <SAMDIR> <BAMDIR> <BASE> <SAMTOOLS>

	<SAMDIR>   - the directory for the sam files
	<BAMDIR>   - the directory for the bam files
	<BASE>     - base name for the sample (e.g. SS.11.01)
	<SAMTOOLS> - the samtools module (e.g. samtools/1.3)
```
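The makefile fills these in with `$(<D)` (the sam directory), `$(@D)` (the bam directory), `$*` (the sample base name), and `$(SAMTOOLS)`. Running the script by hand for one sample would look something like this, assuming the default SAMS/ and BAMS/ directories:

```bash
# Manual invocation for sample SS.11.01:
bash scripts/sam-to-bam.sh SAMS BAMS SS.11.01 samtools/1.3
```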
The makefile generates the following directories:

- `bt2-index/`: genome index files generated via `make index`
- `TRIM/`: home for the trimmed reads via trimmomatic
- `SAMS/`: mapped sam files generated via `make map`
- `BAMS/`: filtered bam files
- `GVCF/`: `*.g.vcf` and `*.vcf` files generated via GATK
- `runs/`: stdout and stderr of runs
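Because every step is submitted through sbatch, progress can be checked with standard SLURM tools, and the runs/ directory collects the logs (the exact commands below are illustrative):

```bash
squeue -u "$USER"       # list your jobs that are still queued or running
grep -ri "error" runs/  # scan the collected logs for failures
```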