Variant calling identifies SNP and Indel sites that vary from the reference genome. Genotyping determines the genotype for each individual at called variant sites. Variant calling and genotyping from next-generation sequencing provides a detailed background. This script takes a single pair of paired end fastq files from whole genome or exome sequencing data which have been previously QC'd and performs all steps necessary to produce a vcf or gvcf file containing germline SNPs and Indels.
The GATK best practice pipeline for calling germline variants is illustrated in the image below. This script fastqToVar.pl
includes all steps up to and including "Call Variants Per-Sample" which produces a vcf file which must be filtered prior to down stream processing. Filtering has not been included in this pipeline as it is subject to user preference and needs.
1 Software
2 User required input
- A single pair of paired end fastq files
- Read group information
3 Known variants
- Already hardcoded into script for mouse and macaque
- To update or use other organisms, download known variants from Ensembl
- If you want to use the script on a non-model organism see "No Excuses"
- Indexed vcf with:
bgzip file.vcf && tabix tabix -p vcf file.vcf.gz
4 Reference genome
- Already hardcoded into script for mouse and macaque
- Downloaded fasta from Ensembl
- Indexed fasta with:
bwa index -a fastafile.fa && /
samtools faidx fastafile.fa && /
gatk CreateSequenceDictionary -R fastafile.fa
5 Gene intervals
- Already hardcoded into script for mouse and macaque
- Downloaded chr, start, end to gene.intervals.bed from biomart
- Make a bed file for each chromosome with:
grep '^1\s' gene.intervals.bed > gene.intervals.1.bed
To subset your dataset to quickly test changes you make to the pipeline:
zcat filename_1.fastq.gz | sed -n 1,1000p > test_1.fastq && /
zcat filename_2.fastq.gz | sed -n 1,1000p > test_2.fastq && /
gzip *.fastq
The quality of the data should be assessed prior to running the script fastqToVar.pl
to see if trimming is required.
mkdir -m 777 qc && fastqc *.fastq --noextract -o ./qc -t 64 && multiqc -f -ip *
Run the script fastqToVar.pl
with the option '-of vcf' on the paired end fastq files of a single sample to output a vcf file. However, if their are multiple samples, run the script with the option '-of gvcf' on each to output a gvcf file for each sample. The multiple gvcf files can then be consolidated and used to joint call variants. This is a continuation of the GATK best practice pipeline for calling germline variants illustrated in the image above. This tutorial describes how to consolidate gvcfs for joint-calling.
The Broad Institute provide a detailed explanation of read groups and why they are needed. In summary, a read group is a set of reads that were generated from a single run of a sequencing instrument, so providing read group information assists the pipeline in the identification and removal of batch effects. Read group information can be found in the fastq filename and header and also in the sample sheet or qcstats sheet if available as shown in the following example:
- fastq filename: WTCHG_461109_50_1.fastq.gz
- fastq headers: @K00150:286:HNGMNBBXX:5:XXXX:XXXX:XXXX 1:N:0:XXXX
@instrument | run_number | flowcell_ID | lane | tile | x-pos | y-pos | read | is_filtered | control_number | index(barcode) |
---|---|---|---|---|---|---|---|---|---|---|
@K00150 | 286 | HNGMNBBXX | 5 | XXXX | XXXXX | XXXXX | 1 | N | 0 | XXXX |
XXXX
indicates entries that differentiate reads within a read group so is not needed for read group information
- sample/qcstats sheets:
Index | Tag | Readgroup | Sample_name | Sample_ID | Library | Type | Genome | Project | Date |
---|---|---|---|---|---|---|---|---|---|
50 | GTCTGTCA | WTCHG_461109_50 | mpc372-2.5e | POT5490A2 | 106/18_MPX_10nM | SureSelectXT | mm10 | P180007 | 2018-02-08 |
To view the barcodes present in the fastq file:
grep '^@K00150:286' WTCHG_461109_50_2.fastq | cut -d : -f 10 | sort | uniq -c | sort -nr > barcodes.txt
The read group information required by the script fastqToVar.pl
can now be extracted for this example data to build the example run command:
Internal GATK option | Arguments from data | fastqToVar.pl option | @RG BAM header |
---|---|---|---|
--READ_GROUP_NAME | WTCHG_461109_50 | -id | ID |
--SAMPLE_NAME | mpc372-2.5e | -sm | SM |
--LIBRARY_NAME | 106/18_MPX_10nM | not used | LB |
--PLATFORM | illumina | -pl | PL |
--PLATFORM_UNIT | HNGMNBBXX.GTCTGTCA.5 | -pu | PU |
--SEQUENCING_CENTER | WTCHG | -cn | CN |
--RUN_DATE | 2018-02-08 | -dt | DT |
The string used for --PLATFORM_UNIT is constructed by concatenating: flowcellID.[barcode|date|readgroup].lane
All options are required.
- -m: 'wes' or 'wgs' for whole exome or whole shotgun sequencing respectively
- -sp: currently only supports 'mouse' or 'macaque'
- -of: 'vcf' or 'gvcf' as described in single v multi sample analysis
- -fq1: full path to the first of the paired fastq files
- -fq2: full path to the second of the paired fastq files
- -o: full path to output folder, must end with "/"
- -id: read group name for GATK from read group information or unknown
- -pu: platform unit for GATK from read group information or unknown
- -sm: sample name for GATK from read group information or unknown
- -pl: platform for GATK from read group information or unknown
- -cn: sequencing centre code for GATK from read group information or unknown
- -dt: run date for GATK from read group information or unknown
fastqToVcfOnGrid.pl /
-m wes /
-sp mouse /
-of vcf /
-fq1 WTCHG_461109_50_1.fastq.gz /
-fq2 WTCHG_461109_50_2.fastq.gz /
-o /path/to/output/folder/ / # Note the required "/"
-id WTCHG_461109_50 /
-pu HNGMNBBXX.GTCTGTCA.5 /
-sm mpc372-2.5e /
-lb 106/18_MPX_10nM /
-pl illumina /
-cn WTCHG /
-dt 2018-02-08
Once the script finishes the read group information can be printed from the BAM file with:
samtools view -H WTCHG_461109_50.bam | grep '@RG'
@RG ID:WTCHG_461109_50 SM:mpc372-2.5e LB:106/18_MPX_10nM PL:illumina PU:HNGMNBBXX.GTCTGTCA.5 CN:WTCHG DT:2018-02-08
1 On utah
- gatk --list
- /NGS/Software/gatk-4.0.3.0/gatk --list
- java -jar /NGS/Software/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar --list
2 On grid
- /usr/java/latest8/bin/java -jar /NGS/Software/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar --list