Python package to run Join Calling on population at NGI (National Genomics Infrastructure) Sweden.
The script joint_variant_calling.py
implements the GATK-Workflow described in
https://www.broadinstitute.org/gatk/guide/article?id=3893
With option --mixed-positions
VQSR step is executed following best-practice described in
http://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr
in order to avoid the problem with MIXED positions (i.e., position where an indel overlaps a SNP). This is the mode used to run SweGen dataset.
Samples to be be join called can be specified in two ways:
- in the sample field of the config.yaml file to be provided as input. Complete path to the gvcf file needs to be provided
- a file named 00_samples.txt in the cwd. If present samples specified in this file will be used instead of those speccified in the config
If run like
python joint_variant_calling.py --config config.yaml
it creates the following folder structure
00_intervals
: optional see Intervals section00_samples.txt
: samples that are processed (i.e., samples that are join called)01_CombineGVCFs
: step one is CombineGVCFs, batching gvcf files together02_GenotypeGVCFs
: then run GenotypeGVCFs03_CatVariants
: merge the gvcfs into one in case computation has been spread out into intervals (see Intervals section)04_SelectVariants
: extract SNPs and INDELs and run eveluation tool from GATK to prepare VQSR05_VariantRecalibrator
: first step of VQSR06_ApplyRecalibration
: second step of VQSR
If run like
python joint_variant_calling.py --config config.yaml --mixed-positions
it creates the following folder structure
00_intervals
: optional see Intervals section00_samples.txt
: samples that are processed (i.e., samples that are join called)01_CombineGVCFs
: step one is CombineGVCFs, batching gvcf files together02_GenotypeGVCFs
: then run GenotypeGVCFs03_CatVariants
: merge the gvcfs into one in case computation has been spread out into intervals (see Intervals section)04_VQSR
: VQSR step executed as explained in http://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr
Folders 01, 02, ..., 06 are all organised in the same way:
sbatch
: sbatch files to be submitted to the slurm queue (Uppmax assumed)std_err
: output of the standard errorstd_out
: output of the standard outputVCF
: contains the gvcf or vcf files (in general results files, in case of05_VariantRecalibrator
contains recalibration tables)
Folder 01_CombineGVCFs
contains an extra sub-folder:
batches
: used to restart joint calling if new samples are available [UNSTABLE: UNDER TEST]
If run like
python joint_variant_calling.py --config config.yaml --resume
it resumes the joint calling adding new samples and recomputing only the last (if needed) and the new batches (in 01_CombineGVCFs
)
The most time consuming steps of the workflow are 01_CombineGVCFs
and 02_GenotypeGVCFs
. These two steps can be parallised running the commands on non overlapping sections of the genome.
For this purpose an utility script is provided:
python utils/create_interval_list.py --dict intervals/human_g1k_v37.dict --block 1000000000
the file intervals/human_g1k_v37.dict
can be found in this repo.
If run in a directory (e.g. 00_intervals) it creates the intrvals files (.intervals). This folder need to be provided in the config.yaml file. Running the example command will create 4 blocks,
first 3 of 1Gbp and the last one of ~200Mbp.
The example
directory in this repo contains a dry-run of joint_variant_calling.py run on intervals generated by this command, on 7 samples, in batches of 4 (i.e.,this creates two batches, one of 4 and one of 3 samples)