A structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data.
- Requirements
- Citation
- Quickstart
- Pipeline Overview
- Module Descriptions
- GatherSampleEvidence - Raw callers and evidence collection
- EvidenceQC - Batch QC
- TrainGCNV - gCNV model creation
- GatherBatchEvidence - Batch evidence merging, BAF generation, and depth callers
- ClusterBatch - Site clustering
- GenerateBatchMetrics - Site metrics
- FilterBatch - Filtering
- MergeBatchSites - Cross-batch site merging
- GenotypeBatch - Genotyping
- RegenotypeCNVs - Genotype refinement (optional)
- MakeCohortVcf - Cross-batch integration, complex event resolution, and VCF cleanup
- Module 07 - Downstream Filtering
- AnnotateVcf - Annotation
- Module 09 - QC and Visualization
- Additional modules - Mosaic and de novo
- CI/CD
- Troubleshooting
- A Google Cloud account.
- A workflow execution system supporting the Workflow Description Language (WDL), either Cromwell (recommended) or Terra
- Recommended: MELT. Due to licensing restrictions, we cannot provide a public docker image or reference panel VCFs for this algorithm.
- Recommended: cromshell for interacting with a dedicated Cromwell server.
- Recommended: WOMtool for validating WDL/json files.
- Illumina short-read whole-genome CRAMs or BAMs, aligned to hg38 with bwa-mem. BAMs must also be indexed.
- Indexed GVCFs produced by GATK HaplotypeCaller, or a jointly genotyped VCF.
- Family structure definitions file in PED format (a minimal example is sketched below). Sex aneuploidies (detected in EvidenceQC) should be entered as sex = 0.
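For reference, a PED file is a tab-delimited, six-column text file: family ID, sample ID, paternal ID, maternal ID, sex (1 = male, 2 = female, 0 = unknown/aneuploid), and phenotype. The IDs below are made up purely for illustration; the third sample shows a sex aneuploidy entered as sex = 0.

```
#FAM_ID   SAMPLE_ID   FATHER_ID   MOTHER_ID   SEX   PHENO
fam_01    sample_01   0           0           1     0
fam_01    sample_02   0           0           2     0
fam_02    sample_03   0           0           0     0
```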
We recommend filtering out samples with a high percentage of improperly paired reads (>10% or an outlier for your data) as technical outliers prior to running GatherSampleEvidence. A high percentage of improperly paired reads may indicate issues with library prep, degradation, or contamination. Artifactual improperly paired reads could cause incorrect SV calls, and these samples have been observed to have longer runtimes and higher compute costs for GatherSampleEvidence.
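One way to spot-check this metric before launching the workflow, assuming samtools is installed (for CRAM inputs a local copy of the reference may also be required):

```
# Report read-pairing statistics; samples below ~90% properly paired
# (i.e. >10% improperly paired) may be technical outliers.
samtools flagstat sample.bam | grep 'properly paired'
```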
Sample IDs must:
- Be unique within the cohort
- Contain only alphanumeric characters and underscores (no dashes, whitespace, or special characters)
Sample IDs should not:
- Contain only numeric characters
- Be a substring of another sample ID in the same cohort
- Contain any of the following substrings:
- `chr`
- `name`
- `DEL`
- `DUP`
- `CPX`
- `CHROM`
The same requirements apply to family IDs in the PED file, as well as batch IDs and the cohort ID provided as workflow inputs.
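As a quick sanity check, a command along these lines flags IDs that are purely numeric, contain illegal characters, or contain a reserved substring (the file sample_ids.list, with one ID per line, is hypothetical; the substring-of-another-ID rule still needs to be checked separately):

```
# Print any sample ID that violates the character or substring rules.
grep -E -n '[^A-Za-z0-9_]|^[0-9]+$|chr|name|DEL|DUP|CPX|CHROM' sample_ids.list
```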
Sample IDs are provided to GatherSampleEvidence directly and need not match sample names from the BAM/CRAM headers or GVCFs. `GetSampleID.wdl` can be used to fetch BAM sample IDs and also generates a set of alternate IDs that are considered safe for this pipeline; alternatively, a helper script is provided that transforms a list of sample IDs to fit these requirements. Currently, sample IDs can be replaced again in GatherBatchEvidence.
The following inputs will need to be updated with the transformed sample IDs:
- Sample ID list for GatherSampleEvidence or GatherBatchEvidence
- PED file
If using a SNP VCF in GatherBatchEvidence, it does not need to be re-headered; simply provide the `vcf_samples` argument.
Please cite the following publication: Collins, Brand, et al. 2020. "A structural variation reference for medical and population genetics." Nature 581, 444-451.
Additional references: Werling et al. 2018. "An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder." Nature genetics 50.5, 727-736.
There are two scripts for running the full pipeline:
- `wdl/GATKSVPipelineBatch.wdl`: Runs GATK-SV on a batch of samples
- `wdl/GATKSVPipelineSingleSample.wdl`: Runs GATK-SV on a single sample, given a reference panel
Example workflow inputs can be found in `/input_templates` or `/test_input_templates`. All required resources are available in public Google buckets.
Important: The example input files contain MELT inputs that are NOT public (see Requirements). These include:
- `GATKSVPipelineSingleSample.melt_docker` and `GATKSVPipelineBatch.melt_docker` - MELT docker URI (see Docker readme)
- `GATKSVPipelineSingleSample.ref_std_melt_vcfs` - Standardized MELT VCFs (GatherBatchEvidence)

The input values are provided only as an example and are not publicly accessible. In order to include MELT, these values must be provided by the user. MELT can be disabled by deleting these inputs and setting `GATKSVPipelineBatch.use_melt` to `false`.
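For example, disabling MELT in a batch-mode input JSON might look like the following minimal sketch (the MELT docker and VCF inputs are simply removed):

```
{
  "GATKSVPipelineBatch.use_melt": false
}
```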
Important: The following parameters must be set when certain input data is in requester pays (RP) buckets (an example snippet follows this list):
- `GATKSVPipelineSingleSample.requester_pays_cram` and `GATKSVPipelineBatch.GatherSampleEvidenceBatch.requester_pays_crams` - set to `true` if inputs are in CRAM format and in an RP bucket, otherwise `false`.
- `GATKSVPipelineBatch.GATKSVPipelinePhase1.gcs_project_for_requester_pays` - set to your Google Cloud Project ID if GVCFs are in an RP bucket, otherwise omit this parameter.
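A hedged example of how these might appear in a batch-mode input JSON (the project ID is a placeholder):

```
{
  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.requester_pays_crams": true,
  "GATKSVPipelineBatch.GATKSVPipelinePhase1.gcs_project_for_requester_pays": "my-gcp-project-id"
}
```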
We recommend running the pipeline on a dedicated Cromwell server with a cromshell client. A batch run can be started with the following commands:
```
> mkdir gatksv_run && cd gatksv_run
> mkdir wdl && cd wdl
> cp $GATK_SV_ROOT/wdl/*.wdl .
> zip dep.zip *.wdl
> cd ..
> bash scripts/inputs/build_default_inputs.sh -d $GATK_SV_ROOT
> cp $GATK_SV_ROOT/inputs/GATKSVPipelineBatch.ref_panel_1kg.json GATKSVPipelineBatch.my_run.json
> cromshell submit wdl/GATKSVPipelineBatch.wdl GATKSVPipelineBatch.my_run.json cromwell_config.json wdl/dep.zip
```
where `cromwell_config.json` is a Cromwell workflow options file. Note that users will need to re-populate batch/sample-specific parameters (e.g. BAMs and sample IDs).
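As a rough sketch, a minimal workflow options file might only redirect final outputs and set default runtime attributes; the bucket path and values below are placeholders, and the Cromwell documentation lists the full set of supported options:

```
{
  "final_workflow_outputs_dir": "gs://my-bucket/gatksv-outputs",
  "default_runtime_attributes": {
    "zones": "us-central1-a us-central1-b us-central1-c",
    "preemptible": 1
  }
}
```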
The pipeline consists of a series of modules that perform the following:
- GatherSampleEvidence: SV evidence collection, including calls from a configurable set of algorithms (Delly, Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
- EvidenceQC: Dosage bias scoring and ploidy estimation
- GatherBatchEvidence: Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
- ClusterBatch: Variant clustering
- GenerateBatchMetrics: Variant filtering metric generation
- FilterBatch: Variant filtering; outlier exclusion
- GenotypeBatch: Genotyping
- MakeCohortVcf: Cross-batch integration; complex variant resolution and re-genotyping; VCF cleanup
- Module 07: Downstream filtering, including minGQ, batch effect checks, outlier sample removal, and final recalibration
- AnnotateVcf: Annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets
- Module 09: Visualization, including scripts that generate IGV screenshots and RD plots
- Additional modules to be added: de novo and mosaic scripts
Repository structure:
- `/input_templates`: Example workflow parameter file templates for running gCNV training, GATK-SV batch mode, and GATK-SV single-sample mode. Generate parameter files from templates using `scripts/inputs/build_default_inputs.sh`.
- `/input_values`: Example workflow input values to populate templates. Please note that file inputs may not be publicly available.
- `/dockerfiles`: Resources for building pipeline docker images (see readme)
- `/wdl`: WDLs running the pipeline. There is a master WDL for running each module, e.g. `ClusterBatch.wdl`.
- `/scripts`: Scripts for running tests, building dockers, and analyzing cromwell metadata files
- `/src`: Main pipeline scripts
  - `/RdTest`: Scripts for depth testing
  - `/sv-pipeline`: Various scripts and packages used throughout the pipeline
  - `/svqc`: Python module for checking that pipeline metrics fall within acceptable limits
  - `/svtest`: Python module for generating various summary metrics from module outputs
  - `/svtk`: Python module of tools for SV-related datafile parsing and analysis
  - `/WGD`: Whole-genome dosage scoring scripts
- `/test_input_templates`: WDL test parameter file templates. Generate parameter files from templates using `scripts/inputs/build_default_inputs.sh`.
A minimum cohort size of 100 with a roughly equal number of males and females is recommended. For modest cohorts (~100-500 samples), the pipeline can be run as a single batch using `GATKSVPipelineBatch.wdl`.
For larger cohorts, samples should be split up into batches of about 100-500 samples. Refer to the Batching section for further guidance on creating batches.
The pipeline should be executed as follows:
- Modules GatherSampleEvidence and EvidenceQC can be run on arbitrary cohort partitions
- Modules GatherBatchEvidence, ClusterBatch, GenerateBatchMetrics, and FilterBatch are run separately per batch
- GenotypeBatch is run separately per batch, using filtered variants (FilterBatch output) combined across all batches
- MakeCohortVcf and beyond are run on all batches together
Note: GatherBatchEvidence requires a trained gCNV model.
For larger cohorts, samples should be split up into batches of about 100-500 samples with similar characteristics. We recommend batching based on overall coverage and dosage score (WGD), which can be generated in EvidenceQC. An example batching process is outlined below:
- Divide the cohort into PCR+ and PCR- samples
- Partition the samples by median coverage from EvidenceQC, grouping samples with similar median coverage together. The end goal is to divide the cohort into roughly equal-sized batches of about 100-500 samples; if your partitions based on coverage are larger or uneven, you can partition the cohort further in the next step to obtain the final batches.
- Optionally, divide the samples further by dosage score (WGD) from EvidenceQC, grouping samples with similar WGD score together, to obtain roughly equal-sized batches of about 100-500 samples
- Maintain a roughly equal sex balance within each batch, based on sex assignments from EvidenceQC
`GATKSVPipelineSingleSample.wdl` runs the pipeline on a single sample using a fixed reference panel. An example reference panel containing 156 samples from the NYGC 1000G Terra workspace is provided with `inputs/GATKSVPipelineSingleSample.ref_panel_1kg.na12878.json`.
Custom reference panels can be generated by running `GATKSVPipelineBatch.wdl` and `TrainGCNV.wdl` and using the outputs to replace the following single-sample workflow inputs (a partial example input JSON is sketched after this list):
- `GATKSVPipelineSingleSample.ref_ped_file`: `batch.ped` - Manually created (see data requirements)
- `GATKSVPipelineSingleSample.contig_ploidy_model_tar`: `batch-contig-ploidy-model.tar.gz` - gCNV contig ploidy model (TrainGCNV)
- `GATKSVPipelineSingleSample.gcnv_model_tars`: `batch-model-files-*.tar.gz` - gCNV model tarballs (TrainGCNV)
- `GATKSVPipelineSingleSample.ref_pesr_disc_files`: `sample.disc.txt.gz` - Paired-end evidence files (GatherSampleEvidence)
- `GATKSVPipelineSingleSample.ref_pesr_split_files`: `sample.split.txt.gz` - Split read evidence files (GatherSampleEvidence)
- `GATKSVPipelineSingleSample.ref_panel_bincov_matrix`: `batch.RD.txt.gz` - Read counts matrix (GatherBatchEvidence)
- `GATKSVPipelineSingleSample.ref_panel_del_bed`: `batch.DEL.bed.gz` - Depth deletion calls (GatherBatchEvidence)
- `GATKSVPipelineSingleSample.ref_panel_dup_bed`: `batch.DUP.bed.gz` - Depth duplication calls (GatherBatchEvidence)
- `GATKSVPipelineSingleSample.ref_samples` - Reference panel sample IDs
- `GATKSVPipelineSingleSample.ref_std_manta_vcfs`: `std_XXX.manta.sample.vcf.gz` - Standardized Manta VCFs (GatherBatchEvidence)
- `GATKSVPipelineSingleSample.ref_std_melt_vcfs`: `std_XXX.melt.sample.vcf.gz` - Standardized MELT VCFs (GatherBatchEvidence)
- `GATKSVPipelineSingleSample.ref_std_wham_vcfs`: `std_XXX.wham.sample.vcf.gz` - Standardized Wham VCFs (GatherBatchEvidence)
- `GATKSVPipelineSingleSample.cutoffs`: `batch.cutoffs` - Filtering cutoffs (FilterBatch)
- `GATKSVPipelineSingleSample.genotype_pesr_pesr_sepcutoff`: `genotype_pesr.pesr_sepcutoff.txt` - Genotyping cutoffs (GenotypeBatch)
- `GATKSVPipelineSingleSample.genotype_pesr_depth_sepcutoff`: `genotype_pesr.depth_sepcutoff.txt` - Genotyping cutoffs (GenotypeBatch)
- `GATKSVPipelineSingleSample.genotype_depth_pesr_sepcutoff`: `genotype_depth.pesr_sepcutoff.txt` - Genotyping cutoffs (GenotypeBatch)
- `GATKSVPipelineSingleSample.genotype_depth_depth_sepcutoff`: `genotype_depth.depth_sepcutoff.txt` - Genotyping cutoffs (GenotypeBatch)
- `GATKSVPipelineSingleSample.PE_metrics`: `pe_metric_file.txt` - Paired-end evidence genotyping metrics (GenotypeBatch)
- `GATKSVPipelineSingleSample.SR_metrics`: `sr_metric_file.txt` - Split read evidence genotyping metrics (GenotypeBatch)
- `GATKSVPipelineSingleSample.ref_panel_vcf`: `batch.cleaned.vcf.gz` - Final output VCF (MakeCohortVcf)
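A partial sketch of what a few of these overrides might look like in the single-sample input JSON (bucket paths are placeholders):

```
{
  "GATKSVPipelineSingleSample.ref_ped_file": "gs://my-bucket/ref-panel/batch.ped",
  "GATKSVPipelineSingleSample.contig_ploidy_model_tar": "gs://my-bucket/ref-panel/batch-contig-ploidy-model.tar.gz",
  "GATKSVPipelineSingleSample.ref_panel_bincov_matrix": "gs://my-bucket/ref-panel/batch.RD.txt.gz",
  "GATKSVPipelineSingleSample.ref_panel_vcf": "gs://my-bucket/ref-panel/batch.cleaned.vcf.gz"
}
```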
Both the cohort and single-sample modes use the GATK gCNV depth calling pipeline, which requires a trained model as input. The samples used for training should be technically homogeneous and similar to the samples to be processed (i.e. same sample type, library prep protocol, sequencer, sequencing center, etc.). The samples to be processed may comprise all or a subset of the training set. For small, relatively homogeneous cohorts, a single gCNV model is usually sufficient. If a cohort contains multiple data sources, we recommend training a separate model for each batch or group of batches with similar dosage scores (WGD). The model may be trained on all or a subset of the samples to which it will be applied; a reasonable default is 100 randomly selected samples from the batch (the random selection can be done as part of the workflow by specifying a number of samples for the `n_samples_subsample` input parameter in `/wdl/TrainGCNV.wdl`).
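For instance, subsampling 100 training samples might be requested with an input along these lines (assuming the workflow-level key matches the parameter name described above):

```
{
  "TrainGCNV.n_samples_subsample": 100
}
```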
The following sections briefly describe each module and highlight inter-dependent input/output files. Note that input/output mappings can also be gleaned from `GATKSVPipelineBatch.wdl`, and example input files for each module can be found in `/test_input_templates`.
Formerly Module00a
Runs raw evidence collection on each sample with the following SV callers: Manta, Wham, and/or MELT. Delly can be enabled but is no longer officially supported. For guidance on pre-filtering prior to GatherSampleEvidence
, refer to the Sample Exclusion section.
Note: a list of sample IDs must be provided. Refer to the sample ID requirements for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.
Inputs:
- Per-sample BAM or CRAM files aligned to hg38. Index files (`.bai`) must be provided if using BAMs.
Outputs:
- Caller VCFs (Delly, Manta, MELT, and/or Wham)
- Binned read counts file
- Split reads (SR) file
- Discordant read pairs (PE) file
Formerly Module00b
Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching.
For large cohorts, this workflow can be run on arbitrary cohort partitions of up to about 500 samples. Afterwards, we recommend using the results to divide samples into smaller batches (~100-500 samples) with ~1:1 male:female ratio. Refer to the Batching section for further guidance on creating batches.
We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies.
Inputs:
- Read count files (GatherSampleEvidence)
- (Optional) SV call VCFs (GatherSampleEvidence)
Outputs:
- Per-sample dosage scores with plots
- Median coverage per sample
- Ploidy estimates and sex assignments, with plots
- (Optional) Outlier samples detected by call counts
The purpose of sample filtering at this stage after EvidenceQC is to prevent very poor quality samples from interfering with the results for the rest of the callset. In general, samples that are borderline are okay to leave in, but you should choose filtering thresholds to suit the needs of your cohort and study. There will be future opportunities (as part of FilterBatch) for filtering before the joint genotyping stage if necessary. Here are a few of the basic QC checks that we recommend:
- Look at the X and Y ploidy plots, and check that sex assignments match your expectations. If there are discrepancies, check for sample swaps and update your PED file before proceeding.
- Look at the dosage score (WGD) distribution and check that it is centered around 0 (the WGD distribution is expected to be slightly below 0 for PCR- samples and slightly above 0 for PCR+ samples; refer to the gnomAD-SV paper for more information on the WGD score). Optionally filter outliers.
- Look at the low outliers for each SV caller (samples with much lower than typical numbers of SV calls per contig for each caller). An empty low outlier file means there were no outliers below the median and no filtering is necessary. Check that no samples had zero calls.
- Look at the high outliers for each SV caller and optionally filter outliers; samples with many more SV calls than average may be poor quality.
Trains a gCNV model for use in GatherBatchEvidence. The WDL can be found at `/wdl/TrainGCNV.wdl`. See the gCNV training overview for more information.
Prerequisites:
- GatherSampleEvidence
- (Recommended) EvidenceQC
Inputs:
- Read count files (GatherSampleEvidence)
Outputs:
- Contig ploidy model tarball
- gCNV model tarballs
Formerly Module00c
Runs CNV callers (cn.MOPS, GATK gCNV) and combines single-sample raw evidence into a batch. See above for more information on batching.
Prerequisites:
- GatherSampleEvidence
- (Recommended) EvidenceQC
- gCNV training
Inputs:
- PED file (updated with EvidenceQC sex assignments, including sex = 0 for sex aneuploidies. Calls will not be made on sex chromosomes when sex = 0 in order to avoid generating many confusing calls or upsetting normalized copy numbers for the batch.)
- Per-sample indexed GVCFs generated with HaplotypeCaller (`gvcfs` input), or a jointly-genotyped VCF (position-sharded, `snp_vcfs` input or `snp_vcfs_shard_list` input). The jointly-genotyped VCF may contain multi-allelic sites and indels, but only biallelic SNVs will be used by the pipeline. We recommend shards of 10 GB or less to lower compute time and resources.
- Read count, BAF, PE, and SR files (GatherSampleEvidence)
- Caller VCFs (GatherSampleEvidence)
- Contig ploidy model and gCNV model files (gCNV training)
Outputs:
- Combined read count matrix, SR, PE, and BAF files
- Standardized call VCFs
- Depth-only (DEL/DUP) calls
- Per-sample median coverage estimates
- (Optional) Evidence QC plots
Formerly Module01
Clusters SV calls across a batch.
Inputs:
- Standardized call VCFs (GatherBatchEvidence)
- Depth-only (DEL/DUP) calls (GatherBatchEvidence)
Outputs:
- Clustered SV VCFs
- Clustered depth-only call VCF
Formerly Module02
Generates variant metrics for filtering.
Inputs:
- Combined read count matrix, SR, PE, and BAF files (GatherBatchEvidence)
- Per-sample median coverage estimates (GatherBatchEvidence)
- Clustered SV VCFs (ClusterBatch)
- Clustered depth-only call VCF (ClusterBatch)
Outputs:
- Metrics file
Formerly Module03
Filters poor-quality variants and outlier samples. This workflow can be run all at once with the WDL at `wdl/FilterBatch.wdl`, or it can be run in three steps to enable tuning of outlier filtration cutoffs. The three subworkflows are:
- FilterBatchSites: Per-batch variant filtration
- PlotSVCountsPerSample: Visualize SV counts per sample per type to help choose an IQR cutoff for outlier filtering, and preview outlier samples for a given cutoff
- FilterBatchSamples: Per-batch outlier sample filtration; provide an appropriate `outlier_cutoff_nIQR` based on the SV count plots and outlier previews from step 2 (a hypothetical example is sketched after this list).
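A purely hypothetical sketch of setting this cutoff in an input JSON; the value shown is illustrative, and the exact workflow-level key depends on whether you run `FilterBatch.wdl` or the `FilterBatchSamples` step directly:

```
{
  "FilterBatch.outlier_cutoff_nIQR": 6
}
```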
Inputs:
- Batch PED file
- Metrics file (GenerateBatchMetrics)
- Clustered SV and depth-only call VCFs (ClusterBatch)
Outputs:
- Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
- Filtered depth-only call VCF with outlier samples excluded
- Random forest cutoffs file
- PED file with outlier samples excluded
Formerly MergeCohortVcfs
Combines filtered variants across batches. The WDL can be found at `/wdl/MergeBatchSites.wdl`.
Inputs:
- List of filtered PESR VCFs (FilterBatch)
- List of filtered depth VCFs (FilterBatch)
Outputs:
- Combined cohort PESR and depth VCFs
Formerly Module04
Genotypes a batch of samples across unfiltered variants combined across all batches.
Inputs:
- Batch PESR and depth VCFs (FilterBatch)
- Cohort PESR and depth VCFs (MergeBatchSites)
- Batch read count, PE, and SR files (GatherBatchEvidence)
Outputs:
- Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
- Filtered depth-only call VCF with outlier samples excluded
- PED file with outlier samples excluded
- List of SR pass variants
- List of SR fail variants
- (Optional) Depth re-genotyping intervals list
Formerly Module04b
Re-genotypes probable mosaic variants across multiple batches.
Inputs:
- Per-sample median coverage estimates (GatherBatchEvidence)
- Pre-genotyping depth VCFs (FilterBatch)
- Batch PED files (FilterBatch)
- Cohort depth VCF (MergeBatchSites)
- Genotyped depth VCFs (GenotypeBatch)
- Genotyped depth RD cutoffs file (GenotypeBatch)
Outputs:
- Re-genotyped depth VCFs
Formerly Module0506
Combines variants across multiple batches, resolves complex variants, re-genotypes, and performs final VCF clean-up.
Prerequisites:
- GenotypeBatch
- (Optional) RegenotypeCNVs
Inputs:
- RD, PE, and SR file URIs (GatherBatchEvidence)
- Batch filtered PED file URIs (FilterBatch)
- Genotyped PESR VCF URIs (GenotypeBatch)
- Genotyped depth VCF URIs (GenotypeBatch or RegenotypeCNVs)
- SR pass variant file URIs (GenotypeBatch)
- SR fail variant file URIs (GenotypeBatch)
- Genotyping cutoff file URIs (GenotypeBatch)
- Batch IDs
- Sample ID list URIs
- Finalized "cleaned" VCF and QC plots
Module 07 (in development)
Applies downstream filtering steps to the cleaned VCF to further control the false discovery rate; all steps are optional, and users should decide which to apply based on the specific goals of their project.
Filtering methods include:
- minGQ - remove variants based on the genotype quality across populations. Note: trio families are required to build the minGQ filtering model in this step. For projects that lack family structures, we provide tables pre-trained with 1000 Genomes samples at different FDR thresholds:
  - gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
  - gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
  - gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
- BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
- FilterOutlierSamplesPostMinGQ - remove outlier samples with unusually high or low number of SVs
- FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation
AnnotateVcf (in development)
Formerly Module08Annotation
Adds annotations, such as the inferred function and allele frequencies of variants, to the final VCF.
Annotation methods include:
- Functional annotation - annotate SVs with their inferred functional impact on protein-coding regions, regulatory regions such as UTRs and promoters, and other non-coding elements
- Allele frequency annotation - annotate SVs with their allele frequencies across all samples, by sex, and by sub-population
- Allele frequency annotation with an external callset - annotate SVs with the allele frequencies of overlapping SVs in another callset, e.g. the gnomAD-SV callset
Module 09 (in development)
Visualizes SVs with IGV screenshots and read depth plots.
Visualization methods include:
- RD visualization - generate RD plots across all samples, ideal for visualizing large CNVs
- IGV visualization - generate IGV plots of each SV for individual samples, ideal for visualizing small de novo SVs
- `Module09.visualize.wdl` - generates both RD and IGV plots and combines them for easy review
This repository is maintained following the norms of
continuous integration (CI) and continuous delivery (CD).
GATK-SV CI/CD is developed as a set of GitHub Actions workflows that are available under the `.github/workflows` directory. Please refer to the workflows' README for their current coverage and setup.
- Default pipeline settings are tuned for batches of 100 samples. Larger batches or cohorts may require additional VM resources. Most runtime attributes can be modified through the `RuntimeAttr` inputs. These are formatted like this in the input JSON:
"MyWorkflow.runtime_attr_override": {
"disk_gb": 100,
"mem_gb": 16
},
Note that any subset of the struct attributes can be specified. See `wdl/Structs.wdl` for the available attributes.
Example error message from `GatherSampleEvidence.MELT.GetWgsMetrics`:

```
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: The requested index 701766 is out of counter bounds. Possible cause of exception can be wrong READ_LENGTH parameter (much smaller than actual read length)
```
This error message was observed for a sample with an average read length of 117, but for which half the reads were of length 90 and half were of length 151. As a workaround, override the calculated read length by providing a `read_length` input of 151 (or the expected read length for the sample in question) to GatherSampleEvidence.
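A sketch of that override in the input JSON (the exact fully-qualified key depends on the calling workflow):

```
{
  "GatherSampleEvidence.read_length": 151
}
```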