This repository contains the Kids First Data Resource Center (DRC) Somatic Variant Workflow, which includes somatic variant (SNV), copy number variation (CNV), and structural variant (SV) calls. Benchmarking of our SNV callers and consensus methods can be found here. The somatic variant workflow takes aligned cram input and performs somatic variant calling using Strelka2, Mutect2, Lancet, and VarDict Java, CNV estimation using Control-FREEC, CNVkit, and GATK, and SV calls using Manta. For whole genome sequencing data, the workflow will also predict extra chromosomal DNA (ecDNA) using AmpliconArchitect Somatic variant call results are annotated with hotspots, assigned population frequencies using gnomAD AF, calculated gene models using Variant Effect Predictor (VEP), then added an additional MAF output using a modified version of Memorial Sloan Kettering Cancer Center's (MSKCC) vcf2maf. See annotation workflow doc for more details on annotation.
If you would like to run this workflow using the CAVATICA public app, a basic primer on running public apps can be found here.
Alternatively, if you'd like to run it locally using cwltool
, a basic primer on that can be found here and combined with app-specific info from the readme below.
This workflow is the current production workflow, equivalent to this CAVATICA public app.
Caller | CNV | SNV | SV | ecDNA |
---|---|---|---|---|
AmpliconArchitect | x | |||
CNVkit | x | |||
Control-FREEC | x | |||
GATK CNV | x | |||
Lancet | x | |||
Manta | x | |||
Mutect2 | x | |||
Strelka2 | x | |||
THeTa2 | x | |||
VarDict | x |
- Strelka2
v2.9.3
, from Illumina, calls SNVs and insertions/deletions (INDEL)- See the subworkflow doc for more information
- Mutect2
v4.1.1.0
, from the Broad Institute, calls SNVs, multi-nucleotide variants (MNV: equal length substitutions over 1 base long), and INDELs- See the subworkflow doc for more information
- Lancet
v1.0.7
, from the New York Genome Center (NYGC), calls SNVs, MNVs, and INDELs- See the subworkflow doc for more information
- VarDict Java
v1.7.0
, from AstraZeneca, calls SNVs, MNVs, INDELs and more- See the subworkflow doc for more information
Each caller has a different approach to variant calling, and together one can glean confident results. To synthesize the results from these four callers we have developed a consensus calling workflow. For details, please refer to the workflow documentation.
Somatic variant call results are annotated with hotspots, assigned population frequencies using gnomAD AF, calculated gene models using VEP, then added an additional MAF output using a modified version of MSKCC vcf2maf.
The somatic annotation workflow is included in this repository as a submodule found here. To get the necessary files when downloading the repository, make sure use recursive flag for git clone: git clone --recursive
. For details on the workflow and its components, please refer to the workflow documentation. The source repository for the annotation workflows can be found here.
- Manta
v1.4.0
, from Illumina, is used to call SVs. Output is also in VCF format, with calls filtered onPASS
.
- AnnotSV
v3.1.1
, from Université de Strasbourg, is used to annotate the calls in the Manta PASS VCF.
- Control-FREEC
v11.6
, from the Bioinformatics Laboratory of Institut Curie and Boeva Lab, detects copy-number changes and allelic imbalances (including LOH)- The tool portion of the workflow is a port from the Velsera team, with a slight tweak in image outputs
- Also, the workflow wrapper limits what inputs and outputs are used based on our judgement of utility
- CNVkit
v0.9.3
, from UCSF, infers and visualizes copy number - THeTa2
v0.7.1
, from Brown University and CHOP D3b, is used to inform and adjust copy number calls from CNVkit with purity estimations - GATK CNV
v4.2.4.1
, from the Broad Institute, calls somatic CNVs using a Panel of Normals (PON) created using this workflow- Read more about our the workflow and our PON recommendations in the subworkflow doc
Control-FREEC and CNVkit can take advantage of B-Allele Frequency (BAF) integration for copy number genotype estimation and increased CNV accuracy. That is done using the following process:
- create a gVCF using either:
- Sentieon Alignment and gVCF Workflow
- For details, please refer to the workflow documentation
- To run on CAVATICA, see our public app
- BWA Alignment and GATK HaplotypeCaller Workflow
- For details, please refer to the workflow documentation
- To run on CAVATICA, see our public app
- Sentieon Alignment and gVCF Workflow
- create a VCF by perform genotype refinement on the gVCF:
- Single Sample Genotyping Workflow
- To run on CAVATICA, see our public app
- Single Sample Genotyping Workflow
- Pass that VCF to the
b_allele
workflow input
- AmpliconArchitect, from UCSD, is used to predict ecDNAs
- Workflow documentation available here
The somatic variant workflow is designed to be able to process either WGS or
WXS inputs. This functionality comes from usage of the wgs_or_wxs
input enum.
Depending on what is provided for this input, the tool will set the appropriate
default values and check that the user has provided the correct inputs. For
example, if the user sets the input to WGS the lancet_padding
value will be
set to 300; alternatively, if the user sets the input to WXS the
lancet_padding
value will be defaulted to 0. In either case, the user can
override the defaults simply by providing their own value for lancet_padding
in the inputs.
The wgs_or_wxs
flag also plays an integral role in the preparation of
intervals throughout the pipeline.
In order to perform variant calling, the user must provide some of these intervals:
calling_regions
: genomic regions over which the SNV and SV callers will be runblacklist_regions
: genomic regions to remove from the SNV and SV calling regionscoding_sequence_regions
: genomic regions covering the coding sequence regions of the genome to which the input BAM/CRAM/SAM was alignedcnv_blacklist_regions
: genomic regions to remove from the CNV calling regions
The calling is the base calling regions for the sample. These regions are
further modified by the two blacklist files. The blacklist regions first
removed from the calling regions and the resulting regions are what are passed
to callers. SNV and SV callers receive the regions that are calling_regions
with blacklist_regions
removed. CNV callers receive the regions that are
calling_regions
with cnv_blacklist_regions
removed.
In general, what is expected of the user to provide for these intervals is:
- WGS
calling_regions
: Whole genome calling regions (for GATK these are the regions of the FASTA onchr1-22,X,Y
where there is no span of more than 200 Ns; for completeness we addchrM
)blacklist_regions
: Regions that the user is confident are of no interest to SNV and SV calling, if any. Telomeres, centromeres, high signal regions, repetitive regions, etc.cnv_blacklist_regions
: Regions of thecalling_regions
the user is confident are of no interest to CNV calling, if any. These regions include chrM, telomeres, centromeres, high signal regions, repetitive regions, etc.coding_sequence_regions
: Regions of the genome that GENCODE defines as CDS; can be determined from their GTF
- WXS
calling_regions
: The experimental bait capture regionsblacklist_regions
: Regions of thecalling_regions
that the user is confident are of no interest to SNV and SV calling, if anycnv_blacklist_regions
Regions of the capture region that the user is confident are of no interest to CNV calling, if anycoding_sequence_regions
: This input is not used in WXS runs. Providing it does nothing
The coding sequence regions are only relevant for calling WGS data using Lancet. Presenting Lancet with intervals spanning the whole genome will lead to extreme runtime and cost. By starting from an limited calling region and supplementing with calls from Mutect2 and/or Strelka2, Lancet is able to run on the scale of hours rather than days.
There's a lot more to talk about with intervals, how they are prepared, and some finer details. If you are interested, see our interval preparation walkthrough.
By default, the workflow will attempt to run every caller. However, the user can modify this behavior using various inputs. These inputs, and their behavior, are as follows:
wgs_or_wxs
: WGS or WXS mode.- WXS mode disables AmpliconArchitect, even if
run_amplicon_architect
is set to true - Activates appropriate interval processing
- WXS mode disables AmpliconArchitect, even if
run_vardict
: Set to false to disable VarDict.- Warning! Enabling THeTa2 with VarDict disabled will throw an error!
run_mutect2
: Set to false to disable Mutect2.- Warning! Enabling Lancet in WGS mode with Mutect2 and Strelka2 disabled will throw an error!
run_strelka2
: Set to false to disable Strelka2.- Warning! Enabling Lancet in WGS mode with Strelka2 and Mutect2 disabled will throw an error!
run_lancet
: Set to false to disable Lancet.run_controlfreec
: Set to false to disable Control-FREEC.run_cnvkit
: Set to false to disable CNVkit.- Warning! Enabling AmpliconArchitect with CNVkit disabled will throw an error!
- Warning! Enabling THeTa2 with CNVkit disabled will throw an error!
run_amplicon_architect
: Set to false to disable AmpliconArchitect.run_theta2
: Set to false to disable THeTa2.run_manta
: Set to false to disable Manta.run_gatk_cnv
: Set to false to disable GATK CNV.
The first step of the workflow will check the user inputs and throw errors for impossible scenarios:
- Enabling Lancet in WGS without enabling either Mutect2 or Strelka2 or user providing existing calls to the
lancet_input_vcf
input - Enabling AmpliconArchitect without enabling CNVkit or providing the Mosek config file
- Enabling THeTa2 without enabling both VarDict and CNVkit
- Enabling GATK CNV without providing a Panel of Normals
-
For input cram files, be sure to have indexed them beforehand
-
When in doubt, all of our reference files can be obtained from here: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
-
For Control-FREEC, it is highly recommended that you supply a VCF file with germline calls, GATK HaplotypeCaller recommended. Please also make sure the index for this file is available. Also, a range of input ploidy possibilities for the inputs are needed. You can simply use
2
, or put in a range, as an array, like 2, 3, 4. For mate orientation, you will need to specify, the drop down and tool doc explains your options. -
As a CAVATICA app, default references for hg38 are already pre-populated, as well as some default settings - i.e., number of threads, coefficient of variation input for Control-FREEC, and
PASS
filter tool mode. -
If running Lancet for WGS, and you have existing variant calls that you'd like to use from a previous Strelka2 and/or Mutect2 run for example, provide them to
lancet_input_vcf
input. Otherwise you must also enablerun_strelka2
and/orrun_mutect2
-
select_vars_mode
: On occasion, using GATK'sSelectVariants
tool will fail, so a simplegrep
mode onPASS
can be used instead. Related,bcftools_filter_vcf
is built in as a convenience in case your b allele frequency file has not been filtered onPASS
. You can use theinclude_expression
Filter="PASS"
to achieve this. -
The
SM:
sample IDs in the input alignment files must matchinput_tumor_name
andinput_normal_name
. If you need the output VCF sample names to differ from the input, enter what is currently inSM:
into theold_tumor_name
andold_normal_name
fields, and enter the desired sample names ininput_tumor_name
andinput_normal_name
. The workflow will rename the samples in the VCF for you. -
Our reference inputs can be obtained with an account from CAVATICA. Below are some directions on how to make them:
reference_fasta
: Homo_sapiens_assembly38.fasta - need a valid google account, this is a link to the resource bundle from Broad GATKreference_dict
: Homo_sapiens_assembly38.dict - need a valid google account, this is a link to the resource bundle from Broad GATKcalling_regions
: wgs_calling_regions.hg38.interval_list - need a valid google account, this is a link to the resource bundle from Broad GATK. To create our canonical calling intervals, edit this file by leaving only entries related to chr1-22,X,Y,M. M may need to be added.cnv_blacklist_regions
:somatic-hg38_CNV_and_centromere_blacklist.hg38liftover.list
Blacklist regions that include centromeres to exclude from CNV callingcoding_sequence_regions
:GRCh38.gencode.v31.CDS.merged.bed
For Lancet WGS, it's highly recommended to use CDS bed as the starting point and supplement with the regions of calls from Strelka2 & Mutect2. Our CDS regions were obtained from GENCODE, release 31 using this GTF file gencode.v31.primary_assembly.annotation.gtf.gz and parsing features forUTR
,start codon
,stop codon
, andexon
, then using bedtools sort and merge after converting coordinates into bed format.cnvkit_annotation_file
: refFlat_HG38.txt gunzip this file from UCSCaf_only_gnomad_vcf
: af-only-gnomad.hg38.vcf.gz - need a valid google account, this is a link to the best practices google bucket from Broad GATK.exac_common_vcf
: small_exac_common_3.hg38.vcf.gz - need a valid google account, this is a link to the best practices google bucket from Broad GATK.vep_cache
:homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz
from VEP release 105 - variant effect predictor cache. Current production workflow uses this version.genomic_hotspots
:tert.bed
. Tab-delimited BED formatted file(s) containing hg38 genomic positions corresponding to hotspotsprotein_snv_hotspots
:kfdrc_protein_snv_cancer_hotspots_20240718.txt
. Column-name-containing, tab-delimited file(s) containing protein names and amino acid positions corresponding to hotspots. File header contains generation historyprotein_indel_hotspots
: protein_indel_cancer_hotspots_v2.ENS105_liftover.tsv. Column-name-containing, tab-delimited file(s) containing protein names and amino acid position ranges corresponding to hotspots.echtvar_anno_zips
:gnomad_3_1_1.vwb_subset.echtvar_0_1_9.zip
.Echtvar annotation ZIP file. For more info on how this file was created, see our Ectvar documentation.custom_enst
:kf_isoform_override.tsv
. As of VEP 104, several genes have had their canonical transcripts redefined. While the VCF will have all possible isoforms, this affects MAF file output and may results in representative protein changes that defy historical expectationsfuncotator_data_sources_tgz
: funcotator_dataSources.v1.6.20190124s.tar.gz - need a valid google account, this is a link to pre-packaged datasources from the Broad Institute. Any of the tar.gz files will do.annotsv_annotations_dir_tgz
: [annotsv_311_annotations_dir.tgz] - These annotations are simply those from the install-human-annotation installation process run during AnnotSV installation. Specifically these are the annotations installed with v3.1.1 of the software. Newer or older annotations can be slotted in here as needed.aa_data_repo
: GRCh38_indexed.tar.gz - AmpliconArchitect reference TAR ball. Only needed/used if input is WGSmosek_license_file
: mosek.lic - required if AmpliconArchitect is run. License is good for one year and is renewable
-
There are some flags that we set as defaults. Feel free to alter these if your experiment calls for something else:
cfree_coeff_var
: 0.05cfree_contamination_adjustment
: FALSEbcftools_public_filter
:'FILTER="PASS"|INFO/HotSpotAllele=1'
. This phrase will allowPASS
only orHotSpotAllele
variants into the public version of variant call output.disable_hotspot_annotation
: falsemaf_center
:"."
. Sequencing center of variant calledaa_data_ref_version
:"GRCh38"
. Genome reference version used
-
There are some flags that need to be set by the user at runtime. These are our recommendations:
gatk_filter_name
:["NORM_DP_LOW", "GNOMAD_AF_HIGH"]
. These correspond to the recommended filter expression.gatk_filter_expression
:["vc.getGenotype('<input_normal_name> ').getDP() <= 7"), "gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF > 0.001 && gnomad_3_1_1_FILTER=='PASS'"]
. Array of filter expressions to establish criteria to tag variants with. See annotation subworkflow docs for a more detailed explanation. For more information on filter expressions, see the GATK JEXL docs.extra_arg
: This can be used to add special params to Strelka2. It is currently more of an "unsticking param". For edge cases where strelka2 seems to hang, setting this to--max-input-depth 10000
can balance performance and consistency in results
-
Output files (Note, all VCF files that don't have an explicit index output have index files output as as secondary file. In other words, they will be captured at the end of the workflow):
- Simple variant callers
- Strelka2
strelka2_prepass_vcf
: Combined SNV + INDEL file with renamed Sample IDs. Has all softFILTER
values generated by variant callerstrelka2_protected_outputs
: Array of files containing MAF format of PASS hits,PASS
VCF with annotation pipeline softFILTER
-added values, and VCF indexstrelka2_public_outputs
: Same as above, except MAF and VCF have had entries with softFILTER
values removed
- Mutect2
mutect2_filtered_vcf
: VCF with SNV, MNV, and INDEL variant calls. Contains all softFILTER
values generated by variant caller. Use this file if you believe important variants are being left out when using the algorithm'sPASS
filtermutect2_protected_outputs
: Array of files containing MAF format of PASS hits,PASS
VCF with annotation pipeline softFILTER
-added values, and VCF indexmutect2_public_outputs
: Same as above, except MAF and VCF have had entries with softFILTER
values removed
- VarDict Java
vardict_prepass_vcf
: VCF with SNV, MNV, and INDEL variant calls. Contains all softFILTER
values generated by variant caller. Use this file if you believe important variants are being left out when using the algorithm'sPASS
filtervardict_protected_outputs
: Array of files containing MAF format of PASS hits,PASS
VCF with annotation pipeline softFILTER
-added values, and VCF indexvardict_public_outputs
: Same as above, except MAF and VCF have had entries with softFILTER
values removed
- Lancet
lancet_prepass_vcf
: VCF with SNV, MNV, and INDEL variant calls. Contains all softFILTER
values generated by variant caller, Use this file if you believe important variants are being left out when using the algorithm'sPASS
filterlancet_protected_outputs
: Array of files containing MAF format of PASS hits,PASS
VCF with annotation pipeline softFILTER
-added values, and VCF indexlancet_public_outputs
: Same as above, except MAF and VCF have had entries with softFILTER
values removed
- Strelka2
- Structural variant callers
- Manta
manta_pass_vcf
: SV call filtered onPASS
, from mantamanta_pass_tbi
: Index file of above bgzipped VCFmanta_prepass_vcf
: SV results with allFILTER
categories for manta. Use this file if you believe important variants are being left out when using the algorithm'sPASS
filterannotsv_annotated_calls
: This file contains all records from themanta_pass_vcf
that AnnotSV could annotate.annotsv_unannotated_calls
: This file contains all records from themanta_pass_vcf
that AnnotSV could not annotate.
- Manta
- Copy number variation callers
- ControlFREEC
ctrlfreec_pval
: CNV calls with copy number and p value confidence, a qualitative "gain, loss, neutral" assignment, and genotype with uncertainty assigned from Control-FREEC. See author manual for more details.ctrlfreec_config
: Config file used to run Control-FREEC. Has some useful information on what parameters were used to run the tool.ctrlfreec_pngs
: Plots of BAF log2 ratio and ratio of tumor/normal copy number coverage. Pink line in the middle of ratio plots is the median ratio.ctrlfreec_bam_ratio
: BAM ratio text file. Contain ratio, median ratio (used to informctrlfreec_pval
), CNV estimates, BAF estimate, and genotype estimate.ctrlfreec_bam_seg
: In-house generated SEG file based on ratio file. Provided as a convenience for compatibility with tools that require inputs in legacy microarray format.ctrlfreec_baf
: BAF estimations.ctrlfreec_info
: Contains useful run time information, like ploidy used for analysis, and window size
- CNVkit
cnvkit_cnr
: Copy number ratiocnvkit_cnn_output
: Normal/control sample copy numbercnvkit_calls
: Tumor/sample copy numbercnvkit_metrics
: Basic SEG count and call statscnvkit_gainloss
: Per-gene log2 ratiocnvkit_seg
: Classic microarray-style SEG filetheta2_calls
: Purity-adjusted CNVkit copy number calls based on THeTa2 resultstheta2_seg
: Purity-adjusted CNVkit SEG file based on THeTa2 resultstheta2_subclonal_results
: THeTa2 subclone purity resultstheta2_subclonal_cns
: THeTa2 sublone CNStheta2_subclone_seg
: THeTa2 subclone SEG file
- GATK CNV
gatk_copy_ratio_segments_tumor
: Called copy-ratio-segments file for the tumor sample. This is a tab-separated values (TSV) file with SAM-style header containing a read group sample name, a sequence dictionary, a row specifying the column headers output byCallCopyRatioSegments
and the corresponding entry rows.gatk_copy_ratio_segments_normal
: Called copy-ratio-segments file for the normal sample. This is a tab-separated values (TSV) file with SAM-style header containing a read group sample name, a sequence dictionary, a row specifying the column headers output byCallCopyRatioSegments
and the corresponding entry rows.gatk_cnv_denoised_tumor_plot
: Denoised-plot file that covers the entire range of the copy ratiosgatk_cnv_denoised_normal_plot
: Denoised-plot file that covers the entire range of the copy ratiosgatk_cnv_funcotated_called_file_tumor
: TSV where each row is a segment and the annotations are the covered genes and which genes+exon is overlapped by the segment breakpoints.gatk_cnv_funcotated_called_gene_list_file_tumor
: TSV where each row is a gene and the annotations are the covered genes and which genes+exon is overlapped by the segment breakpoints.
- ControlFREEC
- ecDNA Prediction
aa_summary
: Summary for all amplicons detected by AmpliconArchitectaa_cycles
: Text file for each amplicon listing the edges in the breakpoint graph, their categorization (sequence, discordant source) and their copy countsaa_graph
: A text file for each amplicon listing the edges in the breakpoint graph, their categorization (sequence, discordant, concordant, source) and their copy countsaa_sv_png
: PNG image file displaying the SV view of AmpliconArchitectaa_classification_profiles
: abstract classification of the amplicon
- Simple variant callers
-
Docker images - the workflow tools will automatically pull them, but as a convenience are listed below:
Strelka2
: pgc-images.sbgenomics.com/d3b-bixu/strelkaadd common fields to Strelka2
: pgc-images.sbgenomics.com/d3b-bixu/add-strelka2-fields:1.0.0Mutect2
andGATK SNV
tools: pgc-images.sbgenomics.com/d3b-bixu/gatk:4.1.1.0GATK CNV
: broadinstitute/gatk:4.2.4.1Lancet
: pgc-images.sbgenomics.com/d3b-bixu/lancet:1.0.7VarDict Java
: pgc-images.sbgenomics.com/d3b-bixu/vardict:1.7.0Control-FREEC
: images.sbgenomics.com/vojislav_varjacic/control-freec-11-6:v1CNVkit
: images.sbgenomics.com/milos_nikolic/cnvkit:0.9.3THetA2
: pgc-images.sbgenomics.com/d3b-bixu/theta2:0.7.1samtools
: pgc-images.sbgenomics.com/d3b-bixu/samtools:1.9Variant Effect Predictor
: ensemblorg/ensembl-vep:release_105.0Kids Fist VCF2MAF
: pgc-images.sbgenomics.com/d3b-bixu/kf_vcf2maf:v1.0.3Manta
: pgc-images.sbgenomics.com/d3b-bixu/manta:1.4.0bcftools
andvcftools
: pgc-images.sbgenomics.com/d3b-bixu/bvcftools:latestannotsv
: pgc-images.sbgenomics.com/d3b-bixu/annotsv:3.1.1AmpliconArchitect
: jluebeck/prepareaa:v0.1203.10Echtvar
: pgc-images.sbgenomics.com/d3b-bixu/echtvar:0.1.9
-
For highly complex samples, some tools have shown themselves to require memory allocation adjustments: Manta, GATK LearnReadOrientationModel, GATK GetPileupSummaries, GATK FilterMutectCalls and VarDict. Optional inputs exist to expand the memory allocation for these jobs: manta_memory, learnorientation_memory, getpileup_memory, filtermutectcalls_memory, and vardict_ram, respectively. For the java tools (VarDict and GATK), these values represent limits on the memory that can be used for the respective jobs. Tasks will go to these values and not exceed it. They may succeed or fail, but they will not exceed the limit established. The memory allocations for these is hardcapped. The memory allocation option for Manta, conversely, is a soft cap. The memory requested will be allocated for the job on a particular machine but once the task is running on the machine it may exceed that requested value. For example, if Manta's memory allocation is set to 10 GB it will have 10 GB allocated to it at task creation, but, if the task ends up running on a machine with more memory available, the task may use it. Setting a value here for Manta will not prevent Manta from taking more than that value. The memory usage in Manta is limited by the machine hardware. As such the option for Manta memory allocation is described as soft cap. For more information on Manta resource usage see their documentation.
-
The optional
b_allele
file can be generated using our Single Sample Genotyping Workflow.
- dockerfiles: https://github.com/d3b-center/bixtools