This workflow annotates germline outputs with popular annotation resources. It uses VEP to annotate with the ENSEMBL v105 reference, and echtvar and bcftools to add the further annotations described below.
- Prefilter input VCF (optional) to remove variants that should not go into annotation
- Normalize VCF
- Strip pre-existing annotations (optional) to prevent downstream conflicts
- Annotate with VEP 105. Optional plugins include:
  - dbNSFP
  - CADD
- Use echtvar to annotate with an external reference (default: gnomAD 3.1.1)
- Use bcftools to annotate with another external reference (optional: ClinVar)
- Rename outputs
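The steps above can be sketched as a sequence of command lines. This is only a rough illustration and not the workflow's actual implementation: the function name `build_commands`, the intermediate file names, and the specific flag choices (for example `bcftools norm -m -any`) are assumptions made for the example.

```python
from typing import List, Optional


def build_commands(
    input_vcf: str,
    reference_fasta: str,
    vep_cache_dir: str,
    echtvar_zip: str,
    output_basename: str,
    prefilter_expr: Optional[str] = None,
    strip_columns: Optional[str] = None,
    clinvar_vcf: Optional[str] = None,
    clinvar_columns: str = "INFO/CLNSIG",
    disable_normalization: bool = False,
    merged: bool = True,
) -> List[List[str]]:
    """Return one argument list per step; optional steps are skipped when unset."""
    cmds: List[List[str]] = []
    current = input_vcf

    if prefilter_expr:  # optional prefilter
        out = f"{output_basename}.prefiltered.vcf.gz"
        cmds.append(["bcftools", "view", "--include", prefilter_expr,
                     "-O", "z", "-o", out, current])
        current = out

    if not disable_normalization:  # normalize (split multiallelics, left-align)
        out = f"{output_basename}.norm.vcf.gz"
        cmds.append(["bcftools", "norm", "-m", "-any", "-f", reference_fasta,
                     "-O", "z", "-o", out, current])
        current = out

    if strip_columns:  # optional removal of pre-existing annotations
        out = f"{output_basename}.stripped.vcf.gz"
        cmds.append(["bcftools", "annotate", "-x", strip_columns,
                     "-O", "z", "-o", out, current])
        current = out

    out = f"{output_basename}.vep.vcf.gz"
    vep = ["vep", "--cache", "--offline", "--dir_cache", vep_cache_dir,
           "--assembly", "GRCh38", "--vcf", "--compress_output", "bgzip",
           "-i", current, "-o", out]
    if merged:
        vep.append("--merged")
    cmds.append(vep)
    current = out

    out = f"{output_basename}.echtvar.vcf.gz"
    cmds.append(["echtvar", "anno", "-e", echtvar_zip, current, out])
    current = out

    if clinvar_vcf:  # optional bcftools annotation from an external VCF
        out = f"{output_basename}.clinvar.vcf.gz"
        cmds.append(["bcftools", "annotate", "-a", clinvar_vcf, "-c", clinvar_columns,
                     "-O", "z", "-o", out, current])
    return cmds
```

Returning argument lists rather than running anything keeps the sketch easy to inspect; each list could be passed to `subprocess.run` once the paths are real.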
By default, the workflow will add the following annotations:
These annotations are added with the Variant Effect Predictor (VEP), which uses the ENSEMBL reference to add gene model information as well as additional resources provided in its cache. After downloading the cache, we highly recommend converting and indexing it; this speeds up annotation and significantly reduces the memory footprint. Annotation resources in the cache include:
# CACHE UPDATED 2022-09-26 18:18:29
assembly GRCh38
bam GCF_000001405.39_GRCh38.p13_knownrefseq_alns.bam
polyphen b
sift b
source_assembly GRCh38.p13
source_gencode GENCODE 39
source_genebuild 2014-07
source_polyphen 2.2.2
source_refseq 2021-05-28 21:42:08 - GCF_000001405.39_GRCh38.p13_genomic.gff
source_sift sift5.2.2
species homo_sapiens
variation_cols chr,variation_name,failed,somatic,start,end,allele_string,strand,minor_allele,minor_allele_freq,clin_sig,phenotype_or_disease,clin_sig_allele,pubmed,var_synonyms,AFR,AMR,EAS,EUR,SAS,AA,EA,gnomAD,gnomAD_AFR,gnomAD_AMR,gnomAD_ASJ,gnomAD_EAS,gnomAD_FIN,gnomAD_NFE,gnomAD_OTH,gnomAD_SAS
source_COSMIC 94
source_HGMD-PUBLIC 20204
source_ClinVar 105202106
source_dbSNP 154
source_1000genomes phase3
source_ESP V2-SSA137
source_gnomAD r2.1.1
regulatory 1
cell_types A549,A673,B,B_(PB),CD14+_monocyte_(PB),CD14+_monocyte_1,CD4+_CD25+_ab_Treg_(PB),CD4+_ab_T,CD4+_ab_T_(PB)_1,CD4+_ab_T_(PB)_2,CD4+_ab_T_(Th),CD4+_ab_T_(VB),CD8+_ab_T_(CB),CD8+_ab_T_(PB),CMP_CD4+_1,CMP_CD4+_2,CMP_CD4+_3,CM_CD4+_ab_T_(VB),DND-41,EB_(CB),EM_CD4+_ab_T_(PB),EM_CD8+_ab_T_(VB),EPC_(VB),GM12878,H1-hESC_2,H1-hESC_3,H9_1,HCT116,HSMM,HUES48,HUES6,HUES64,HUVEC,HUVEC-prol_(CB),HeLa-S3,HepG2,K562,M0_(CB),M0_(VB),M1_(CB),M1_(VB),M2_(CB),M2_(VB),MCF-7,MM.1S,MSC,MSC_(VB),NHLF,NK_(PB),NPC_1,NPC_2,NPC_3,PC-3,PC-9,SK-N.,T_(PB),Th17,UCSF-4,adrenal_gland,aorta,astrocyte,bipolar_neuron,brain_1,cardiac_muscle,dermal_fibroblast,endodermal,eosinophil_(VB),esophagus,foreskin_fibroblast_2,foreskin_keratinocyte_1,foreskin_keratinocyte_2,foreskin_melanocyte_1,foreskin_melanocyte_2,germinal_matrix,heart,hepatocyte,iPS-15b,iPS-20b,iPS_DF_19.11,iPS_DF_6.9,keratinocyte,kidney,large_intestine,left_ventricle,leg_muscle,lung_1,lung_2,mammary_epithelial_1,mammary_epithelial_2,mammary_myoepithelial,monocyte_(CB),monocyte_(VB),mononuclear_(PB),myotube,naive_B_(VB),neuron,neurosphere_(C),neurosphere_(GE),neutro_myelocyte,neutrophil_(CB),neutrophil_(VB),osteoblast,ovary,pancreas,placenta,psoas_muscle,right_atrium,right_ventricle,sigmoid_colon,small_intestine_1,small_intestine_2,spleen,stomach_1,stomach_2,thymus_1,thymus_2,trophoblast,trunk_muscle
source_regbuild 1.0
var_type tabix
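A listing like the one above can be read programmatically to check which resources a cache provides. The following is a minimal sketch; the function name `parse_cache_info` is hypothetical, and treating `var_type tabix` as the marker of a converted and indexed cache is an assumption.

```python
from typing import Dict


def parse_cache_info(text: str) -> Dict[str, str]:
    """Parse 'key value' lines from a cache info listing, skipping comment lines."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace (space or tab); the rest is the value.
        parts = line.split(None, 1)
        info[parts[0]] = parts[1] if len(parts) > 1 else ""
    return info


example = """# CACHE UPDATED 2022-09-26 18:18:29
assembly GRCh38
source_gencode GENCODE 39
source_gnomAD r2.1.1
var_type tabix
"""

info = parse_cache_info(example)
# Assumption: 'var_type tabix' indicates the cache was converted and indexed.
is_indexed = info.get("var_type") == "tabix"
```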
Using echtvar, we annotate the following population statistics from a custom implementation of gnomAD v3.1.1 (columns are given a gnomad_3_1_1_ prefix to denote the source):
gnomad_3_1_1_AC
gnomad_3_1_1_AN
gnomad_3_1_1_AF
gnomad_3_1_1_nhomalt
gnomad_3_1_1_AC_popmax
gnomad_3_1_1_AN_popmax
gnomad_3_1_1_AF_popmax
gnomad_3_1_1_nhomalt_popmax
gnomad_3_1_1_AC_controls_and_biobanks
gnomad_3_1_1_AN_controls_and_biobanks
gnomad_3_1_1_AF_controls_and_biobanks
gnomad_3_1_1_AF_non_cancer
gnomad_3_1_1_primate_ai_score
gnomad_3_1_1_splice_ai_consequence
gnomad_3_1_1_AF_non_cancer_afr
gnomad_3_1_1_AF_non_cancer_ami
gnomad_3_1_1_AF_non_cancer_asj
gnomad_3_1_1_AF_non_cancer_eas
gnomad_3_1_1_AF_non_cancer_fin
gnomad_3_1_1_AF_non_cancer_mid
gnomad_3_1_1_AF_non_cancer_nfe
gnomad_3_1_1_AF_non_cancer_oth
gnomad_3_1_1_AF_non_cancer_raw
gnomad_3_1_1_AF_non_cancer_sas
gnomad_3_1_1_AF_non_cancer_amr
gnomad_3_1_1_AF_non_cancer_popmax
gnomad_3_1_1_AF_non_cancer_all_popmax
gnomad_3_1_1_FILTER
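Once these fields are in the INFO column, they can be used for downstream filtering. Below is a minimal sketch of a rarity check on `gnomad_3_1_1_AF_popmax`. The helper names, the 0.001 threshold, and the handling of missing values are assumptions; how missing values are encoded depends on how the echtvar archive was built, so this sketch treats an absent key, `.`, or a negative number as missing.

```python
from typing import Dict, Optional

PREFIX = "gnomad_3_1_1_"


def parse_info(info_field: str) -> Dict[str, str]:
    """Split a VCF INFO string into a dict; flag entries map to 'true'."""
    out: Dict[str, str] = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else "true"
    return out


def gnomad_value(info_field: str, name: str) -> Optional[float]:
    """Return the prefixed gnomAD field as a float, or None when missing."""
    raw = parse_info(info_field).get(PREFIX + name)
    if raw is None or raw == ".":
        return None
    try:
        value = float(raw)
    except ValueError:
        return None
    # Assumption: negative numbers are placeholders for missing values.
    return value if value >= 0 else None


def is_rare(info_field: str, max_af: float = 0.001) -> bool:
    """Treat variants absent from gnomAD as rare (a policy choice, not a rule)."""
    af = gnomad_value(info_field, "AF_popmax")
    return af is None or af < max_af
```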
This resource (dbNSFP) compiles annotations from dozens of sources for ~84M SNVs. By default, we recommend the following fields from it:
SIFT4G_pred
Polyphen2_HDIV_pred
Polyphen2_HVAR_pred
LRT_pred
MutationTaster_pred
MutationAssessor_pred
FATHMM_pred
PROVEAN_pred
VEST4_score
VEST4_rankscore
MetaSVM_pred
MetaLR_pred
MetaRNN_pred
M-CAP_pred
REVEL_score
REVEL_rankscore
PrimateAI_pred
DEOGEN2_pred
BayesDel_noAF_pred
ClinPred_pred
LIST-S2_pred
Aloft_pred
fathmm-MKL_coding_pred
fathmm-XF_coding_pred
Eigen-phred_coding
Eigen-PC-phred_coding
phyloP100way_vertebrate
phyloP100way_vertebrate_rankscore
phastCons100way_vertebrate
phastCons100way_vertebrate_rankscore
TWINSUK_AC
TWINSUK_AF
ALSPAC_AC
ALSPAC_AF
UK10K_AC
UK10K_AF
gnomAD_exomes_controls_AC
gnomAD_exomes_controls_AN
gnomAD_exomes_controls_AF
gnomAD_exomes_controls_nhomalt
gnomAD_exomes_controls_POPMAX_AC
gnomAD_exomes_controls_POPMAX_AN
gnomAD_exomes_controls_POPMAX_AF
gnomAD_exomes_controls_POPMAX_nhomalt
Interpro_domain
GTEx_V8_gene
GTEx_V8_tissue
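The field list above is passed to VEP as part of the dbNSFP plugin argument, which takes the form `dbNSFP,<file>,<field1>,<field2>,...`. A minimal sketch of building that argument from a CSV string follows; the function name `dbnsfp_plugin_args` is hypothetical, and the workflow itself may assemble the argument differently.

```python
from typing import List


def dbnsfp_plugin_args(dbnsfp_path: str, fields: str = "ALL") -> List[str]:
    """Build the VEP --plugin argument for dbNSFP from a CSV string of fields."""
    names = [f.strip() for f in fields.split(",") if f.strip()]
    if not names:
        raise ValueError("at least one dbNSFP field (or ALL) is required")
    return ["--plugin", ",".join(["dbNSFP", dbnsfp_path] + names)]


args = dbnsfp_plugin_args("dbNSFP4.3a_grch38.gz", "SIFT4G_pred, REVEL_score")
```

Stripping whitespace around each name makes the helper tolerant of hand-edited CSV strings.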
Using a VEP plugin, we add Combined Annotation Dependent Depletion (CADD) scores.
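Plugin results end up in the pipe-delimited CSQ entry that VEP writes to the INFO column, with field order given by the `Format:` string in the VCF header. The sketch below shows one way to pull a CADD score out of it. The helper names are hypothetical, the header line is a shortened made-up example, and `CADD_PHRED`/`CADD_RAW` are the field names the CADD plugin is assumed to emit here.

```python
import re
from typing import Dict, List


def csq_format(header_line: str) -> List[str]:
    """Extract the CSQ field names from the '##INFO=<ID=CSQ,...>' header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' description found in header line")
    return match.group(1).split("|")


def csq_records(info_field: str, fields: List[str]) -> List[Dict[str, str]]:
    """Return one dict per CSQ entry (one entry per transcript/allele pair)."""
    for item in info_field.split(";"):
        if item.startswith("CSQ="):
            return [dict(zip(fields, entry.split("|")))
                    for entry in item[len("CSQ="):].split(",")]
    return []


# Shortened, hypothetical header; a real one lists many more fields.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
          'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|CADD_PHRED|CADD_RAW">')
fields = csq_format(header)
records = csq_records("DP=30;CSQ=A|missense_variant|GENE1|23.4|3.1", fields)
```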
ClinVar is a curated resource with annotations of clinical significance per variant. Note that, for this pipeline, the default reference was modified by:
- Switching from `1` chromosome nomenclature to `chr1`, and especially `MT` -> `chrM`
- Removing the entry assigned to `NW_009646201.1`. The entry is benign and its contig is not present in our FASTA reference.

We recommend the following:
ALLELEID
CLNDN
CLNDNINCL
CLNDISDB
CLNDISDBINCL
CLNHGVS
CLNREVSTAT
CLNSIG
CLNSIGCONF
CLNSIGINCL
CLNVC
CLNVCSO
CLNVI
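The two reference modifications described above (chromosome renaming and dropping the `NW_009646201.1` entry) can be sketched on plain VCF text lines. This is an illustration only; the helper names are hypothetical, the actual reference may well have been produced with other tooling, and header lines (including any `##contig` entries) are passed through unchanged here.

```python
from typing import Iterable, Iterator

# Contigs absent from the FASTA reference, per the description above.
DROPPED_CONTIGS = {"NW_009646201.1"}


def to_chr_name(contig: str) -> str:
    """Map '1' -> 'chr1' and 'MT' -> 'chrM'; leave already-prefixed names alone."""
    if contig == "MT":
        return "chrM"
    if contig.startswith("chr"):
        return contig
    return "chr" + contig


def fix_clinvar_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines with renamed chromosomes, skipping dropped contigs."""
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines are not rewritten in this sketch
            continue
        contig, sep, rest = line.partition("\t")
        if contig in DROPPED_CONTIGS:
            continue
        yield to_chr_name(contig) + sep + rest
```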
The InterVar reference is a custom reference generated by the authors of the InterVar tool. It contains only exonic SNPs. To utilize the full capabilities of their classification, you must run the tool itself.
The workflow takes the following inputs:
- `indexed_reference_fasta` files: Homo_sapiens_assembly38.fasta, Homo_sapiens_assembly38.dict, Homo_sapiens_assembly38.fasta.fai
- `input_vcf`: Input VCF file to annotate
- `output_basename`: string prefix of outputs
- `tool_name`: short descriptive string of tool output being annotated
- `echtvar_anno_zips` file array: Annotation ZIP files for echtvar anno
- `vep_cache` file: homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz
- `merged` boolean: Set to true if merged cache used, default: true
- `run_cache_existing` boolean: Run the check_existing flag for cache, default: true
- `run_cache_af` boolean: Run the allele frequency flags for cache, default: true
- `run_stats` boolean: Create stats file. Disable for speed, default: false
- `bcftools_prefilter_csv`: CSV of bcftools filter params if you want to prefilter before annotation
- `disable_normalization` boolean: Skip normalizing if input is already normalized, default: false
- `bcftools_strip_columns`: CSV string of columns to strip if needed to avoid conflict, e.g. INFO/AF
- `vep_ram` int: In GB, may need to increase this value depending on the size/complexity of input, default: 48
- `vep_cores` int: Number of cores to use. May need to increase for really large inputs, default: 32
- `vep_buffer_size` int: Increase or decrease to balance speed and memory usage, default: 100000
- `bcftools_annot_clinvar_columns`: CSV string of columns from annotation to port into the input VCF, default: INFO/ALLELEID,INFO/CLNDN,INFO/CLNDNINCL,INFO/CLNDISDB,INFO/CLNDISDBINCL,INFO/CLNHGVS,INFO/CLNREVSTAT,INFO/CLNSIG,INFO/CLNSIGCONF,INFO/CLNSIGINCL,INFO/CLNVC,INFO/CLNVCSO,INFO/CLNVI
- `clinvar_annotation_vcf` files: clinvar_20220507_chr_fixed.vcf.gz, clinvar_20220507_chr_fixed.vcf.gz.tbi
- `dbnsfp` file: dbNSFP4.3a_grch38.gz, dbNSFP4.3a_grch38.gz.tbi, dbNSFP4.3a_grch38.readme.txt
- `dbnsfp_fields` string: CSV string with desired fields to annotate. Use ALL to grab all, default: SIFT4G_pred,Polyphen2_HDIV_pred,Polyphen2_HVAR_pred,LRT_pred,MutationTaster_pred,MutationAssessor_pred,FATHMM_pred,PROVEAN_pred,VEST4_score,VEST4_rankscore,MetaSVM_pred,MetaLR_pred,MetaRNN_pred,M-CAP_pred,REVEL_score,REVEL_rankscore,PrimateAI_pred,DEOGEN2_pred,BayesDel_noAF_pred,ClinPred_pred,LIST-S2_pred,Aloft_pred,fathmm-MKL_coding_pred,fathmm-XF_coding_pred,Eigen-phred_coding,Eigen-PC-phred_coding,phyloP100way_vertebrate,phyloP100way_vertebrate_rankscore,phastCons100way_vertebrate,phastCons100way_vertebrate_rankscore,TWINSUK_AC,TWINSUK_AF,ALSPAC_AC,ALSPAC_AF,UK10K_AC,UK10K_AF,gnomAD_exomes_controls_AC,gnomAD_exomes_controls_AN,gnomAD_exomes_controls_AF,gnomAD_exomes_controls_nhomalt,gnomAD_exomes_controls_POPMAX_AC,gnomAD_exomes_controls_POPMAX_AN,gnomAD_exomes_controls_POPMAX_AF,gnomAD_exomes_controls_POPMAX_nhomalt,Interpro_domain,GTEx_V8_gene,GTEx_V8_tissue
- `cadd_indels` file: CADDv1.6-38-gnomad.genomes.r3.0.indel.tsv.gz, CADDv1.6-38-gnomad.genomes.r3.0.indel.tsv.gz.tbi
- `cadd_snvs` file: CADDv1.6-38-whole_genome_SNVs.tsv.gz, CADDv1.6-38-whole_genome_SNVs.tsv.gz.tbi
- `intervar` file: Exons.all.hg38.intervar.2021-07-31.vcf.gz, Exons.all.hg38.intervar.2021-07-31.vcf.gz.tbi
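The defaults stated for the inputs above can be collected in one place, which makes it easier to see what a given run overrides. The sketch below is illustrative only: the class `AnnotationDefaults` and the helper `with_overrides` are hypothetical names, and only inputs with an explicitly stated scalar default are included.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class AnnotationDefaults:
    """Scalar defaults as stated in the input descriptions."""
    merged: bool = True
    run_cache_existing: bool = True
    run_cache_af: bool = True
    run_stats: bool = False
    disable_normalization: bool = False
    vep_ram: int = 48              # GB
    vep_cores: int = 32
    vep_buffer_size: int = 100000


def with_overrides(**overrides: Any) -> Dict[str, Any]:
    """Return the defaults as a dict with selected values replaced."""
    settings = asdict(AnnotationDefaults())
    unknown = sorted(set(overrides) - set(settings))
    if unknown:
        raise KeyError(f"unknown input(s): {', '.join(unknown)}")
    settings.update(overrides)
    return settings


# Hypothetical example: a small input that needs fewer resources.
small_run = with_overrides(vep_ram=16, vep_cores=8)
```

Rejecting unknown names catches typos such as `vep_core` before a long run is launched.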
The workflow produces the following output:
- `annotated_vcf` file: VCF file with all applied annotations
Currently, VEP not only provides gene model annotation but also allows additional annotations to be added through plugins. For CADD and dbNSFP, existing files were therefore formatted for use with the corresponding VEP plugins. Please see their documentation for information on how the references were generated.