This documentation outlines how we created our gnomAD annotation reference. While the gnomAD source itself is very large and comprehensive, this reference is meant to add the "bare minimum" as a useful start for variant filtering out-of-the-box with a called VCF. The main steps for reference creation are:
- Download and pipe to bcftools and vt to grab desired fields and to normalize calls (see https://genome.sph.umich.edu/wiki/Vt#Normalization)
- Use pysam to add custom fields, including an INFO field for the FILTER value from the source file, a popmax AF for non-cancer populations with no bottlenecks (as defined here), and a popmax with bottleneck populations
- Create an echtvar reference for blazing fast annotation of VCF files with this custom reference
Used docker image pgc-images.sbgenomics.com/d3b-bixu/vcfutils:latest, installed curl.
Desired fields to subset from gnomAD are as follows:
AC
AN
AF
nhomalt
AC_popmax
AN_popmax
AF_popmax
nhomalt_popmax
AC_controls_and_biobanks
AN_controls_and_biobanks
AF_controls_and_biobanks
AF_non_cancer
primate_ai_score
splice_ai_consequence
AF_non_cancer_afr
AF_non_cancer_ami
AF_non_cancer_asj
AF_non_cancer_eas
AF_non_cancer_fin
AF_non_cancer_mid
AF_non_cancer_nfe
AF_non_cancer_oth
AF_non_cancer_raw
AF_non_cancer_sas
AF_non_cancer_amr
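As a rough illustration, the field list above could be turned into a bcftools annotate keep-list, where a leading ^ on the -x argument means "remove everything except these tags". This is only a sketch: the helper name bcftools_keep_expression is hypothetical, and scripts/dl_subset_gnomad.sh may build its subset differently.

```python
# Sketch only: builds an "-x" keep-list for `bcftools annotate`.
# The helper name is hypothetical and not taken from the repo scripts.
FIELDS = [
    "AC", "AN", "AF", "nhomalt",
    "AC_popmax", "AN_popmax", "AF_popmax", "nhomalt_popmax",
    "AC_controls_and_biobanks", "AN_controls_and_biobanks",
    "AF_controls_and_biobanks", "AF_non_cancer",
    "primate_ai_score", "splice_ai_consequence",
    "AF_non_cancer_afr", "AF_non_cancer_ami", "AF_non_cancer_asj",
    "AF_non_cancer_eas", "AF_non_cancer_fin", "AF_non_cancer_mid",
    "AF_non_cancer_nfe", "AF_non_cancer_oth", "AF_non_cancer_raw",
    "AF_non_cancer_sas", "AF_non_cancer_amr",
]


def bcftools_keep_expression(fields):
    """Return a '^INFO/A,INFO/B,...' string: drop every tag except these."""
    return "^" + ",".join(f"INFO/{name}" for name in fields)


keep_expr = bcftools_keep_expression(FIELDS)
print(keep_expr)
```

The printed string would be passed as the argument to bcftools annotate -x.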
Example command: created a list of chromosomes (chr1-22, X, Y) and used xargs to run scripts/dl_subset_gnomad.sh on each entry, 12 at a time:
cat chr_list | xargs -IFN -P 12 scripts/dl_subset_gnomad.sh FN
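The chromosome list fed to xargs can be generated in a few lines. This is a minimal sketch that assumes one name per line with a chr prefix (chr1 through chr22, then chrX and chrY), matching how FN is substituted into the gnomAD file names later in this document; the function name is hypothetical.

```python
# Sketch only: prints one chromosome name per line for use with xargs.
# Assumes "chr"-prefixed names; the function name is hypothetical.
def chromosome_names():
    """Return chr1..chr22 followed by chrX and chrY."""
    return [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]


print("\n".join(chromosome_names()))
```

Redirecting the output to a file gives the list that the xargs command reads.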
Use scripts/custom_vcf_info.py to add new custom fields, calculated as follows:
- GNOMAD_FILTER = FILTER value
- AF_non_cancer_popmax: max value of the following, when available: AF_non_cancer_afr, AF_non_cancer_amr, AF_non_cancer_eas, AF_non_cancer_nfe, AF_non_cancer_sas
- AF_non_cancer_all_popmax: get the max value of the bottleneck populations (AF_non_cancer_ami, AF_non_cancer_asj, AF_non_cancer_fin, AF_non_cancer_mid, AF_non_cancer_oth) and set to the greater of the calculated AF_non_cancer_popmax or the max of the bottleneck populations
Example command: Installed python package pysam==0.22.0, used the same chr list and ran:
cat chr_list.txt | xargs -IFN -P 8 python3 custom_vcf_info.py --input_vcf gnomad.genomes.v3.1.1.sites.FN.bcftools_INFO_subset.vt_norm.vcf.gz --output_basename gnomad.genomes.v3.1.1.sites.FN.custom --threads 2
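The two popmax calculations described above can be sketched in plain Python. This is not the code in custom_vcf_info.py. It assumes the INFO values for one record have already been read into a dict of floats, with missing populations either absent or set to None, and the function names are hypothetical.

```python
# Sketch only: popmax logic on a plain dict of INFO values for one record.
# Function names are hypothetical; this is not the repo's pysam script.
MAIN_POPS = ("afr", "amr", "eas", "nfe", "sas")
BOTTLENECK_POPS = ("ami", "asj", "fin", "mid", "oth")


def max_af(info, pops):
    """Max of AF_non_cancer_<pop> over the populations that have a value."""
    values = [
        info[f"AF_non_cancer_{pop}"]
        for pop in pops
        if info.get(f"AF_non_cancer_{pop}") is not None
    ]
    return max(values) if values else None


def custom_popmax(info):
    """Return both custom popmax fields for one record."""
    popmax = max_af(info, MAIN_POPS)
    bottleneck_max = max_af(info, BOTTLENECK_POPS)
    # all_popmax is the greater of the two, ignoring whichever is missing.
    candidates = [v for v in (popmax, bottleneck_max) if v is not None]
    return {
        "AF_non_cancer_popmax": popmax,
        "AF_non_cancer_all_popmax": max(candidates) if candidates else None,
    }


record = {"AF_non_cancer_afr": 0.01, "AF_non_cancer_nfe": 0.002, "AF_non_cancer_fin": 0.05}
print(custom_popmax(record))
```

In the example record, the Finnish frequency exceeds every non-bottleneck population, so the two fields differ: AF_non_cancer_popmax takes the afr value and AF_non_cancer_all_popmax takes the fin value.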
Used docker image pgc-images.sbgenomics.com/d3b-bixu/echtvar:0.1.9
First, create a config JSON file. See here for generic details. This config was used. This config will prepend gnomad_3_1_1_ to all field names for source clarity upon annotation.
echtvar \
encode \
gnomad.v3.1.1.custom.echtvar.zip \
gnomad_update.json \
*.custom.INFO_added.vcf.gz
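A config with the gnomad_3_1_1_ prefix could be generated rather than written by hand. The sketch below assumes the echtvar config is a list of objects with field and alias keys. It uses a short illustrative subset of fields, not the full config referenced above, and omits any optional per-field settings that the real config may include. The function name is hypothetical.

```python
# Sketch only: emits a minimal echtvar-style config with prefixed aliases.
# The field subset is illustrative; the real config may set more keys.
import json

PREFIX = "gnomad_3_1_1_"


def build_echtvar_config(fields, prefix=PREFIX):
    """Map each INFO field to an alias carrying the source prefix."""
    return [{"field": name, "alias": prefix + name} for name in fields]


example_fields = [
    "AF",
    "AF_popmax",
    "GNOMAD_FILTER",
    "AF_non_cancer_popmax",
    "AF_non_cancer_all_popmax",
]
config_text = json.dumps(build_echtvar_config(example_fields), indent=2)
print(config_text)
```

The printed JSON would be saved to a file and passed to echtvar encode in the position where gnomad_update.json appears in the command.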