Skip to content

Latest commit

 

History

History
157 lines (122 loc) · 10.5 KB

README.md

File metadata and controls

157 lines (122 loc) · 10.5 KB

The ANNOVAR databases for CHM13v2.0 reference!

😊 Welcome! 😊

This repository describes the ANNOVAR databases💻 for hs1 (CHM13v2.0)🧬 and how to use it. Feel free to leave questions or problems while using this database in the Issues menu.

How to generate ANNOVAR databases
The original files were listed below to match the names of ANNOVAR databases, and the formatted accordingly.
If the original files are based on GRCh38 or references other than CHM13v2.0, the files were lifted over to CHM13 using crossmap. The chain file can be downloaded from the [CHM13 GitHub page](https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/grch38-chm13v2.chain)

[ Genome-based DB ]

hs1_refGene.txt: ANNOVAR homepage
hs1_curGene5.2.txt: CHM13 github - This contains curated annotations of the ampliconic genes on the Y chromosome, correcting annotation errors in GENCODEv35 CAT/Liftoff and RefSeqv110 annotation.

  • If the original file was formatted in GFF, I transformed it to GTF and then used gtfToGenePred to convert it into GenePred format.

  • The gene annotation databases in ANNOVAR website used to come with ${prefix}Mrna.fa. This file was generated using retrieve_seq_from_fasta.pl script.

    retrieve_seq_from_fasta.pl --format refGene --seqfile chm13v2.0.fa hs1_refGene.txt --out hs1_refGeneMrna.fa
    
  • format(hs1_curGene.txt):

       1  XR_002958507.2  chr1    -       6046    13941   13941   13941   4       6046,12077,13444,13679, 6420,12982,13579,13941, 0       LOC124900618    none    none    -1,-1,-1,>
       2  XR_007068557.1_1        chr1    +       15079   21429   21429   21429   2       15079,20565,    15564,21429,    0       LOC124905335_1  none    none    -1,-1,
       3  XM_047436352.1  chr1    -       20528   37628   20949   37628   5       20528,28446,34957,36085,37442,  21087,28626,35059,37081,37628,  0       LOC112268260    incmpl  i>
       4  NR_125957.1     chr1    -       52978   54612   54612   54612   3       52978,53559,54521,      53422,53826,54612,      0       LOC101928626    none    none    -1,-1,-1,
       5  NM_001005221.2  chr1    -       111939  112877  111939  112877  1       111939, 112877, 0       OR4F29  incmpl  incmpl  0,
       6  XR_001743907.1_1        chr1    -       115045  117120  117120  117120  2       115045,116798,  115300,117120,  0       LOC107986552_1  none    none    -1,-1,
    

[ Filter-based DB ]

hs1_dbsnp156.txt: NCBI DBsnp ftp server
hs1_clinvar_20231217.txt: CHM13 github
hs1_${population}.sites.2023_12.txt: CHM13 github
- 1KGP allele frequency recalled on T2T-CHM13v2.0. Now available for all chromosomes, for the entire 3,202 samples or the unrelated 2504 samples. (popultation : ALL, AFR, AMR, EAS, EUR, and SAS)

  • format(hs1_ALL.sites.2023_12.txt) :
    chr1    131     A       C       0.00278552
    chr1    131     A       T       0.00278552
    chr1    878     A       AACCCTAACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTC       0.00019976
    chr1    884     AACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTC A       0.00020136
    

[ Region-based DB ]

hs1_gwas_20231207.txt: GWAS Catalog v1.0-associations_e110_r2023-12-07
hs1_cenSat.txt: CHM13 github - A more comprehensive centromere/satellite repeat annotation.
hs1_nonSyntenic.txt: CHM13 github - Regions non-syntenic (unique) compared to GRCh38.
hs1_hg38_issues.txt: (CHM13v1 publication)[https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/publications/Nurk_2021/fig1/]
hs1_sraccess.txt: CHM13 github - short read accessible regions on CHM13
hs1_sraccess_hg38.txt: CHM13 amazon download server short read accessible regions on GRCh38 reference then, liftovered to CHM13
hs1_sraccess_hs1Only.txt: short read accessible regions only in CHM13, not in GRCh38

  • format(hs1_cenSat.txt) :
    1       chr1    116796047       121405145       ct_1_1(p_arm)   100     .       116796047       121405145       224,224,224
    1       chr1    121405145       121406286       censat_1_1(rnd-6_family-4384)   100     .       121405145       121406286       0,204,204
    1       chr1    121406286       121619169       ct_1_2  100     .       121406286       121619169       224,224,224
    1       chr1    121619169       121625213       hor_1_1(S3C1H2-A,B,C)   100     .       121619169       121625213       255,146,0
    1       chr1    121625213       121667941       hor_1_2(S3C1H2-A,B)     100     .       121625213       121667941       255,146,0
    1       chr1    121667941       121788213       hor_1_3(S3C1H2-B)       100     .       121667941       121788213       255,146,0
    1       chr1    121788213       121790362       ct_1_3  100     .       121788213       121790362       224,224,224
    

Usage

1. Downloading the Database

Pre-built databases and an example file are available to download from here, along with original resources we used for generating the databases and operation details for each database on T2T-CHM13 Databases and operation.

The directory structure should be like this:

- hs1
  - hs1_${Database_name_1}.txt
  - hs1_${Database_name_2}.txt
  - hs1_${Database_name_3}.txt
  ...

After downloading the databases, unzip each database into a .txt file:

bgzip -d filename.txt.gz

If you intend to annotate variants using the gene annotation databases such as hs1_curGene.txt.gz and hs1_refGene.txt.gz, please download ${Database_name}Mrna.fa.gz file as well.

There are index file(.txt.idx) for filter-based databases, such as allele frequency(hs1_ALL.sites.2023_12.txt) and dbSNP(hs1_snp156.txt). It is not a mandatory file, but it would be helpful to reduce computing time during variant annotation.

2. Make your data compatible with annovar

Make your target VCF file compatible with ANNOVAR before annotating with this database. Use the convert2annovar.pl script to convert your VCF file to ANNOVAR input format:

perl convert2annovar.pl -format vcf4 ${prefix}.vcf > ${prefix}.avinput

Adjust the parameters based on your specific requirements. If you would like to learn more of the detailed command line usage, please visit the ANNOVAR website.

An example file(example.chr22.inp) used in this description is available here. This file has been converted from VCF to Annovar-compatible format and contains only the data from chromosome 22.

3. Annotate Variants

Annotate your variants using ANNOVAR with the following command:

annovarDB="/path/to/annovar/hs1" # The folder where include hs1/ folder
build="hs1"
outPrefix="OUT" # the prefix you want. The ouput name would be ${OUT}.hs1_multianno.txt

table_annovar.pl \
--thread 10 \ 
example.chr22.inp \ # Input file from step 2
$annovarDB/hs1 \ # annovar database direcoty
-buildver $build \ # hs1 (database prefix)
-out ${outPrefix} \ # output prefix (example.chr22.out for the example output)
-remove \
-protocol refGene,curGene5.2,dbsnp156,1000g2023dec_all,1000g2023dec_afr,1000g2023dec_amr,1000g2023dec_eas,1000g2023dec_eur,1000g2023dec_sas,clinvar_20231217,gwas_20231207,nonSyntenic,hg38_issues,feat,cenSat,sraccess,sraccess_hg38,sraccess_hs1Only \
-operation g,g,f,f,f,f,f,f,f,f,r,r,r,r,r,r,r,r \
-arg ',,,,,,,,,,,,,,,,,'

Keep in mind that operations such as g, f, or r needs to match each database. You can find the correct operation for each annotation source on T2T-CHM13 Databases and operation.

Variant Effect Predictor (VEP)

There are numerous plugins available for VEP, such as AlphaMissense, CADD, SpliceVault, and others. You can find them on the VEP plugin homepage. Here, we introduce how to run VEP using a VCF and GFF file based on the CHM13 reference. You can download the reference genome and gene annotation in GFF format from the CHM13 GitHub repository. The plugins are optional, so you can use others as needed; this is just an example.

Generating and indexing the GFF file for VEP

ori_gff=chm13v2.0_RefSeq_Liftoff_v5.1.gff3
grep -v "#" $ori_gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > chm13v2.0_RefSeq_Liftoff_v5.1.gff_forVEP.gz
tabix -p gff chm13v2.0_RefSeq_Liftoff_v5.1.gff_forVEP.gz

Running VEP

vcf=test.vcf.gz 
ref=chm13v2.0.fa # whatever reference you've used for aligning.
gff=chm13v2.0_RefSeq_Liftoff_v5.2.gff_forVEP.gz
VEP_CACHEDIR=./vepCash

mkdir -p $VEP_CACHEDIR

vep --fork 50 \
 -i ${vcf} \
 --gff ${gff} \
 -o ${vcf}.vep \
 --force_overwrite \
 --dir ${VEP_CACHEDIR} \
 --fasta ${ref} \
 --everything