😊 Welcome! 😊
This repository describes the ANNOVAR databases💻 for hs1 (CHM13v2.0)🧬 and how to use it. Feel free to leave questions or problems while using this database in the Issues menu.
How to generate ANNOVAR databases
The original files were listed below to match the names of ANNOVAR databases, and the formatted accordingly.
If the original files are based on GRCh38 or references other than CHM13v2.0, the files were lifted over to CHM13 using crossmap. The chain file can be downloaded from the [CHM13 GitHub page](https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/grch38-chm13v2.chain)
hs1_refGene.txt
: ANNOVAR homepage
hs1_curGene5.2.txt
: CHM13 github - This contains curated annotations of the ampliconic genes on the Y chromosome, correcting annotation errors in GENCODEv35 CAT/Liftoff and RefSeqv110 annotation.
-
If the original file was formatted in GFF, I transformed it to GTF and then used gtfToGenePred to convert it into GenePred format.
-
The gene annotation databases in ANNOVAR website used to come with ${prefix}Mrna.fa. This file was generated using
retrieve_seq_from_fasta.pl
script.retrieve_seq_from_fasta.pl --format refGene --seqfile chm13v2.0.fa hs1_refGene.txt --out hs1_refGeneMrna.fa
-
format(hs1_curGene.txt):
1 XR_002958507.2 chr1 - 6046 13941 13941 13941 4 6046,12077,13444,13679, 6420,12982,13579,13941, 0 LOC124900618 none none -1,-1,-1,> 2 XR_007068557.1_1 chr1 + 15079 21429 21429 21429 2 15079,20565, 15564,21429, 0 LOC124905335_1 none none -1,-1, 3 XM_047436352.1 chr1 - 20528 37628 20949 37628 5 20528,28446,34957,36085,37442, 21087,28626,35059,37081,37628, 0 LOC112268260 incmpl i> 4 NR_125957.1 chr1 - 52978 54612 54612 54612 3 52978,53559,54521, 53422,53826,54612, 0 LOC101928626 none none -1,-1,-1, 5 NM_001005221.2 chr1 - 111939 112877 111939 112877 1 111939, 112877, 0 OR4F29 incmpl incmpl 0, 6 XR_001743907.1_1 chr1 - 115045 117120 117120 117120 2 115045,116798, 115300,117120, 0 LOC107986552_1 none none -1,-1,
hs1_dbsnp156.txt
: NCBI DBsnp ftp server
hs1_clinvar_20231217.txt
: CHM13 github
hs1_${population}.sites.2023_12.txt
: CHM13 github
- 1KGP allele frequency recalled on T2T-CHM13v2.0. Now available for all chromosomes, for the entire 3,202 samples or the unrelated 2504 samples. (popultation : ALL, AFR, AMR, EAS, EUR, and SAS)
- format(hs1_ALL.sites.2023_12.txt) :
chr1 131 A C 0.00278552 chr1 131 A T 0.00278552 chr1 878 A AACCCTAACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTC 0.00019976 chr1 884 AACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTCACCCTC A 0.00020136
hs1_gwas_20231207.txt
: GWAS Catalog v1.0-associations_e110_r2023-12-07
hs1_cenSat.txt
: CHM13 github - A more comprehensive centromere/satellite repeat annotation.
hs1_nonSyntenic.txt
: CHM13 github - Regions non-syntenic (unique) compared to GRCh38.
hs1_hg38_issues.txt
: (CHM13v1 publication)[https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/publications/Nurk_2021/fig1/]
hs1_sraccess.txt
: CHM13 github - short read accessible regions on CHM13
hs1_sraccess_hg38.txt
: CHM13 amazon download server short read accessible regions on GRCh38 reference then, liftovered to CHM13
hs1_sraccess_hs1Only.txt
: short read accessible regions only in CHM13, not in GRCh38
- format(hs1_cenSat.txt) :
1 chr1 116796047 121405145 ct_1_1(p_arm) 100 . 116796047 121405145 224,224,224 1 chr1 121405145 121406286 censat_1_1(rnd-6_family-4384) 100 . 121405145 121406286 0,204,204 1 chr1 121406286 121619169 ct_1_2 100 . 121406286 121619169 224,224,224 1 chr1 121619169 121625213 hor_1_1(S3C1H2-A,B,C) 100 . 121619169 121625213 255,146,0 1 chr1 121625213 121667941 hor_1_2(S3C1H2-A,B) 100 . 121625213 121667941 255,146,0 1 chr1 121667941 121788213 hor_1_3(S3C1H2-B) 100 . 121667941 121788213 255,146,0 1 chr1 121788213 121790362 ct_1_3 100 . 121788213 121790362 224,224,224
Pre-built databases and an example file are available to download from here, along with original resources we used for generating the databases and operation details for each database on T2T-CHM13 Databases and operation.
The directory structure should be like this:
- hs1
- hs1_${Database_name_1}.txt
- hs1_${Database_name_2}.txt
- hs1_${Database_name_3}.txt
...
After downloading the databases, unzip each database into a .txt file:
bgzip -d filename.txt.gz
If you intend to annotate variants using the gene annotation databases such as hs1_curGene.txt.gz
and hs1_refGene.txt.gz
, please download ${Database_name}Mrna.fa.gz
file as well.
There are index file(.txt.idx
) for filter-based databases, such as allele frequency(hs1_ALL.sites.2023_12.txt
) and dbSNP(hs1_snp156.txt
). It is not a mandatory file, but it would be helpful to reduce computing time during variant annotation.
Make your target VCF file compatible with ANNOVAR before annotating with this database. Use the convert2annovar.pl
script to convert your VCF file to ANNOVAR input format:
perl convert2annovar.pl -format vcf4 ${prefix}.vcf > ${prefix}.avinput
Adjust the parameters based on your specific requirements. If you would like to learn more of the detailed command line usage, please visit the ANNOVAR website.
An example file(example.chr22.inp
) used in this description is available here. This file has been converted from VCF to Annovar-compatible format and contains only the data from chromosome 22.
Annotate your variants using ANNOVAR with the following command:
annovarDB="/path/to/annovar/hs1" # The folder where include hs1/ folder
build="hs1"
outPrefix="OUT" # the prefix you want. The ouput name would be ${OUT}.hs1_multianno.txt
table_annovar.pl \
--thread 10 \
example.chr22.inp \ # Input file from step 2
$annovarDB/hs1 \ # annovar database direcoty
-buildver $build \ # hs1 (database prefix)
-out ${outPrefix} \ # output prefix (example.chr22.out for the example output)
-remove \
-protocol refGene,curGene5.2,dbsnp156,1000g2023dec_all,1000g2023dec_afr,1000g2023dec_amr,1000g2023dec_eas,1000g2023dec_eur,1000g2023dec_sas,clinvar_20231217,gwas_20231207,nonSyntenic,hg38_issues,feat,cenSat,sraccess,sraccess_hg38,sraccess_hs1Only \
-operation g,g,f,f,f,f,f,f,f,f,r,r,r,r,r,r,r,r \
-arg ',,,,,,,,,,,,,,,,,'
Keep in mind that operations such as g
, f
, or r
needs to match each database. You can find the correct operation for each annotation source on T2T-CHM13 Databases and operation.
There are numerous plugins available for VEP, such as AlphaMissense, CADD, SpliceVault, and others. You can find them on the VEP plugin homepage. Here, we introduce how to run VEP using a VCF and GFF file based on the CHM13 reference. You can download the reference genome and gene annotation in GFF format from the CHM13 GitHub repository. The plugins are optional, so you can use others as needed; this is just an example.
ori_gff=chm13v2.0_RefSeq_Liftoff_v5.1.gff3
grep -v "#" $ori_gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > chm13v2.0_RefSeq_Liftoff_v5.1.gff_forVEP.gz
tabix -p gff chm13v2.0_RefSeq_Liftoff_v5.1.gff_forVEP.gz
vcf=test.vcf.gz
ref=chm13v2.0.fa # whatever reference you've used for aligning.
gff=chm13v2.0_RefSeq_Liftoff_v5.2.gff_forVEP.gz
VEP_CACHEDIR=./vepCash
mkdir -p $VEP_CACHEDIR
vep --fork 50 \
-i ${vcf} \
--gff ${gff} \
-o ${vcf}.vep \
--force_overwrite \
--dir ${VEP_CACHEDIR} \
--fasta ${ref} \
--everything