to v.1.2.0 (2024/11/13)
- The PDIVAS subcommand vcf2tsv can now handle and output sample columns in VCF files.
- SpliceAI annotation file (grch38.txt) was updated to GENCODE V47.
- Fixed bugs in PDIVAS exceptional output ('wo_annots' and 'out_of_scope').
- PDIVAS is a pathogenicity predictor for deep-intronic variants causing aberrant splicing.
- Deep-intronic variants can create pathogenic pseudoexons or extended exons, which disturb normal gene expression and can cause Mendelian diseases.
- PDIVAS efficiently prioritizes the causal candidates from a vast number of deep-intronic variants detected by whole-genome sequencing.
- The scope of PDIVAS prediction is variants in protein-coding genes on autosomes and the X chromosome.
- This command-line interface is compatible with variant files in VCF format.
PDIVAS is a random forest model that classifies variants as pathogenic or benign using features from:
- Splicing predictors: SpliceAI (Jaganathan et al., Cell 2019) and MaxEntScan (Yeo and Burge, J. Comput. Biol. 2004).
  (*) The output module of SpliceAI was customized for PDIVAS features (see Option 2 for details).
- Human splicing constraint score: ConSplice (Cormier et al., BMC Bioinformatics 2022).
Kurosawa et al. BMC Genomics 2023
[email protected] (Ryo Kurosawa at Kyoto University)
For a quick start with PDIVAS, please use the score-precomputed file here. Possible rare SNVs and short indels (1-4 nt) in Mendelian disease genes (n=4,512) are comprehensively annotated in the file. To annotate your VCF file, run, for example, the commands below.
conda install -c bioconda vcfanno
git clone https://github.com/brentp/vcfanno.git
(Download the score-precomputed file above and create a configuration file, following https://github.com/brentp/vcfanno)
vi ./conf.toml
Write as below
[[annotation]]
file="./PDIVAS_precomputed/GRCh38/PDIVAS_precomputed_short_GRCh38.vcf.gz"
# ID and FILTER are special fields that pull the ID and FILTER columns from the VCF
fields = ["PDIVAS"]
ops=["self"]
names=["PDIVAS"]
# Move to your working directory. (The case below is the directory in this repository.)
cd examples
# Perform annotation
vcfanno -lua ./vcfanno/example/custom.lua ./conf.toml ./ex.vcf > output_precomp.vcf
# Compare output_precomp.vcf with output_precomp_expect.vcf.gz to confirm that the annotation succeeded.
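One way to make that comparison is sketched below in Python. It assumes that only the variant records need to match, since header lines (for example, a recorded command line) can legitimately differ between runs. The function names `vcf_records` and `same_records` are illustrative and are not part of PDIVAS or vcfanno.

```python
import gzip

def vcf_records(path):
    """Return the non-header lines of a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Header lines start with '#'; everything else is a variant record.
        return [line.rstrip("\n") for line in handle if not line.startswith("#")]

def same_records(path_a, path_b):
    """True when both files hold identical variant records in the same order."""
    return vcf_records(path_a) == vcf_records(path_b)
```

For the example above, the call would be `same_records("output_precomp.vcf", "output_precomp_expect.vcf.gz")`.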
For more comprehensive annotation than the precomputed file provides, run PDIVAS as described below.
# We recommend creating new conda environments for the PDIVAS installation.
# Solving the environments can take a while.
conda create -n PDIVAS -c bioconda -c conda-forge spliceai tensorflow==2.6.2 pdivas bcftools vcfanno
conda create -n VEP -c conda-forge -c bioconda perl==5.26.2 ensembl-vep==105
Successful installation was verified with anaconda version 23.3.1.
-For output-customized SpliceAI for PDIVAS conda environment
https://github.com/shiro-kur/SpliceAI
git clone https://github.com/shiro-kur/SpliceAI.git
cp -r SpliceAI/spliceai/* ~/miniconda3/envs/PDIVAS/lib/python3.9/site-packages/spliceai/
-For VEP custom usage
# Download VEP cache files
$ mkdir -p ~/Ref/.vep
$ cd ~/Ref/.vep
$ wget https://ftp.ensembl.org/pub/release-113/variation/vep/homo_sapiens_vep_113_GRCh38.tar.gz
$ tar xzf homo_sapiens_vep_113_GRCh38.tar.gz
#Setting MaxEntScan
$ mkdir -p ~/Ref/.vep/Plugin/MaxEntScan
$ cd ~/Ref/.vep/Plugin/MaxEntScan
$ wget http://hollywood.mit.edu/burgelab/maxent/download/fordownload.tar.gz
$ tar xzf fordownload.tar.gz
#Setting ConSplice
$ cd ~/Ref/.vep
$ wget https://storage.cloud.google.com/pdivas/ConSplice_for_PDIVAS/ConSplice.50bp_region.inverse_proportion_refo_hg38.bed.gz
$ tabix -f ConSplice.50bp_region.inverse_proportion_refo_hg38.bed.gz
The ConSplice file was derived from the original score file of Cormier et al. (BMC Bioinformatics 2022).
conda activate PDIVAS
bcftools norm -m - multi.vcf > bi.vcf
conda activate VEP
vep \
--cache --offline --cache_version 107 --assembly GRCh38 --hgvs --pick_allele_gene \
--fasta ./references/hg38.fa.gz --vcf --force \
--custom ./references/ConSplice.50bp_region.inverse_proportion_refo_hg38.bed.gz,ConSplice,bed,overlap,0 \
--plugin MaxEntScan,./references/MaxEntScan/fordownload,SWA,NCSS \
--fields "Consequence,SYMBOL,Gene,INTRON,HGVSc,STRAND,ConSplice,MES-SWA_acceptor_diff,MES-SWA_acceptor_alt,MES-SWA_donor_diff,MES-SWA_donor_alt" \
--compress_output bgzip \
-i ./examples/ex.vcf.gz -o ./examples/ex_vep.vcf.gz
conda activate PDIVAS
spliceai -I examples/ex_vep.vcf.gz -O examples/ex_vep_AI.vcf -R hg38.fa -A grch38 -D 300 -M 1
pdivas predict -I examples/ex_vep_AI.vcf -O examples/ex_vep_AI_PD.vcf.gz -F off
pdivas vcf2tsv -I examples/ex_vep_AI_PD.vcf.gz -O examples/ex_vep_AI_PD.tsv
Required parameters:
- -I: Input VCF (.vcf/.vcf.gz) with variants of interest.
- -O: Output VCF (.vcf/.vcf.gz) with PDIVAS predictions in the format GENE_ID|PDIVAS_score. Variants in multiple genes have separate predictions for each gene.
Optional parameters:
- -F: Filtering function (off/on). Output all variants (-F off; default) or only deep-intronic variants with PDIVAS scores (-F on).
Details of PDIVAS INFO field:
ID | Description |
---|---|
GENE_ID | Ensembl gene ID based on GENCODE V41(GRCh38) or V19(GRCh37) |
PDIVAS | Predicted result: Pattern 1: float value from 0.000 to 1.000 (the higher, the more deleterious). Exceptions (output with '-F off', filtered out with '-F on'): Pattern 2: 'wo_annots', variants lacking VEP or SpliceAI annotations; Pattern 3: 'out_of_scope', variants outside the PDIVAS annotation scope (chrY, non-coding genes, or non-deep-intronic variants); Pattern 4: 'no_gene_match', variants whose gene annotations do not match between VEP and SpliceAI. |
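
The four patterns can be told apart programmatically. The following is a minimal Python sketch, not PDIVAS code. It assumes that each entry has the form GENE_ID|value, that predictions for several genes are comma-separated, and that an exception label may appear without a gene ID; none of these details is confirmed by this document. The names `parse_pdivas` and `max_score` are hypothetical.

```python
# Exception labels listed in the table above (Patterns 2-4).
EXCEPTION_LABELS = {"wo_annots", "out_of_scope", "no_gene_match"}

def parse_pdivas(value):
    """Split a PDIVAS INFO value into (gene_id, score_or_label) pairs.

    Assumption: entries look like 'GENE_ID|value' and entries for
    several genes are separated by commas.
    """
    results = []
    for entry in value.split(","):
        gene_id, sep, raw = entry.partition("|")
        if not sep:
            # No '|' found: treat the whole entry as the value.
            gene_id, raw = None, entry
        if raw in EXCEPTION_LABELS:
            results.append((gene_id, raw))         # Patterns 2-4: keep the label
        else:
            results.append((gene_id, float(raw)))  # Pattern 1: numeric score
    return results

def max_score(value):
    """Highest numeric score in the field, or None if there is none."""
    scores = [s for _, s in parse_pdivas(value) if isinstance(s, float)]
    return max(scores) if scores else None
```

Taking the maximum across genes is one possible way to rank a variant that overlaps several genes; other summaries may suit a given analysis better.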
Required parameters:
- -I: *Input VCF (.vcf/.vcf.gz) with VEP, SpliceAI, and PDIVAS annotations.
- -O: Path and file name of the output TSV file.

*The input VCF is valid only when it was generated through this pipeline.
More details are given in Kurosawa et al., medRxiv 2023.
Threshold | Sensitivity (*1) | candidates/individual (*2) |
---|---|---|
>=0.082 | 95% | 26.8 |
>=0.151 | 90% | 14.5 |
>=0.340 | 85% | 6.7 |
>=0.501 | 80% | 4.1 |
>=0.575 | 75% | 3.0 |
>=0.763 | 70% | 1.2 |
(*1) Sensitivities were calculated on curated pathogenic deep-intronic variants in a test dataset.
(*2) Candidates of pathogenic deep-intronic variants were obtained through the process described below. (WGS: Whole-genome sequencing)
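
Separately from the candidate-selection process referred to in (*2), the tabulated thresholds can be applied to a set of scores as in the Python sketch below. The threshold and sensitivity pairs are copied from the table; the function names and the dictionary of variant scores are hypothetical and not part of PDIVAS.

```python
# (threshold, sensitivity) pairs copied from the table above.
THRESHOLDS = [
    (0.082, 0.95), (0.151, 0.90), (0.340, 0.85),
    (0.501, 0.80), (0.575, 0.75), (0.763, 0.70),
]

def threshold_for(min_sensitivity):
    """Strictest tabulated threshold whose sensitivity is >= min_sensitivity."""
    eligible = [t for t, s in THRESHOLDS if s >= min_sensitivity]
    if not eligible:
        raise ValueError("no tabulated threshold reaches that sensitivity")
    return max(eligible)

def prioritize(scores, min_sensitivity):
    """Keep variants (id -> score) at or above the cutoff, highest score first."""
    cutoff = threshold_for(min_sensitivity)
    kept = {variant: s for variant, s in scores.items() if s >= cutoff}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)
```

Choosing a lower threshold keeps more true pathogenic variants at the cost of more candidates per individual, which is the trade-off the table quantifies.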