Pathogenic Predictor of Deep-Intronic Variants causing Aberrant Splicing

shiro-kur/PDIVAS

PDIVAS : Pathogenicity Predictor for Deep-Intronic Variants causing Aberrant Splicing

License: MIT

PDIVAS image

UPDATE info

to v.1.2.0 (2024/11/13)

  • The PDIVAS subcommand vcf2tsv can now handle and output the sample columns of VCF files.
  • The SpliceAI annotation file (grch38.txt) was updated to GENCODE V47.
  • Fixed bugs in the PDIVAS exceptional outputs ('wo_annots' and 'out_of_scope').

Summary

  • PDIVAS is a pathogenicity predictor for deep-intronic variants causing aberrant splicing.
  • Deep-intronic variants can create pathogenic pseudoexons or extend existing exons, which disrupts normal gene expression and can cause Mendelian diseases.
  • PDIVAS efficiently prioritizes the causal candidates from a vast number of deep-intronic variants detected by whole-genome sequencing.
  • The scope of PDIVAS prediction is variants in protein-coding genes on autosomes and the X chromosome.
  • This command-line interface is compatible with variant files in VCF format.

PDIVAS is a random forest model that classifies variants as pathogenic or benign using features from:

  1. The splicing predictors SpliceAI (Jaganathan et al., Cell 2019) and MaxEntScan (Yeo and Burge, J. Comput. Biol. 2004)
    (*) The output module of SpliceAI was customized for PDIVAS features (see Option2 for details).

  2. The human splicing constraint score ConSplice (Cormier et al., BMC Bioinformatics 2022).

Reference & contact

Kurosawa et al. BMC Genomics 2023
[email protected] (Ryo Kurosawa at Kyoto University)

<Option1>
Prediction with the PDIVAS-precomputed files (SNVs + short indels (1-4 nt))

For a quick start with PDIVAS, use the score-precomputed file here. Possible rare SNVs and short indels (1-4 nt) in Mendelian disease genes (n=4,512) are comprehensively annotated in the file. To annotate your VCF file, run commands such as the ones below.

0. Installation

conda install -c bioconda vcfanno
git clone https://github.com/brentp/vcfanno.git

1. Setting score-precomputed files

(Download the score-precomputed file above and create a configuration file, following https://github.com/brentp/vcfanno.)

vi ./conf.toml

Write the following:

[[annotation]]
file="./PDIVAS_precomputed/GRCh38/PDIVAS_precomputed_short_GRCh38.vcf.gz"
# ID and FILTER are special fields that pull the ID and FILTER columns from the VCF
fields = ["PDIVAS"]
ops=["self"]
names=["PDIVAS"]

2. Perform PDIVAS annotation

# Move to your working directory. (The case below is the directory in this repository.)
cd examples

# Perform annotation
vcfanno -lua ./vcfanno/example/custom.lua ./conf.toml ./ex.vcf > output_precomp.vcf
# Compare output_precomp.vcf with output_precomp_expect.vcf.gz to confirm that the annotation succeeded.
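
The comparison step can be sketched in Python as below. This is a minimal sketch, not part of PDIVAS: the helper names `read_vcf_records` and `same_records` are hypothetical, and it assumes that header lines (those starting with `#`) may legitimately differ between runs, so only the variant records are compared.

```python
import gzip

def read_vcf_records(path):
    """Return the non-header lines of a plain or gzip-compressed VCF file."""
    # Assumption: compressed files are recognised by their ".gz" suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle if not line.startswith("#")]

def same_records(path_a, path_b):
    """True when both files hold identical variant records in the same order."""
    return read_vcf_records(path_a) == read_vcf_records(path_b)
```

For the example in this repository, the call would be `same_records("output_precomp.vcf", "output_precomp_expect.vcf.gz")`.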

<Option2>
Perform annotation of individual features and calculation of PDIVAS scores

For annotation more comprehensive than the precomputed files provide, run PDIVAS as described below.

0-1. Installation

# Creating new conda environments for the PDIVAS installation is recommended.
# Solving the environments may take some time.
conda create -n PDIVAS -c bioconda -c conda-forge spliceai tensorflow==2.6.2 pdivas bcftools vcfanno
conda create -n VEP -c conda-forge -c bioconda perl==5.26.2 ensembl-vep==105

Successful installation was verified with anaconda version 23.3.1.

0-2. Setting up custom usage

-For the output-customized SpliceAI in the PDIVAS conda environment
https://github.com/shiro-kur/SpliceAI

git clone https://github.com/shiro-kur/SpliceAI.git
cp -r SpliceAI/spliceai/* ~/miniconda3/envs/PDIVAS/lib/python3.9/site-packages/spliceai/

-For VEP custom usage

# Download VEP cache files
mkdir -p ~/Ref/.vep
cd ~/Ref/.vep
wget https://ftp.ensembl.org/pub/release-113/variation/vep/homo_sapiens_vep_113_GRCh38.tar.gz
tar xzf homo_sapiens_vep_113_GRCh38.tar.gz

# Set up MaxEntScan
mkdir -p ~/Ref/.vep/Plugin/MaxEntScan
cd ~/Ref/.vep/Plugin/MaxEntScan
wget http://hollywood.mit.edu/burgelab/maxent/download/fordownload.tar.gz
tar xzf fordownload.tar.gz

# Set up ConSplice
cd ~/Ref/.vep
wget https://storage.cloud.google.com/pdivas/ConSplice_for_PDIVAS/ConSplice.50bp_region.inverse_proportion_refo_hg38.bed.gz
tabix -f ConSplice.50bp_region.inverse_proportion_refo_hg38.bed.gz

The ConSplice file was derived from the original score file of Cormier et al. (BMC Bioinformatics 2022).

1. Preprocess the VCF file (split multi-allelic sites into biallelic sites)

conda activate PDIVAS
bcftools norm -m - multi.vcf > bi.vcf
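
To illustrate what splitting a multi-allelic site means, here is a minimal Python sketch. It is not how bcftools is implemented: the function name `split_multiallelic` is hypothetical, and the sketch only rewrites the ALT column, whereas `bcftools norm` also adjusts per-allele INFO and genotype fields.

```python
def split_multiallelic(record):
    """Split one tab-separated VCF record into one record per ALT allele.

    Illustration only: INFO and sample columns are copied unchanged.
    """
    fields = record.rstrip("\n").split("\t")
    alts = fields[4].split(",")  # ALT is the fifth VCF column
    return ["\t".join(fields[:4] + [alt] + fields[5:]) for alt in alts]
```

A record with `ALT` equal to `G,T` therefore becomes two records, one with `G` and one with `T`, so that each downstream annotation refers to exactly one alternative allele.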

2. Add gene annotations, MaxEntScan scores, and ConSplice scores with VEP.

conda activate VEP
vep \
--cache --offline --cache_version 107 --assembly GRCh38 --hgvs --pick_allele_gene \
--fasta ./references/hg38.fa.gz --vcf --force \
--custom ./references/ConSplice.50bp_region.inverse_proportion_refo_hg38.bed.gz,ConSplice,bed,overlap,0 \
--plugin MaxEntScan,./references/MaxEntScan/fordownload,SWA,NCSS \
--fields "Consequence,SYMBOL,Gene,INTRON,HGVSc,STRAND,ConSplice,MES-SWA_acceptor_diff,MES-SWA_acceptor_alt,MES-SWA_donor_diff,MES-SWA_donor_alt" \
--compress_output bgzip \
-i ./examples/ex.vcf.gz -o ./examples/ex_vep.vcf.gz

3. Add output-customized SpliceAI scores

conda activate PDIVAS
spliceai -I examples/ex_vep.vcf.gz -O examples/ex_vep_AI.vcf -R hg38.fa -A grch38 -D 300 -M 1

4. Perform the detection of deep-intronic variants and PDIVAS prediction

pdivas predict -I examples/ex_vep_AI.vcf -O examples/ex_vep_AI_PD.vcf.gz -F off

5. (Optional) Convert the PDIVAS-annotated VCF file to a TSV file (one gene annotation per line)

pdivas vcf2tsv -I examples/ex_vep_AI_PD.vcf.gz -O examples/ex_vep_AI_PD.tsv

Usage of PDIVAS command line

1. $ pdivas predict

Required parameters:

  • -I: Input VCF (.vcf/.vcf.gz) with variants of interest.
  • -O: Output VCF (.vcf/.vcf.gz) with PDIVAS predictions in the form GENE_ID|PDIVAS_score. Variants in multiple genes have a separate prediction for each gene.

Optional parameters:

  • -F: Filtering function (off/on): output all variants (-F off; default) or only deep-intronic variants with PDIVAS scores (-F on).

Details of PDIVAS INFO field:

ID       Description
GENE_ID  Ensembl gene ID based on GENCODE V41 (GRCh38) or V19 (GRCh37)
PDIVAS   <Predicted result>
         Pattern 1: float value from 0.000 to 1.000 (the higher, the more deleterious)
         <Exceptions> (output with '-F off'; filtered out with '-F on')
         Pattern 2: 'wo_annots', variants lacking VEP or SpliceAI annotations
         Pattern 3: 'out_of_scope', variants outside the PDIVAS annotation scope
                    (chrY, non-coding genes, or non-deep-intronic variants)
         Pattern 4: 'no_gene_match', variants whose gene annotations do not match between VEP and SpliceAI
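
The four patterns can be told apart programmatically. The following Python sketch is an illustration under stated assumptions, not the PDIVAS implementation: the function names are hypothetical, per-gene predictions are assumed to be comma-separated within the INFO value, and an exception label is accepted either alone or after `GENE_ID|`.

```python
EXCEPTION_LABELS = {"wo_annots", "out_of_scope", "no_gene_match"}

def parse_pdivas(value):
    """Parse a PDIVAS INFO value into (gene_id, score_or_label) pairs.

    Assumption: entries are comma-separated, each written as
    GENE_ID|PDIVAS_score, or as a bare exception label without a gene ID.
    """
    results = []
    for entry in value.split(","):
        if "|" in entry:
            gene_id, raw = entry.split("|", 1)
        else:
            gene_id, raw = None, entry
        # Pattern 1 is a float; patterns 2-4 are kept as their text label.
        results.append((gene_id, raw if raw in EXCEPTION_LABELS else float(raw)))
    return results

def has_score(value):
    """True when at least one entry carries a numeric PDIVAS score."""
    return any(isinstance(score, float) for _, score in parse_pdivas(value))
```

A check like `has_score` mirrors the idea behind `-F on`, which keeps only variants that received a numeric score.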

2. $ pdivas vcf2tsv

Required parameters:

  • -I: *Input VCF (.vcf/.vcf.gz) with VEP, SpliceAI, and PDIVAS annotations.
  • -O: Path and file name of the output TSV file.
    *The input VCF is valid only if it was generated through this pipeline.

Interpretation of PDIVAS scores

For more details, see Kurosawa et al. medRxiv 2023.

Threshold  Sensitivity (*1)  Candidates/individual (*2)
>=0.082    95%               26.8
>=0.151    90%               14.5
>=0.340    85%               6.7
>=0.501    80%               4.1
>=0.575    75%               3.0
>=0.763    70%               1.2

(*1) Sensitivities were calculated on curated pathogenic deep-intronic variants in a test dataset.
(*2) Candidates of pathogenic deep-intronic variants were obtained through the process described below. (WGS: Whole-genome sequencing)

Cand_image
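
As a reading aid for the threshold table, the Python sketch below looks up the strictest threshold that a given score reaches. The values are copied from the table; the names `THRESHOLDS` and `strictest_threshold_met` are hypothetical and not part of the PDIVAS command line.

```python
# (threshold, sensitivity in %, candidates per individual), in ascending order
THRESHOLDS = [
    (0.082, 95, 26.8),
    (0.151, 90, 14.5),
    (0.340, 85, 6.7),
    (0.501, 80, 4.1),
    (0.575, 75, 3.0),
    (0.763, 70, 1.2),
]

def strictest_threshold_met(score):
    """Return the highest table row whose threshold the score reaches, or None."""
    met = [row for row in THRESHOLDS if score >= row[0]]
    return met[-1] if met else None
```

For example, a score of 0.6 reaches the >=0.575 row (75% sensitivity, about 3.0 candidates per individual), while a score below 0.082 reaches none of the listed thresholds.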
