Prioritizing Copy Number Variants (CNV) using Phenotype and Gene Functional Similarity

lgmgeo/DeepSVP

DeepSVP

DeepSVP is a computational method to prioritize structural variants (SVs) involved in genetic diseases by combining genomic information with information about gene functions. We incorporate phenotypes linked to genes, functions of gene products, gene expression in individual cell types, and anatomical sites of expression. DeepSVP systematically relates them to their phenotypic consequences through ontologies and machine learning.

Training dataset

We train and evaluate our method using human SVs collected from the dbVar database.

Annotation data sources (integrated in the candidate SV prediction workflow)

We integrated the annotations from different sources:

  • Gene Ontology (GO)
  • Uber-anatomy ontology (UBERON)
  • Mammalian Phenotype Ontology (MP)
  • Human Phenotype Ontology (HPO)

This work is done using DL2vec, which converts different types of Description Logic axioms into a graph representation and then generates an embedding for each node and edge type.
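As a rough illustration of the idea (not the DL2vec implementation itself), the sketch below turns two simple kinds of axioms into labeled edges and then generates random walks over the resulting graph. It assumes that embeddings are learned from such walks by a Word2Vec-style model, which is not shown here. The toy axioms, the identifiers, and the helper names `build_graph` and `random_walk` are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy axioms: (subject, relation, object).
# "subClassOf" stands for "A SubClassOf B"; any other relation R stands for
# an axiom of the form "A SubClassOf (R some B)".
AXIOMS = [
    ("GO:B", "subClassOf", "GO:A"),
    ("GO:C", "subClassOf", "GO:B"),
    ("GO:C", "part_of", "GO:D"),
    ("gene:1", "has_function", "GO:C"),
]


def build_graph(axioms):
    """Map each node to its outgoing (edge label, target node) pairs."""
    graph = defaultdict(list)
    for source, relation, target in axioms:
        graph[source].append((relation, target))
    return graph


def random_walk(graph, start, length, rng):
    """Follow up to `length` edges, recording nodes and edge labels in order."""
    walk = [start]
    node = start
    for _ in range(length):
        edges = graph.get(node)
        if not edges:
            break  # dead end: this node has no outgoing edges
        relation, node = rng.choice(edges)
        walk.extend([relation, node])
    return walk


graph = build_graph(AXIOMS)
rng = random.Random(0)  # fixed seed so repeated runs give the same walks
for _ in range(3):
    print(" ".join(random_walk(graph, "gene:1", 4, rng)))
```

Because each walk interleaves node identifiers and edge labels, a model trained on these sequences would see both kinds of token, which is one way to obtain a vector for every node and every edge type.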

We collected genomic features using the public AnnotSV tool (v2.2).

Installation

Using pip version 20.3.1:

pip install deepsvp

Or you can create a dedicated Conda environment (e.g. named "deepsvp-py38-pip2031"):

conda create -n deepsvp-py38-pip2031 python=3.8 pip=20.3.1
conda activate deepsvp-py38-pip2031
pip3 install deepsvp
pip3 install networkx
pip3 install torch
pip3 list
conda deactivate
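As an alternative to reading the `pip3 list` output by eye, the following minimal sketch reports whether the three distributions installed above are present. It should be run inside the activated environment. The helper name `installed_versions` is hypothetical; the distribution names are the ones used in the pip commands above.

```python
from importlib import metadata


def installed_versions(distributions):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip commands in this section.
for name, version in installed_versions(["deepsvp", "networkx", "torch"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```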

Running the DeepSVP prediction model

  • Download the data archives and unpack them into a folder named "data":
mkdir DeepSVP/          ;# /path_of_your_DeepSVP_repository/
cd DeepSVP
wget "https://bio2vec.cbrc.kaust.edu.sa/data/DeepSVP/data.zip"
unzip data.zip
cd data                 ;# /path_of_your_DeepSVP_data_repository/
wget "https://bio2vec.cbrc.kaust.edu.sa/data/DeepSVP/experiments.zip"   # can be very long
unzip experiments.zip
  • Download and install the required AnnotSV (v2.3) tool in the "data" folder:
cd /path_of_your_DeepSVP_data_repository/
git clone git@github.com:lgmgeo/AnnotSV.git --branch v2.3
cd AnnotSV/
make PREFIX=. install
make DESTDIR= PREFIX=. install-human-annotation
cd ..
  • Add genomic features to your VCF input file (/path_and_name_of_your_vcf_input_file/) using AnnotSV (v2.3):

e.g. /path_and_name_of_your_vcf_input_file/ = ./input.vcf

e.g. /path_and_name_of_your_annotsv_output_file/ = ./data/output.annotsv.annotated.tsv

export ANNOTSV=/path_of_your_DeepSVP_data_repository/AnnotSV
$ANNOTSV/bin/AnnotSV -SVinputFile ./input.vcf -genomeBuild GRCh38 -outputFile ./data/output.annotsv.annotated.tsv

The AnnotSV-annotated output file (./data/output.annotsv.annotated.tsv) must be located in the data folder (/path_of_your_DeepSVP_data_repository/).
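Before running the prediction, it can help to confirm that the annotated file is a non-empty tab-separated table. The following is a minimal sketch that reads only the header and counts the data rows. It makes no assumption about AnnotSV's column names, and the function name `summarize_tsv` is hypothetical.

```python
def summarize_tsv(path):
    """Return (column names, number of non-empty data rows) of a TSV file."""
    with open(path, encoding="utf-8") as handle:
        header = handle.readline().rstrip("\r\n").split("\t")
        n_rows = sum(1 for line in handle if line.strip())
    return header, n_rows


# Example call (the path is the one used in the AnnotSV command above):
# columns, n_rows = summarize_tsv("./data/output.annotsv.annotated.tsv")
# print(len(columns), "columns,", n_rows, "data rows")
```

A row count of zero, or a header with a single column, would suggest that the annotation step did not produce the expected table.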

  • Run the command deepsvp --help to display help and parameters:
Usage: deepsvp [OPTIONS]
      
     DeepSVP: A phenotype-based tool to prioritize caustive CNV using WGS data
     and Phenotype/Gene Functional Similarity
  
Options:
    -d, --data-root TEXT      Data root folder  [required]
    -i, --in-file TEXT        Annotated Input file  [required]
    -p, --hpo TEXT            List of phenotype ids separated by commas
                              [required]
    -maf, --maf_filter FLOAT  Allele frequency filter using gnomAD and 1000G
                              default<=0.01
    -m, --model_type TEXT     Ontology model, one of the following (go , mp ,
                              hp, cl, uberon, union), default=mp
    -ag, --aggregation TEXT   Aggregation method for the genes within CNV (max
                              or mean) default=max
    -o, --outfile TEXT        Output result file
    --help                    Show this message and exit.        
  • Run the example (with your own HPO terms):
    deepsvp -d data/ -i output.annotsv.annotated.tsv -p HP:0003701,HP:0001324,HP:0010628,HP:0003388,HP:0000774,HP:0002093,HP:0000508,HP:0000218 -m cl -maf 0.01 -ag max -o example_output.txt

Or run the example with the deepsvp-py38-pip2031 Conda Environment:

conda activate deepsvp-py38-pip2031
deepsvp -d data/ -i output.annotsv.annotated.tsv -p HP:0003701,HP:0001324,HP:0010628,HP:0003388,HP:0000774,HP:0002093,HP:0000508,HP:0000218 -m cl -maf 0.01 -ag max -o example_output.txt
conda deactivate

Or, to use cwl-runner, edit the input file in the example YAML file deepsvp.yaml and then run:

cwl-runner deepsvp.cwl deepsvp.yaml 
|========                        | 25% Reading the input phenotypes...
|================                | 50% Phenotype prediction... 
|========================        | 75% CNV Prediction... 
|================================| 100% DONE! You can find the prediction results in the output file: example_output.txt
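When the command is assembled from a script, a mistyped HPO identifier or option value is easy to miss. The sketch below checks the inputs and builds the argument list using only the options shown in the `deepsvp --help` output above. The function name `build_deepsvp_command` is hypothetical, and the check assumes the standard HPO identifier format of `HP:` followed by seven digits.

```python
import re
import shlex

HPO_ID = re.compile(r"HP:\d{7}")  # e.g. HP:0003701
MODEL_TYPES = {"go", "mp", "hp", "cl", "uberon", "union"}
AGGREGATIONS = {"max", "mean"}


def build_deepsvp_command(data_root, in_file, hpo_ids, model_type="mp",
                          maf=0.01, aggregation="max", outfile=None):
    """Validate the inputs and return the deepsvp command as an argument list."""
    bad = [term for term in hpo_ids if not HPO_ID.fullmatch(term)]
    if not hpo_ids or bad:
        raise ValueError(f"missing or malformed HPO ids: {bad}")
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type}")
    if aggregation not in AGGREGATIONS:
        raise ValueError(f"unknown aggregation method: {aggregation}")
    command = ["deepsvp", "-d", data_root, "-i", in_file,
               "-p", ",".join(hpo_ids), "-m", model_type,
               "-maf", str(maf), "-ag", aggregation]
    if outfile:
        command += ["-o", outfile]
    return command


command = build_deepsvp_command(
    "data/", "output.annotsv.annotated.tsv",
    ["HP:0003701", "HP:0001324"], model_type="cl",
    outfile="example_output.txt",
)
print(shlex.join(command))  # shell-quoted form, for copying into a terminal
```

The returned list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting problems.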

Output:

The script outputs a ranking and a score for each candidate causative CNV.
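To make the effect of the `-ag` option concrete, the sketch below aggregates per-gene scores within each CNV by max or mean and then ranks the CNVs by the aggregated value. It is only an illustration: the scores, the CNV names, and the functions `aggregate` and `rank_cnvs` are hypothetical, and the sketch does not reproduce how DeepSVP computes its final score.

```python
from statistics import mean

# Hypothetical per-gene scores for the genes overlapped by each CNV.
GENE_SCORES = {
    "cnv_1": [0.20, 0.90, 0.40],
    "cnv_2": [0.60, 0.70],
    "cnv_3": [0.10],
}


def aggregate(scores, method="max"):
    """Combine the gene scores of one CNV into a single value."""
    if method == "max":
        return max(scores)
    if method == "mean":
        return mean(scores)
    raise ValueError(f"unknown aggregation method: {method}")


def rank_cnvs(gene_scores, method="max"):
    """Return (CNV, aggregated score) pairs sorted from highest to lowest."""
    totals = {cnv: aggregate(scores, method) for cnv, scores in gene_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for method in ("max", "mean"):
    print(method, [cnv for cnv, _ in rank_cnvs(GENE_SCORES, method)])
```

With these toy numbers, max ranks cnv_1 first because of its single high-scoring gene, while mean ranks cnv_2 first because its genes score consistently well.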

Scripts

  • Details for predicting pathogenic variants and comparison with other methods can be found in the experiment folder.
  • annotations.sh: script to annotate the variants.
  • data_preprocessing.py: script to preprocess the annotations and features.
  • pheno_model.py: script to get the DL2vec score using the trained model.
  • deepsvp_training.py: script to train and test the model, with hyperparameter optimization.
  • BWA_GATK.sh: script to run the GATK workflow on the input FASTQ files of the real samples; run on KAUST Supercomputing IBEX.
  • run_Manta.sh: script to generate a VCF file with the candidate structural variants (SVs), identified using Manta; run on KAUST Supercomputing IBEX.

Final notes

For any questions or comments please contact: [email protected]
