A Deep Learning Framework for Sequence-Based Protein Crystallization Prediction
Protein structure determination has primarily been performed using X-ray crystallography. To overcome the high cost, high attrition rate and repeated trial-and-error settings, many in-silico methods have been developed to predict the crystallization propensity of proteins from their sequences. However, the majority of these methods build their predictors by extracting features from protein sequences, which is computationally expensive and can explode the feature space.
We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It uses deep learning to identify proteins that can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from sequences. Our model is based on Convolutional Neural Networks (CNNs), which can exploit k-mer structure and interactions among sets of k-mers in the raw protein sequences.
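To make the k-mer idea concrete, here is a minimal sketch of how a single one-dimensional convolution filter of width k scores every k-mer window of a one-hot encoded sequence. It is an illustration only and does not reproduce the DeepCrystal architecture: the function names, the hand-built kernel and the choice of the 20 standard residues as the alphabet are all assumptions made for this example. In a trained CNN the kernel weights are learned rather than set by hand.

```python
# Illustrative only: not the DeepCrystal architecture, just the sliding-window idea.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues


def one_hot(sequence):
    """Encode each residue as a 20-element 0/1 vector (unknown residues stay all zeros)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1.0
        encoded.append(row)
    return encoded


def conv1d(encoded, kernel):
    """Slide a (k x 20) kernel over the encoded sequence, giving one score per k-mer window."""
    k = len(kernel)
    outputs = []
    for start in range(len(encoded) - k + 1):
        window = encoded[start:start + k]
        score = sum(
            window[i][j] * kernel[i][j]
            for i in range(k)
            for j in range(len(AMINO_ACIDS))
        )
        outputs.append(score)
    return outputs


def kmer_kernel(kmer):
    """Hypothetical hand-built kernel that responds most strongly to one specific k-mer."""
    return one_hot(kmer)


if __name__ == "__main__":
    sequence = "MPKFYCDYCDTY"
    activations = conv1d(one_hot(sequence), kmer_kernel("CDY"))
    # The highest activation marks the window where the k-mer "CDY" occurs.
    print(activations)
```

Stacking several such filters, and further layers on top of their outputs, is what lets a CNN respond to combinations of k-mers rather than to a single one.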
Our model surpasses previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy and MCC on three independent test sets. Compared with its closest competitors, Crysalis II and Crysf, DeepCrystal achieves average improvements of 1.4% and 12.1% in recall, respectively. It also attains average improvements of 2.1% and 6.0% in F-score, 1.9% and 3.9% in accuracy, and 3.8% and 7.0% in MCC over Crysalis II and Crysf, respectively, on the independent test sets.
A web server is also available at https://deeplearning-protein.qcri.org
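For readers unfamiliar with the reported metrics, the sketch below computes recall, F-score, accuracy and MCC (Matthews correlation coefficient) from confusion-matrix counts using their standard definitions. It assumes that "F-score" means the F1 score, and the counts in the usage example are made up for illustration; they are not results from DeepCrystal or any other predictor.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return recall, precision, F1, accuracy and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # MCC is defined as 0 here when any marginal count is zero.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in classification_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.4f}")
```

MCC uses all four cells of the confusion matrix, which makes it more informative than accuracy when the crystallizable and non-crystallizable classes are imbalanced.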
Be sure the following tools are installed on your machine:
- wget, git, unzip
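If you prefer to check for these tools programmatically, the following sketch uses only the Python standard library. The function name `missing_tools` is made up for this example and is not part of DeepCrystal.

```python
import shutil

REQUIRED_TOOLS = ["wget", "git", "unzip"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Please install: " + ", ".join(missing))
    else:
        print("All required tools are available.")
```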
1- Get the Anaconda (64-bit) installer with Python 3.x for Linux: https://www.anaconda.com/download/#linux
2- Run the installer with bash Anaconda3-5.2.0-Linux-x86_64.sh and follow the instructions to install Anaconda in your preferred directory.
- git clone https://github.com/elbasir/DeepCrystal.git
- cd DeepCrystal
- export PATH=<your_anaconda_folder>/bin:$PATH
- conda env create -f environment.yml
- source activate deepCrystal
To test DeepCrystal on a FASTA file, you need to run it from inside the deepCrystal environment. To leave the environment when you are finished:
- source deactivate deepCrystal
1- Protein sequences have to be saved in FASTA format, as in the following example:
.>Seq1
MPKFYCDYCDTYLTHDSPSVRKTHCSGRKHKENVKDYYQKWMEEQAQSLIDKTTAAFQQG
where '>Seq1' is the FASTA ID and the second line is the protein sequence.
2- Download the model files (all *.hdf5 and *.json files) by running the following command:
wget https://storage.entrydns.org/nextcloud/index.php/s/3ErNEaZiKp39x4N/download
3- Run the following two commands after downloading the model files:
- unzip download
- rm download
4- To test your protein sequences using Test.py, run the following command:
$ python Test.py <file.fasta>
5- The output will be generated in the current working directory. The name of the output file is prediction_results.csv.
Sequence ID | Prediction |
---|---|
Seq1 | 0.7230646491 |
6- If you run Test.py on the test.fasta file provided in this repository, you can compare your results with Expected_Prediction_Result.csv, which is also provided.
7- When you run Test.py, you will see some warnings; they do not affect your results. Examples of these warnings are in expected_warnings.txt.
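Once prediction_results.csv has been written, it can be post-processed with a few lines of Python. The sketch below rests on several assumptions that you should check against your own output file: that the file is comma-separated, that its header uses the column names "Sequence ID" and "Prediction" shown in the table above, and that a higher score means the protein is more likely to crystallize. The 0.5 cut-off is an arbitrary illustrative value, not a threshold recommended by DeepCrystal.

```python
import csv


def load_predictions(handle, id_column="Sequence ID", score_column="Prediction"):
    """Read sequence IDs and scores from an open CSV file (column names are assumed)."""
    reader = csv.DictReader(handle)
    return {row[id_column]: float(row[score_column]) for row in reader}


def split_by_threshold(predictions, threshold=0.5):
    """Split IDs into those scoring at or above the threshold and those below it."""
    above = [seq_id for seq_id, score in predictions.items() if score >= threshold]
    below = [seq_id for seq_id, score in predictions.items() if score < threshold]
    return above, below


if __name__ == "__main__":
    with open("prediction_results.csv", newline="") as handle:
        predictions = load_predictions(handle)
    above, below = split_by_threshold(predictions)
    print("At or above threshold:", above)
    print("Below threshold:", below)
```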
- Following the same steps as in the section "Creating deepCrystal environment", you can train the model on your own data using Train.py.
- Train.py and the FASTA file have to be in the same directory.
- To train the model on your own data, run the following command:
$ python Train.py <file.fasta>
A simple example of how the FASTA file should look:
.>Seq1 Crystallizable
MERVAVVGVPMDLGANRRGVDMGPSALRYARLLEQLEDLGYTVEDLGDVPVSLARASRRRGRGLAYLEEIRAAALVLKERLAALPEGVFPIVLGGDHSLSMGSVAGAARGRRVGVVWVDAHADFNTPETSPSGNVHGMPLAVLSGLGHPRLTEVFRAVDPKDVVLVGVRSLDPGEKRLLKEAGVRVY
.>Seq2 Non Crystallizable
MPRSLKKGVFVDDHLLEKVLELNAKGEKRLIKTWSRRSTIVPEMVGHTIAVYNGKQHVPVYITENMVGHKLGEFAPTRTYRGHGKEAKATKKK
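Before training, it can help to confirm that every record in your file carries a recognizable label. The sketch below parses a labeled FASTA file of the kind shown above, where each header line starts with '>' followed by the ID and the label. It is not how Train.py reads its input; the function name, the tuple layout and the mapping of "Crystallizable" to 1 and "Non Crystallizable" to 0 are assumptions made for this example.

```python
LABELS = {"Crystallizable": 1, "Non Crystallizable": 0}  # assumed numeric encoding


def parse_labeled_fasta(lines):
    """Return a list of (sequence_id, label, sequence) tuples from FASTA lines."""
    records = []
    seq_id, label, chunks = None, None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                records.append((seq_id, label, "".join(chunks)))
            seq_id, _, label_text = line[1:].partition(" ")
            label_text = label_text.strip()
            if label_text not in LABELS:
                raise ValueError(f"Unrecognized label in header: {line!r}")
            label = LABELS[label_text]
            chunks = []
        else:
            if seq_id is None:
                raise ValueError("Sequence data found before the first header line")
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, label, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = [
        ">Seq1 Crystallizable",
        "MERVAVVGVPMDLGANRRGVDMGPSALRYARLLEQLEDLGYTV",
        ">Seq2 Non Crystallizable",
        "MPRSLKKGVFVDDHLLEKVLELNAKGEKRLIKTWSRRSTIVPEM",
    ]
    for seq_id, label, sequence in parse_labeled_fasta(example):
        print(seq_id, label, len(sequence))
```

To check a file on disk, pass the open file object directly, for example `parse_labeled_fasta(open("file.fasta"))`.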
This file contains the architecture of the DeepCrystal model.