Software for predicting LIP tagged aminoacidic residues
Here is the list of libraries used during LIP predictor development, for which we guarantee the software will work.
- python >= 3.6
- pandas >= 0.24.2
- numpy >= 1.16.2
- sklearn >= 0.20.3
- scipy >= 1.2.1
- joblib >= 0.13.2
- requests >= 2.21.0
- biopython >= 1.73
- xssp >= 3.07 (available here)
LIP_predictor is composed of two main scripts:
- train.py
- predict.py
Note: command are given from inside LIP_predictor directory in this guide.
Basic training: takes as input the model saved as default in ./model_files/default.joblib and retrains it with given data.
python3 ./train.py [options] [lip_file]
- lip_file: path to training file, which contains proteins, chains and LIP/non-LIP tag instruction. String;
- -e, --exclude: allows to exclude some proteins from the given lip_file. String;
- -ws, --window_size: define the size of the window used to compute an average of the residues features. Int;
- -rf, --ring_force: forces to download RING data, even if it has already been downloaded. Int in {0,1};
- -rd, --ring_dir: define directory where RING data is stored. String;
- -pd, --pdb_dir: define directory where PDB data is stored. String;
- -cf, --config_file: define a configuration file from which other settings will be loaded. JSON;
Retrain default model (the one specified in default configuration file: RandomForestClassifier with 120 estimators), but excluding the proteins with PDB id equal to 1cee and 1dev. Correct LIP tags are specified in file ./datasets/lips_dataset.txt.
python train.py -e 1cee -e 1dev ./datasets/lips_dataset.txt
Basic prediction takes as input a PDB id and computes LIP score and LIP flag for every aminoacidic residue in the given protein.
python3 ./predict.py [options] [pdb_id]
- pdb_id: PDB id of the protein for which LIP tags will be predicted. String;
- -ws, --window_size: define the size of the window used to compute an average of the residues features. Int;
- -rf, --ring_force: forces to download RING data, even if it has already been downloaded. Int in {0,1};
- -rd, --ring_dir: define directory where RING data is stored. String;
- -pd, --pdb_dir: define directory where PDB data is stored. String;
- -cf, --config_file: define a configuration file from which other settings will be loaded. JSON;
- -of, --out_file: define the file where predictions will be printed out; String;
- -od, --out_dir: define the directory where prediction file will be created and written; String
This example outputs a prediction for protein with PDB id equal to 1cee. The file with the prediction will be find in the current folder with the name 1cee.txt. Note that the latest trained model will be used.
python3 ./predict.py python 1cee -od ./
./config.json
It is possible to provide a custom configuration file for either train.py and predict., JSON formatted. Custom configuration file entries will be overwritten by command line parameters, which are considered to have higher priority.
Inside custom configuration file, it is possible to spcify which kind of model we want to be trained by defining 'model' entry as a dictionary, which has two parameters itself: 'name' and 'args'.
- model.name: Name of the scikit-learn classifier which must be trained. String in ['RandomForestClassifier', 'LogisticRegression', 'MLPCLassifier', 'KNeighborsClassifier', 'SVC', 'QuadraticDiscriminantAnalysis'];
- model.args: Arguments for scikit-learn classifier chosen. Dict;
./modules/feature_extraction.py
LIP_ppredictor adopts a Relational Database kind of approach. First of all, we extract residues we are interested in from PDB database, then we extract features from DSSP application and RING service. We store either PDB information, RING and DSSP features into tables in which the triple (PDB_ID, CHAIN_ID, RES_ID) is the unique identifier of the row. Given this assumption, we build a complete dataset of features by joining every table on the unique identifier trible described above.
Since we work with tables, we strongly leveraged Pandas library provided by Anaconda3 and Python3.
PDB database is downloaded from server using BioPython utils, saved in the appropriate directory (by default ./pdb_files) with extension .ent. Afterwards, it gets read from the stored file and loaded into a Pandas DataFrame.
Table:
- PDB_ID: id of the protein; String;
- MODEL_ID: model contained into the protein; Int;
- CHAIN_ID: chain contained in the model. We kept only the 0-th model; Char;
- CHAIN_LEN: length of the chain. Equal value for every entry given same triple (PDB_ID, MODEL_ID, CHAIN_ID); Int;
- RES_ID: residue id of a specific aminoacid. String casted to Int;
- RES_NAME: name of the aminoacid which composes the residue; String;
- LIP_SCORE: probability of being a Linear Interacting Peptide, assigned by a trained ML model (in case of prediction) or manually flagged (in case of model training). Takes values in [0,1];
- LIP: defines if a residue is a LIP. Takes values in {0,1};
DSSP features extracted using DSSP program through BioPython DSSP interface. Retrieved data are immediately inserted into a Pandas DataFrame.
Table:
- PDB_ID: see PDB;
- CHAIN_ID: see PDB;
- RES_ID: see PDB;
- SEC_STRUCT: defines to which secondary structure the residue belongs. String;
- REL_ASA: RELative Absolute Solvent Accessibility measusres the surface of the residue exposed to the solvent. Takes values in [0,1];
- PHI, PSI: dihedral angles of the aminoacid, expressed in degrees. If the angle does not exist, its value will be 360. Takes values in [0, 360]
- NH_O_1_relidx, NH_O_1_energy, O_NH_1_relidx, O_NH_1_energy, NH_O_2_relidx, H_O_2_energy, O_NH_2_relidx, O_NH_2_energy: energy values. Double;
RING features are extracted by a request to RING APIs. First, the "elaborate" POST request is sent by our application acting as a client. If request responds with a 200 OK status, then client starts polling in order to intercept termination of server's computational phase. Afterwards, computed information is downloaded as a file with .zip extension. Downloaded file is then unzipped and read, and RING table is defined by means of Pandas DataFrame.
Table:
- PDB_ID: see PDB;
- CHAIN_ID: see PDB;
- RES_ID: see PDB;
- EDGE_LOC: locations where contacts are formed; Strings separated by space charachter ('');
- EDGE_TYPE: type of contacts; Strings separated by space charachter (' ');
- INTRA_CONTACTS: number of intra-chain contacts formed by the aminoacid. Int;
- INTER_CONTACTS: number of inter-chain contacts formed by the aminoacid. Int;
- INTRA_INTER_CONTACTS: intra-chain contacts / inter-chain contacts ratio of the aminoacid. Double;
./modules/feature_preprocessing.py
Categorical Features such the name of the residues or the type of secondary structure, have been handled using One-Hot-Encoding procedure (https://en.wikipedia.org/wiki/One-hot). This procedure create one feature for each level of the categorical feature, assigning 1 to the features that represent the categorical values of our instances and 0 to all the others features.
This procedure has been done for the following features:
- Type of residues
- Type of secondary structure
- Type of contacts
In order to get information of the context, a sliding windows by a rolling procedure (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html) with gaussian filtering has been applied to our instances (residues).
Every value has been substituted by a gaussian mean of windows k centered in that value. In this way we take in consideration also the values of the k closest residues, assigning to them a multiplicative factor which goes to 0 as the distance increases and is 1 for the centered value.
Parameters of sliding windows:
- windows size (integer, default = 5): size of the windows (usually an odd number for symmetric purpose).
- std (float, default = 1): standard deviation of the gaussian filter. Higher values mean a minor decrease of the multiplicative factor as the distance increase.
N.B. For the first values and the last values of every chain, it has been create a mirroring of their next/previous residues. E.G:
res1, res2, res3 --> res3, res2, res1, res2, res3
./modules/models
In order to validated our model a LOO-CV has been applied. In our case the "one left out" wasn't a single instances but a whole protein.
For protein p we trained the model with all the proteins available except the protein p and then we used as test that protein p.
A Random Forest Classifier has been used to extract the best features. Since the type of contacts extracted from the RING server didn't result to have meaningful impact (less than 0.001%) we excluded them from the input matrix.
Various algorithms has been used with various parameters (k-Nearest Neighbours, Support Vector Classifier, Linear Discriminant Analysis, Quadratic Discriminant Analysis, MultiLayerPerceptron, Random Forest, Decision Tree, AdaBoost, Logistic Regression).
The best performing algorithm have been Random Forest Classifier and the Multi-Layer Perceptron (MLP) with a balanced accuray of over 0.90 and a f1 score greater than 0.85.
In the end it has been decided to use a Random Forest Classifier because of better performance and less training time required.
After a grid search the best parameters apperead to be a sliding windows between 3 and 7, a standard deviation of the gaussian filtering around 1, a number of estimator around 100 (80-120) and no limits to trees depth and number of leaves.
FINAL MODEL
- Sliding Windows: a) Windows Size = 5 b) Standard Deviation = 1
- Classifier: a) Type = Random Forest Classifier: b) Parameters = {n_estimator: 100}
CURRENT VALIDATION RESULT
For every protein has been computed balanced accuracy and f1 score and the final results are the avarage of this two scores of every protein. Random seed not set
- Balanced Accuracy: 0.930
- F1-Score: 0.894