Struct2GO:protein function prediction based on Graph pooling algorithm and AlphaFold2 structure information
Struct2GO is a protein function prediction model based on self-attention graph pooling, which utilizes structural information from AlphaFold2 to augment the accuracy and generality of the model's predictions.
- Protein structure: download from the AlphaFold Protein Struct Databasee
- Protein sequence: download from the UniProt website
- Protein annotion: down from the GOA website
- Gene Ontology: download from the GO website
We put the processed data for train and test on there
We put the Source Data there
predicted_struct_protein_data.tar.gz、protein_contact_map.tar.gz、struct_feature.tar.gz supplement there
include:
File/Folder name | Description |
---|---|
predicted_struct_protein_data | Alphafold2 predicted human protein 3D structure datasets. |
protein_contact_map | Computed CA-CA protein contact map. |
struct_feature | Protein structural features. |
dict_sequence_feature | Protein sequence features. |
gos_bp.csv | GO terms corresponding to all human proteins in the BP branch. |
gos_mf.csv | GO terms corresponding to all human proteins in the MF branch. |
gos_cc.csv | GO terms corresponding to all human proteins in the CC branch. |
Run the run_train.sh
script directly to train the model(e.g. for MFO)
python run_train.sh
Note: Remember to update the file directory in the script to your local directory if you wish to run the MFO model or the other two models.
Run the run_test.sh
scirpy directly to evaluation the model(e.g. for MFO)
python run_test.sh
Note: Remember to update the file directory in the script to your local directory if you wish to evaluation the MFO model or the other two models.
we provide the proccesed data for training and evaluating directly there, and then we will explain how to process the raw data.
- Download protein structure data and convert the three-dimensional atomic structure of proteins into protein contact maps.
cd ./data_processing
python predicted_protein_struct2map.py
- Obtain amino acid residue-level features through the Node2vec algorithm.
cd ./angel-master/spark-on-angel/example/local/Node2VecExample.scala
(ps:run it by the IntelLLiJ IDEA )
cd .data_processing
python sort.py
- Download protein sequence data obtain protein sequence features through the Seqvec model.
cd ./data_processing
python seq2vec.py
cd ./model
python labels_load.p
cd ./data_processing
python divide_data.py