DeepMito - Prediction of protein sub-mitochondrial localization using deep learning

The DeepMito Docker Image

Image available on DockerHub: https://hub.docker.com/r/bolognabiocomp/deepmito

Usage of the image

The first step to run the DeepMito Docker container is to pull the container image. To do so, run:

$ docker pull bolognabiocomp/deepmito

Now the DeepMito Docker image is installed in your local Docker environment and ready to be used.

To show the DeepMito help page, run:

$ docker run bolognabiocomp/deepmito -h

usage: deepmito.py [-h] {multi-fasta,pssm} ...

DeepMito: Predictor of protein submitochondrial localization

optional arguments:
  -h, --help          show this help message and exit

subcommands:
  valid subcommands

  {multi-fasta,pssm}  additional help
    multi-fasta       Multi-FASTA input module
    pssm              PSSM input module (one sequence at a time)

The program can be run in two different modes:

  • multi-fasta mode, accepting a FASTA file in input containing one or more sequences. In this mode, DeepMito internally computes a sequence profile using PSIBLAST for each sequence in the input file and then predicts sub-mitochondrial localization.
  • pssm mode, accepting a FASTA file containing a single protein sequence and a pre-computed PSSM file obtained by PSI-BLAST (using -out_ascii_pssm option). In this case, the computation of the sequence profile is skipped. The provided PSSM must be generated from the input sequence (an exception is raised otherwise). Only a single protein sequence can be processed in this mode.
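The help text above has the shape produced by Python's argparse with two sub-parsers. The following is a minimal sketch of such a parser, reconstructed only from the help output; it is not the actual deepmito.py source, and the attribute names (mode, psiblast_pssm) are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser with the same two subcommands as the help text above."""
    parser = argparse.ArgumentParser(
        prog="deepmito.py",
        description="DeepMito: Predictor of protein submitochondrial localization",
    )
    subparsers = parser.add_subparsers(
        dest="mode",  # assumed attribute name, not taken from the DeepMito source
        title="subcommands",
        description="valid subcommands",
        help="additional help",
    )

    multi = subparsers.add_parser("multi-fasta", help="Multi-FASTA input module")
    multi.add_argument("-f", "--fasta", required=True, help="The input multi-FASTA file name")
    multi.add_argument("-d", "--dbfile", required=True, help="The PSIBLAST DB file")
    multi.add_argument("-o", "--outf", required=True, help="The output GFF3 file")

    pssm = subparsers.add_parser("pssm", help="PSSM input module (one sequence at a time)")
    pssm.add_argument("-f", "--fasta", required=True, help="The input FASTA file name (one sequence)")
    pssm.add_argument("-p", "--pssm", dest="psiblast_pssm", required=True, help="The PSIBLAST PSSM file")
    pssm.add_argument("-o", "--outf", required=True, help="The output GFF3 file")
    return parser


# The subcommand comes first, followed by its own options.
args = build_parser().parse_args(
    ["multi-fasta", "-f", "Q9NX14.fasta", "-d", "uniprot_sprot.fasta", "-o", "Q9NX14.out"]
)
print(args.mode, args.fasta, args.dbfile, args.outf)
```

The point to take away is that the mode name (multi-fasta or pssm) must precede the -f, -d, -p and -o options, since those options belong to the subcommand and not to the top-level program.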

Multi-fasta mode

To show the DeepMito help in multi-fasta mode, run:

$ docker run bolognabiocomp/deepmito multi-fasta -h

usage: deepmito.py multi-fasta [-h] -f FASTA -d DBFILE -o OUTF

DeepMito: Multi-FASTA input module.

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        The input multi-FASTA file name
  -d DBFILE, --dbfile DBFILE
                        The PSIBLAST DB file
  -o OUTF, --outf OUTF  The output GFF3 file

As can be seen, the program takes three mandatory arguments:

  • a valid FASTA file containing the protein sequences to analyze;
  • the FASTA file of the sequence database for internal alignment generation (using the PSI-BLAST program);
  • the output file name where predictions will be stored.
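Before launching a run, it can be handy to check how many sequences the input FASTA file actually contains. The snippet below is a small standard-library sketch for that purpose; parse_fasta is a hypothetical helper written for this tutorial and is not part of DeepMito.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) tuples from an iterable of FASTA lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = [">seq1 first protein\n", "MKTLLV\n", "AAG\n", ">seq2 second protein\n", "MAAPL\n"]
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

To use it on a real file, pass an open file handle, e.g. parse_fasta(open("Q9NX14.fasta")). A file with more than one record can only be processed in multi-fasta mode.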

Let's now try a concrete example. First of all, let's download an example sequence from UniProtKB, e.g. Q9NX14:

$ wget https://www.uniprot.org/uniprot/Q9NX14.fasta

Then, we need a large sequence database for building the Multiple Sequence Alignment (MSA) internally used by DeepMito. The MSA will be generated by searching the sequence database for proteins similar to our input protein. The search is internally performed using PSI-BLAST. Since we are using Docker, you don't need to have PSI-BLAST installed on your machine: all requirements are encapsulated into the DeepMito container image!

On our servers, we use the UniRef90 database (release March 2018). Hence, to reproduce the web server output you need to grab this release from the UniProt website. In this tutorial, for simplicity, we will adopt a different (smaller) database, namely the latest release of UniProtKB/Swiss-Prot. To get it, run:

$ wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
$ gunzip uniprot_sprot.fasta.gz

Now, we are ready to predict the sub-mitochondrial localization of our input protein. Run:

$ docker run -v $(pwd):/data/ bolognabiocomp/deepmito multi-fasta -f Q9NX14.fasta -d uniprot_sprot.fasta -o Q9NX14.out

In the example above, we are mapping the current program working directory ($(pwd)) to the /data/ folder inside the container. This will allow the container to see the external FASTA file Q9NX14.fasta and the database file uniprot_sprot.fasta.
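If you prefer to launch the container from a Python script, the same command line can be assembled as a list of arguments. This is only a sketch: deepmito_docker_command is a hypothetical helper, and it assumes that the multi-fasta subcommand is given right after the image name, as in the help output shown earlier.

```python
import os
import shlex


def deepmito_docker_command(fasta, dbfile, outf, workdir=None):
    """Assemble a 'docker run' command that maps workdir to /data/ in the container."""
    workdir = os.path.abspath(workdir or os.getcwd())
    return [
        "docker", "run",
        "-v", f"{workdir}:/data/",          # Docker options go before the image name
        "bolognabiocomp/deepmito",
        "multi-fasta",                        # DeepMito subcommand
        "-f", fasta, "-d", dbfile, "-o", outf,
    ]


cmd = deepmito_docker_command("Q9NX14.fasta", "uniprot_sprot.fasta", "Q9NX14.out")
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Note that the file names are given relative to the mapped folder, so the input files must live in the directory passed as workdir (the current directory by default).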

When DeepMito runs, it generates a BLAST index for the input database (using makeblastdb) if one is not already present. The file Q9NX14.out now contains the DeepMito prediction in GFF3 format:

$ cat Q9NX14.out
##gff-version 3
Q9NX14	DeepMito	Mitochondrion inner membrane	1	153	0.69	.	.	Ontology_term:GO:0005743;evidence=ECO:0000256

Columns are as follows:

  • Column 1: the protein ID/accession as reported in the FASTA input file;
  • Column 2: the name of the tool performing the annotation (i.e. DeepMito);
  • Column 3: the annotated feature along the sequence. Here, the complete input sequence is annotated with the corresponding subcellular localization;
  • Column 4: start position of the feature (always 1);
  • Column 5: end position of the feature (always the sequence length);
  • Column 6: feature annotation score as assigned by DeepMito;
  • Columns 7,8: always empty, reported for compliance with GFF3 format;
  • Column 9: Description field. Gene Ontology Cellular Component terms and evidence codes are reported.
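The column layout described above is easy to read back in a script. The following sketch splits each record on tabs and converts positions and score to numbers; parse_deepmito_gff3 and the Prediction tuple are hypothetical names introduced here, and column 9 is kept as a raw string since its exact syntax is not parsed.

```python
from collections import namedtuple

Prediction = namedtuple(
    "Prediction", "seqid source localization start end score attributes"
)


def parse_deepmito_gff3(lines):
    """Parse DeepMito GFF3 lines into Prediction tuples, skipping '#' header lines."""
    predictions = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
        predictions.append(
            Prediction(
                seqid=cols[0],
                source=cols[1],
                localization=cols[2],
                start=int(cols[3]),
                end=int(cols[4]),
                score=float(cols[5]),
                attributes=cols[8],  # columns 7 and 8 are always '.'
            )
        )
    return predictions


example = [
    "##gff-version 3\n",
    "Q9NX14\tDeepMito\tMitochondrion inner membrane\t1\t153\t0.69\t.\t.\t"
    "Ontology_term:GO:0005743;evidence=ECO:0000256\n",
]
for p in parse_deepmito_gff3(example):
    print(p.seqid, p.localization, p.end, p.score)
```

For a real output file, pass an open handle instead of the example list, e.g. parse_deepmito_gff3(open("Q9NX14.out")).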

PSSM mode

To show the DeepMito help in pssm mode, run:

$ docker run bolognabiocomp/deepmito pssm -h

usage: deepmito.py pssm [-h] -f FASTA -p PSIBLAST_PSSM -o OUTF

DeepMito: PSSM input module.

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        The input FASTA file name (one sequence)
  -p PSIBLAST_PSSM, --pssm PSIBLAST_PSSM
                        The PSIBLAST PSSM file
  -o OUTF, --outf OUTF  The output GFF3 file

Three arguments are required:

  • the input FASTA file containing the single protein sequence to be predicted;
  • a PSSM file previously generated with PSI-BLAST for that sequence;
  • the output GFF3 file where predictions will be stored.

With the protein in the example above (Q9NX14) and the sequence database (uniprot_sprot.fasta), we can create a PSSM file using PSI-BLAST:

$ psiblast -query Q9NX14.fasta -db uniprot_sprot.fasta -out_ascii_pssm Q9NX14.pssm -evalue 0.001 -num_iterations 3
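Since DeepMito raises an exception when the PSSM does not correspond to the input sequence, you may want to check the two files yourself beforehand. The sketch below assumes the usual layout of PSI-BLAST ASCII PSSM files, where each matrix row starts with the residue position followed by the one-letter residue code; pssm_sequence and pssm_matches are hypothetical helpers and do not reproduce DeepMito's own check.

```python
def pssm_sequence(lines):
    """Rebuild the query sequence from the rows of an ASCII PSSM file."""
    residues = []
    for line in lines:
        fields = line.split()
        # Matrix rows: position, residue, then the numeric columns.
        # Header and statistics lines do not start with an integer.
        if (
            len(fields) >= 22
            and fields[0].isdigit()
            and len(fields[1]) == 1
            and fields[1].isalpha()
        ):
            residues.append(fields[1].upper())
    return "".join(residues)


def pssm_matches(pssm_lines, sequence):
    """Return True if the PSSM rows spell exactly the given protein sequence."""
    return pssm_sequence(pssm_lines) == sequence.strip().upper()
```

For example, with the files of this tutorial you could read the sequence from Q9NX14.fasta, then call pssm_matches(open("Q9NX14.pssm"), sequence) before starting DeepMito.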

The generated PSSM can now be used as input to DeepMito in pssm mode:

$ docker run -v $(pwd):/data/ bolognabiocomp/deepmito pssm -f Q9NX14.fasta -p Q9NX14.pssm -o Q9NX14.out

In pssm mode, no sequence database is used to generate the profile, so there is no need to mount a /seqdb/ folder in the container; mounting the working directory on /data/ is enough.

The file Q9NX14.out now contains the DeepMito prediction in GFF3 format as detailed above.

Install and use DeepMito from source

Source code available on GitHub at https://github.com/BolognaBiocomp/deepmito.

Installation and configuration

DeepMito is designed to run on Unix/Linux platforms. The software is written in Python and was tested with Python 3.

To obtain DeepMito, clone the repository from GitHub:

$ git clone https://github.com/BolognaBiocomp/deepmito

This will produce a directory named deepmito. Before running DeepMito, you need to set and export a variable named DEEPMITO_ROOT pointing to the deepmito installation directory:

$ export DEEPMITO_ROOT='/path/to/deepmito'
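A forgotten or misspelled DEEPMITO_ROOT is an easy mistake to make. As an illustration, the following sketch shows how a wrapper script could verify the variable before calling deepmito.py; deepmito_root is a hypothetical helper, and how DeepMito itself reads the variable is not shown here.

```python
import os


def deepmito_root(environ=os.environ):
    """Return the DEEPMITO_ROOT directory, or raise if it is missing or invalid."""
    root = environ.get("DEEPMITO_ROOT")
    if not root:
        raise RuntimeError("DEEPMITO_ROOT is not set; export it before running deepmito.py")
    if not os.path.isdir(root):
        raise RuntimeError(f"DEEPMITO_ROOT does not point to a directory: {root}")
    return root
```

Calling deepmito_root() at the top of a pipeline script makes the failure explicit instead of surfacing later as a missing-file error.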

Before running the program, you need to install the DeepMito dependencies. We suggest using Conda (Miniconda3 is recommended) to create a Python virtual environment and activate it.

To create a conda env for deepmito:

$ conda create -n deepmito

To activate the environment:

$ conda activate deepmito

The following Python libraries/tools are required:

  • biopython (version 1.78)
  • Keras (version 2.4.3)
  • Tensorflow (version 2.2)
  • blast

To install all requirements run the following commands:

$ conda install --yes nomkl keras==2.4.3 biopython==1.78 tensorflow==2.2.0
$ conda install blast -c bioconda
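After installation, you can double-check that the expected library versions ended up in the active environment. This is a rough sketch using importlib.metadata (available from Python 3.8); the distribution names in REQUIRED are assumptions about how the packages register themselves and may need adjusting, and check_versions is a hypothetical helper.

```python
from importlib import metadata

# Versions listed in this README; the distribution names are assumed.
REQUIRED = {"biopython": "1.78", "Keras": "2.4.3", "tensorflow": "2.2"}


def check_versions(required=REQUIRED, get_version=metadata.version):
    """Return a dict of problems: package name -> description (empty if all is fine)."""
    problems = {}
    for name, wanted in required.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # accept an exact match or a more specific release, e.g. 2.2 -> 2.2.0
        if not (found == wanted or found.startswith(wanted + ".")):
            problems[name] = f"found {found}, expected {wanted}"
    return problems


for package, issue in check_versions().items():
    print(f"{package}: {issue}")
```

An empty result means that all three libraries were found with the expected versions. The blast tools are command-line programs and are not covered by this check.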

Now you are able to use deepmito (see the next section). Remember to keep the environment active. If you wish, you can copy the “deepmito.py” script to a directory in your PATH.

Usage

To show the DeepMito help page, run:

$ ./deepmito.py -h

usage: deepmito.py [-h] {multi-fasta,pssm} ...

DeepMito: Predictor of protein submitochondrial localization

optional arguments:
  -h, --help          show this help message and exit

subcommands:
  valid subcommands

  {multi-fasta,pssm}  additional help
    multi-fasta       Multi-FASTA input module
    pssm              PSSM input module (one sequence at a time)

The program can be run in two different modes:

  • multi-fasta mode, accepting a FASTA file in input containing one or more sequences. In this mode, DeepMito internally computes a sequence profile using PSIBLAST for each sequence in the input file and then predicts sub-mitochondrial localization.
  • pssm mode, accepting a FASTA file containing a single protein sequence and a pre-computed PSSM file obtained by PSI-BLAST (using -out_ascii_pssm option). In this case, the computation of the sequence profile is skipped. The provided PSSM must be generated from the input sequence (an exception is raised otherwise). Only a single protein sequence can be processed in this mode.

Multi-fasta mode

To show the DeepMito help in multi-fasta mode, run:

$ ./deepmito.py multi-fasta -h

usage: deepmito.py multi-fasta [-h] -f FASTA -d DBFILE -o OUTF

DeepMito: Multi-FASTA input module.

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        The input multi-FASTA file name
  -d DBFILE, --dbfile DBFILE
                        The PSIBLAST DB file
  -o OUTF, --outf OUTF  The output GFF3 file

Three arguments are accepted:

  • The full path of the input FASTA file containing protein sequences to be predicted;
  • The output GFF3 file where predictions will be stored;
  • The database used to generate sequence profiles using PSI-BLAST.

Let's now try a concrete example. First of all, let's download an example sequence from UniProtKB, e.g. Q9NX14:

$ wget https://www.uniprot.org/uniprot/Q9NX14.fasta

Then, we need a large sequence database for building the Multiple Sequence Alignment (MSA) internally used by DeepMito. The MSA will be generated by searching the sequence database for proteins similar to our input protein. The search is internally performed using PSI-BLAST, which is provided by the blast package installed above.

On our servers, we use the UniRef90 database (release March 2018). Hence, to reproduce the web server output you need to grab this release from the UniProt website. In this tutorial, for simplicity, we will adopt a different (smaller) database, namely the latest release of UniProtKB/Swiss-Prot. To get it, run:

$ wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
$ gunzip uniprot_sprot.fasta.gz

Now, we are ready to predict the sub-mitochondrial localization of our input protein. Run:

$ ./deepmito.py multi-fasta -f Q9NX14.fasta -d uniprot_sprot.fasta -o Q9NX14.out

When DeepMito runs, it generates a BLAST index for the input database (using makeblastdb) if one is not already present.

The file Q9NX14.out now contains the DeepMito prediction in GFF3 format as detailed above:

$ cat Q9NX14.out
##gff-version 3
Q9NX14	DeepMito	Mitochondrion inner membrane	1	153	0.69	.	.	Ontology_term:GO:0005743;evidence=ECO:0000256

PSSM mode

To show the DeepMito help in pssm mode, run:

$ ./deepmito.py pssm -h

usage: deepmito.py pssm [-h] -f FASTA -p PSIBLAST_PSSM -o OUTF

DeepMito: PSSM input module.

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        The input FASTA file name (one sequence)
  -p PSIBLAST_PSSM, --pssm PSIBLAST_PSSM
                        The PSIBLAST PSSM file
  -o OUTF, --outf OUTF  The output GFF3 file

Three arguments are required:

  • the input FASTA file containing the single protein sequence to be predicted;
  • a PSSM file previously generated with PSI-BLAST for that sequence;
  • the output GFF3 file where predictions will be stored.

With the protein in the example above (Q9NX14) and the sequence database (uniprot_sprot.fasta), we can create a PSSM file using PSI-BLAST:

$ psiblast -query Q9NX14.fasta -db uniprot_sprot.fasta -out_ascii_pssm Q9NX14.pssm -evalue 0.001 -num_iterations 3

The generated PSSM can now be used as input to DeepMito in pssm mode:

$ ./deepmito.py pssm -f Q9NX14.fasta -p Q9NX14.pssm -o Q9NX14.out

The file Q9NX14.out now contains the DeepMito prediction in GFF3 format as detailed above.

Please report bugs to: [email protected]
