Latest News:
(08.02.2021) ANGEL is no longer actively supported. Use at your own risk. We recommend either switching to GeneMark or if you have a reference genome & annotation run SQANTI3 which includes ORF prediction as part of the process.
(08.26.2020) ANGEL is now using Python 3.7+!
(06.27.2014) See this talk on validating PacBio transcript-based ORF predictions with mass spec data! ANGEL was used for creating the set of ORF predictions from the MCF-7 dataset. G. Sheynkman: Building a "perfect" proteomics database using PacBio MCF-7 transcriptome data
Last Updated: 08/02/2021
Current version: 3.0
===
The program is divided into three parts :
-
Dumb ORF prediction: ORF prediction that outputs the longest ORF (or first ORF) in all frames. Can be used to create a top training dataset.
-
ANGEL classifier training: Train a coding potential classifier based on given training data. Must provide both CDS and UTR for positive and negative training set.
-
ANGEL robust ORF prediction: ORF prediction based on both the ANGEL classifier and the dumb ORF prediction.
The ANGEL classifier is based on the classifier described in Shimizu et al., "ANGLE: a sequencing errors resistant program for predicting protein coding regions in unfinished cDNA.", J Bioinform Comput Biol, (2006). The naming change to ANGEL is intentional to differentiate it from the author implementation of ANGLE.
NOTE: for both dumb and ANGEL ORF prediction, it is recommended that the input sequences have at least 99% accuracy. This means either short read assembled transcripts or for PacBio, the output from running the Iso-Seq pipeline which are Quiver-polished high-quality consensus sequences. ANGEL has not been tested on PacBio subread-level or ReadsOfInsert-level sequences.
You need to install CD-HIT and have it available in your $PATH
variable to run dumb ORF prediction.
The python dependencies for ANGEL are:
- numpy
- Biopython
- scikit-learn
You can install them using several ways, but the most recommended one is using package manager Anaconda (similar to what is done for Cogent).
In fact, if you already have Cogent installed, the only thing you need to do is activate the environment and install scikit-learn
.
We will first install Anaconda and create a virtualenv under it. The install directions are the same as those for Cogent, so I'm naming the environment AnaCogent
as well.
(1) Download Anaconda latest version.
(2) Install Anaconda according to tutorial
bash ~/Anaconda3-2020.02-Linux-x86_64.sh
export PATH=$HOME/anaconda3/bin:$PATH
For the export
line, you may want to add them to .bashrc
or .bash_profile
in your home directory. Otherwise you will need to type it everytime you log in.
(3) Confirm conda is installed and update conda
conda -V
conda update conda
(4) Create a conda environment. I will call it anaCogent
. Type y
to agree to the interactive questions.
conda create -n anaCogent python=3.7 anaconda
source activate anaCogent
Once you have activated the virtualenv, you should see your prompt changing to something like this:
(anaCogent)-bash-4.1$
(5) Install additional libraries that we need
conda install -n anaCogent biopython
conda install -n anaCogent scikit-learn
conda install -n anaCogent -c anaconda cython
(6) Clone ANGEL repo and install
git clone https://github.com/Magdoll/ANGEL.git
cd ANGEL
python setup.py build
python setup.py install
The alternative to Anaconda is to set up and activate a virtual environment before installation. See here for installation details.
Once you have the virtualenv activated, install the dependencies using pip
:
pip install numpy
pip install biopython
pip install scikit-learn
You can download this GitHub repository in many ways. Here we assume you will be using git clone
:
git clone https://github.com/Magdoll/ANGEL.git
cd ANGEL
python setup.py build
python setup.py install
dumb_predict.py
takes as input a FASTA file. It outputs the longest ORF (or first ORF) that exceed the user-defined minimum length and have a positive log-odds scores based on hexamer frequencies.
Usage:
dumb_predict.py <fasta_filename> <output_prefix>
[--min_aa_length MIN_AA_LENGTH]
[--use_firstORF]
[--use_rev_strand] [--cpus CPUS]
See the provided training example below. By default, only the forward strand is used. This is especially true for PacBio transcriptome sequencing output that often already has correct strand.
cd ANGEL/training_example
dumb_predict.py test.human_1000seqs.fa test.human.dumb --min_aa_length 300 --cpus 24
The output consists of test.human.dumb.final.pep
, test.human.dumb.final.cds
and test.human.dumb.final.utr
, which are the results of longest ORF prediction.
By default, the longest ORF (regardless of frame) is chosen as output. This longest ORF must exceed the minimum length (--min_aa_length
) and also have a positive log-odds score. If the --use_firstORF
option is turned on, however, then no log-odds score is computed; instead, the first ORF (regardless of frame) that exceeds the minimum length is output. The --use_firstORF
option is useful in cases where users believe the first ORF, and not the longest ORF, is the more correct prediction.
Redundancy in the training dataset, such as highly identical CDS sequences from alternative isoforms or homologous genes, can skew the classifier training. The script angel_make_training_set.py
clusters an input set of CDS sequences into non-redundant clusters, and outputs a selective subset for training data.
Note that the training dataset does not have to be the same as the input to ANGEL ORF prediction below. You can use any curated dataset like Gencode, RefSeq, or others so long as they are from the same or similar species so that the classifier can be trained properly.
Usage:
angel_make_training_set.py <input_prefix> <output_prefix>
[--use_top USE_TOP] [--random] [--cpus CPUS]
Using the output from dumb ORF prediction above, the command is:
angel_make_training_set.py test.human.dumb.final test.human.dumb.final.training --random --cpus 24
Here we use the --random
parameter to randomly select non-redundant CDS sequences, instead of choosing the longest CDS sequences.
The output files are: test.human.dumb.final.training.cds
, test.human.dumb.final.training.utr
and test.human.dumb.final.training.pep
.
angel_train.py
takes a CDS FASTA file and a UTR FASTA file and outputs a trained classifier pickle file.
Usage:
angel_train.py <cds_filename> <utr_filename> <output_pickle> [--cpus CPUS]
Example:
angel_train.py test.human.dumb.final.training.cds test.human.dumb.final.training.utr \
test.human.dumb.final.classifier.pickle --cpus 12
On a typical 500-sequence training dataset, the training may take several hours.
NOTE I have found that sometimes angel_train.py
hangs if you use more than 12 cores (regardless of memory usage), possibly due to issues with Python's multiprocessing.
Usage:
angel_predict.py <input_fasta> <classifier_pickle> <output_prefix>
[--min_angel_aa_length] [--min_dumb_aa_length]
[--max_angel_secondORF_distance]
[--use_rev_strand] [--output_mode [best,all]]
[--cpus]
For each sequence, ANGEL uses the classifier to predict the coding potential of each codon in each of the three frames, then finds the most likely stretch of window that is the open reading frame. It then does the following:
- If there is only one predicted ORF, tag it as
confident
. In this case, it is unlikely there are sequencing errors. - If there are multiple ORFs but only one exceeds
min_angel_aa_length
threshold, output that one ORF and tag it aslikely
. In this case, the program is semi-positive that there are no sequencing errors or that the error occurs at the ends of the CDS, allowing successful prediction of a relatively long continuous ORF. - If there are multiple ORFs (which could be in multiple frames) exceeding the length threshold, output all of them (provided the later ORFs are close enough to the upstream ORFs by the threshold determined by
max_angel_secondORF_distance
) and tag them assuspicious
. In this case, the ORFs could be genuine complete ORFs (indicating a polycistronic transcript or alternative ORFs) or fragments of the same ORF that has frame shift due to uncorrected sequencing errors.
The longest ORF length from the ANGEL process is recorded as T
. Then, the same dumb ORF prediction (which simply looks for the longest stretch of ORF without stop codon interruption) is done on the forward three frames. The longest predicted dumb ORF is also outputted if, and only if, its length is greater than both T
and min_dumb_aa_length
. This is a fallback process in case the ANGEL classifier failed to detect coding potential in the CDS region.
If --use_rev_strand
is used, then the same process is repeated on the reverse-complement of the sequence.
The --output_mode
option is used to determine what to output when there are ORF predictions from both ANGEL and dumb and possibly the reverse strand (if --use_rev_strand
is on).
By default, the --output_mode=best
will pick the longest ORF, whether that be from ANGEL or dumb, plus or reverse (if --use_rev_strand
is on) strand.
If --output_mode=all
, then all predictions will be output as long as they are longer than the threshold length (for ANGEL, min_angel_aa_length
and for dumb, min_dumb_aa_length
).
The --max_angel_secondORF_distance
parameter is used only in the suspicious
case (cases where there is more than one ORF predicted) to determine whether or not to output the downstream ORFs based on how close they are to the upstream ORFs. The reasoning is, if the ORF prediction is broken up because of sequencing errors or remaining base errors, then the two ORFs should be very close to each other (because ANGEL expects there to be only a handful of base errors, not long stretches of errors). If the second ORF is very far from the first ORF (say more than 20 bp downstream), then it is not likely to be from sequencing errors. Rather the first ORF (with its stop codon) is the correct prediction. By default max_angel_secondORF_distance
is set to 10 bp. So any downstream ORF that is further than 10 bp away from the upstream ORF will be discarded.
Example:
angel_predict.py test.human_1000seqs.fa human.MCF7.random_500_for_training.pickle test.human \
--use_rev_strand --output_mode=best --min_angel_aa_length 100 --min_dumb_aa_length 100
The output files are: <output_prefix>.ANGEL.cds
, <output_prefix>.ANGEL.pep
, and <output_prefix>.ANGEL.utr
.
Each output sequence ID has the format:
<seq_id> type:<tag>-<completeness> len:<ORF length (aa)> strand:<strand> pos:<CDS range>
Where tag
is confident
, likely
, or suspicious
for ANGEL predictions, and dumb
for dumb ORF predictions.
completeness
is either complete
, 5partial
, 3partial
, or internal
based on the presence or absence of start and stop codons.
If a high-quality genome is available and you want to additionally run ANGEL on a genomic version of the transcripts, you can use the err_correct_w_genome.py
script from cDNA_Cupcake. cDNA_Cupcake is a light-weight repository for simple manipulation scripts. See the err_correct_w_genome tutorial on how to obtain a genomic version first.
Once you have the genome-corrected version, you can run ANGEL separately on it using either just dumb_predict.py
(if you trust the genome bases and alignment exon boundaries to be precise) or angel_predict.py
(to still allow for some errors).
With two versions of ANGEL output (one from the PacBio version, one from the genomic version), you can use the script angel_compare_files.py
to pick the longest ORF for each sequence from the two files.
angel_comapre_files.py <ANGEL_prefix1> <ANGEL_prefix2> <output_prefix>
Below is an example:
# run ANGEL on PacBio sequence, now obtain test1.cds, test1.utr, test1.pep
# run ANGEL on genomic sequence, now obtain test2.cds, test2.utr, test2.pep
angel_compare_files.py test1 test2 test3
The result will be output to: test3.cds, test3.pep, test3.utr, and test3.compare.txt
#################################################################################$$
# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
#
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted (subject to the limitations in the
# disclaimer below) provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above
# copyright notice, this list of conditions and the following
# disclaimer in the documentation and/or other materials provided
# with the distribution.
#
# * Neither the name of Pacific Biosciences nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
#################################################################################$$