Gff3 issue #3

Open
wants to merge 58 commits into
base: main
9214986
Update README.md
LarsGab May 7, 2021
708968c
Update README.md
LarsGab May 20, 2021
db602b4
Create LICENSE.TXT
LarsGab May 21, 2021
6a8b7f7
Rename LICENSE.TXT to LICENSE.txt
LarsGab May 21, 2021
4af4633
Update README.md
LarsGab May 21, 2021
ef59c55
Update README.md
LarsGab May 21, 2021
14a8fad
Update README.md
LarsGab Jun 4, 2021
46080da
Update README.md
LarsGab Jun 8, 2021
984d96d
Update README.md
LarsGab Jun 9, 2021
4a47baf
add 2nd config file
LarsGab Jun 28, 2021
7cbcfda
Merge branch 'main' of https://github.com/Gaius-Augustus/TSEBRA into …
LarsGab Jun 28, 2021
6c8866a
Changed lineterminator of output from "\r\n" to "\n"
LarsGab Jul 5, 2021
f126c49
added "gene" line to output
LarsGab Jul 6, 2021
bbc810c
Added script rename_gtf.py to rename TSEBRA outputs.
Jul 16, 2021
1f167ad
Update README.md
LarsGab Jul 16, 2021
304a158
fixed prefix of rename_gtf
Jul 19, 2021
61b988a
fixed adding start/stop codons in genome_anno
LarsGab Aug 5, 2021
879d5c5
added config file
LarsGab Aug 23, 2021
390618e
Create keep_ab_initio.cfg
LarsGab Aug 23, 2021
ee91774
Merge branch 'main' of https://github.com/Gaius-Augustus/TSEBRA into …
LarsGab Aug 24, 2021
4f8e825
reworked documentation about config file
LarsGab Oct 7, 2021
84270af
Update README.md
LarsGab Oct 7, 2021
d263a20
fixed typo in example documentation
LarsGab Oct 13, 2021
95329dd
Merge branch 'main' of https://github.com/Gaius-Augustus/TSEBRA into …
LarsGab Oct 13, 2021
48251b1
fixed UTR format error
LarsGab Nov 1, 2021
f350feb
fixed UTR format error also in rename_gtf
LarsGab Nov 1, 2021
64fe05e
fixed empty spaces in gtf files
LarsGab Nov 8, 2021
336c380
Update README.md
LarsGab Nov 26, 2021
556df12
Update README.md
LarsGab Jan 27, 2022
3780b7d
Update README.md
LarsGab Feb 3, 2022
957476d
Update README.md
LarsGab Feb 3, 2022
4c8a72d
fixed bug finding correct gene clusters of overlapping transcripts
LarsGab Apr 11, 2022
d992b1d
removed print
LarsGab Apr 11, 2022
54e6844
added keep_gtf option
May 2, 2022
b405f14
removed tsebra to require hintfiles
May 2, 2022
ec629e0
fixed geneID bug
LarsGab Jun 1, 2022
df2fc93
Update tsebra.py
LarsGab Aug 13, 2022
847e7ab
Create .gitkeep
LarsGab Aug 14, 2022
eca3f7c
Add files via upload
LarsGab Aug 14, 2022
8c6e934
Add files via upload
LarsGab Aug 14, 2022
7a25c88
Update README.md
LarsGab Aug 14, 2022
cf86504
Add files via upload
LarsGab Aug 14, 2022
238fa57
braker3 update
Oct 26, 2022
1432ff8
add Logo
Oct 26, 2022
0347452
add script for getting longest isoform of all gene loci
Nov 18, 2022
64ecab0
Update README.md
LarsGab Nov 18, 2022
0e6c9bf
fixed get_longest_isoform.py
Nov 22, 2022
2571d01
Update README.md
LarsGab Dec 1, 2022
3abd25d
Update default.cfg
LarsGab Dec 14, 2022
4d0804a
changed default parameter
Jan 11, 2023
1365796
Merge branch 'main' into braker3
LarsGab Jan 11, 2023
ca57ce6
Merge pull request #27 from Gaius-Augustus/braker3
LarsGab Jan 11, 2023
7067946
changed default parameters, added options --score_tab, --filter_singl…
Jan 31, 2023
10e804d
added option to ignore tx phase while detecting overlapping transcripts
Feb 20, 2023
2e8ace5
update for braker3
Mar 1, 2023
bdcbb0a
Update README.md
LarsGab Mar 1, 2023
b0d6c4f
Update README.md
LarsGab Mar 3, 2023
7f4b7fb
help with gff issue
Apr 25, 2023
150 changes: 111 additions & 39 deletions README.md
@@ -1,28 +1,29 @@
# TSEBRA: Transcript Selector for BRAKER

<p align="center">
<img src="docs/TSEBRA_Logo.png" alt="drawing" width="700"/>
</p>

### Introduction
TSEBRA is a combiner tool that selects transcripts from gene predictions based on the support by extrinsic evidence in the form of introns and start/stop codons. It was developed to combine BRAKER1<sup name="a1">[1](#ref1)</sup> and BRAKER2<sup name="a2">[2](#ref2)</sup> predictions to increase their accuracy.
[TSEBRA](https://doi.org/10.1186/s12859-021-04482-0) is a combiner tool that selects transcripts from gene predictions based on the support by extrinsic evidence in the form of introns and start/stop codons. It was developed to combine BRAKER1<sup name="a1">[1](#ref1)</sup> and BRAKER2<sup name="a2">[2](#ref2)</sup> predictions to increase their accuracy.

## Prerequisites
Python 3.5.2 or higher is required.

## Installation
Download TSEBRA:
```console
git clone https://github.com/LarsGab/TSEBRA
```
Or download TSEBRA as submodule of BRAKER with:
```console
git clone --recurse-submodules https://github.com/Gaius-Augustus/BRAKER
git clone https://github.com/Gaius-Augustus/TSEBRA
```

## Usage
The main script is ```./bin/tsebra.py```. For usage information run ```./bin/tsebra.py --help```.

## Input Files
TSEBRA needs a list of gene prediction files, a list of hint files and a configuration file as input.
TSEBRA takes a list of gene prediction files, a list of hint files and a configuration file as mandatory input.

#### Gene Predictions
The gene prediction files need to be in gtf format. This is the standard output format of a BRAKER or AUGUSTUS<sup name="a3">[3,](#ref3)</sup><sup name="a4">[4](#ref4)</sup> gene prediction.
#### Gene Predictions
The gene prediction files have to be in gtf format. This is the standard output format of a BRAKER or AUGUSTUS<sup name="a3">[3,](#ref3)</sup><sup name="a4">[4](#ref4)</sup> gene prediction.

Example:
```console
@@ -34,7 +35,7 @@ Example:
```

#### Hint Files
The hint files have to be in gff format. The last column must include a 'src=' attribute naming the source of the hint and can include a 'mult=' attribute giving the number of hints that support the gene structure segment. This is the standard file format of the ```hintfiles.gff``` in a BRAKER working directory.
The hints files have to be in gff format. The last column must include a 'src=' attribute naming the source of the hint and can include a 'mult=' attribute giving the number of hints that support the gene structure segment. This is the standard file format of the ```hintsfile.gff``` in a BRAKER working directory.

Example:
```console
@@ -46,31 +47,59 @@ Example:
```
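The attribute rules above can be sketched in a few lines of Python. This is only an illustration and not code from TSEBRA: the function name `parse_hint_attributes`, the `;` separator between attributes, and the fallback of counting a hint once when 'mult=' is absent are assumptions, and the sample attribute string is made up.

```python
def parse_hint_attributes(attribute_column):
    """Return (src, mult) from the last column of one hint line.

    Assumption: attributes are written as key=value and separated by ';'.
    """
    attributes = {}
    for field in attribute_column.strip().split(";"):
        field = field.strip()
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    if "src" not in attributes:
        raise ValueError("hint is missing the mandatory 'src=' attribute")
    # 'mult=' is optional; a hint without it is counted once here (assumption).
    mult = int(attributes.get("mult", 1))
    return attributes["src"], mult


# Made-up attribute column, used only to show the call.
src, mult = parse_hint_attributes("src=E;mult=3")
print(src, mult)
```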

#### Configuration File
The configuration file has to include three types of parameters:
1. The weight for each hint source. A weight is set to 1 if the cfg file does not specify a weight for that source.
2. Required fraction of supported introns or supported start/stop-codons for a transcript.
3. Allowed difference between two overlapping transcripts for each feature type.
The configuration file has to include three different sets of parameters:
1. Weights for all sources of hints. The source of a hint is specified by the mandatory 'src=' attribute in the last column of the ```hintsfile.gff``` (see section 'Hint Files'). See section 'Transcript scores' in [TSEBRA](https://doi.org/10.1101/2021.06.07.447316) for more information on how these weights are used.
A weight is set to 1 if the weight for a hint source is not specified in the configuration file.

* *Notes on adjusting these parameters: Increase the weight of the hint sources that have the highest quality. For example, if the protein database includes only species that are remotely related to the target species, the hints produced by BRAKER2 might be less accurate than the RNA-seq evidence. Then, you should increase the weight of the source related to the RNA-seq hints.*


2. Required fractions of supported introns or supported start/stop-codons for a transcript. A transcript is not included in the TSEBRA result if the fractions of introns and start/stop codons supported by extrinsic evidence are lower than the thresholds.

* *Notes on adjusting these parameters: The thresholds for low evidence support are quite strict in the default configuration file. In this configuration, only transcripts with very high evidence support are allowed in the TSEBRA result. In some cases, the default setting might be too strict, so that too many transcripts are filtered out. In this case, you should reduce the threshold of 'intron_support' (e.g., to 0.2).*


3. Allowed difference between two overlapping transcripts for the six transcript scores. TSEBRA compares transcripts via their transcript scores and removes the one with the lower score if their difference exceeds the respective threshold.
Note that it is recommended to choose thresholds in [0,2], since the transcript scores are normalized to [-1,1].

* *Notes on adjusting these parameters: The higher the thresholds are set, the fewer transcripts are filtered by the respective rule. With these thresholds one can adjust the effect of each filtering rule of TSEBRA. As these thresholds are increased, more transcripts are included in the TSEBRA result; in particular, more alternatively spliced isoforms per gene are contained in the result.*
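
The threshold rule from item 3 can be illustrated for a single transcript score as follows. This is a simplified sketch and not TSEBRA's implementation: the function name `transcript_to_remove` is hypothetical, and how TSEBRA combines the six scores and their rules is not reproduced here. It only shows why thresholds in [0,2] make sense when scores lie in [-1,1].

```python
def transcript_to_remove(score_a, score_b, threshold):
    """Compare one score of two overlapping transcripts.

    Returns "a" or "b" for the transcript with the lower score if the
    score difference exceeds the threshold, otherwise None (keep both).
    """
    if abs(score_a - score_b) > threshold:
        return "a" if score_a < score_b else "b"
    return None


# Scores are normalized to [-1, 1], so their difference is at most 2.
# A threshold of 2 can therefore never be exceeded and keeps both transcripts.
print(transcript_to_remove(0.9, 0.2, 0.5))   # difference 0.7 exceeds 0.5
print(transcript_to_remove(-1.0, 1.0, 2.0))  # difference 2.0 does not exceed 2.0
```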



The name and the value of a parameter are separated by a space, and each parameter is listed in a different line.
Example:
```console
# src weights
P 0.1
E 10
C 5
# Weight for each hint source
# Values have to be >= 0
P 1
E 1
C 1
M 1
# Low evidence support
intron_support 0.75
# Required fraction of supported introns
# or supported start/stop-codons for a transcript
# Values have to be in [0,1]
intron_support 0.8
stasto_support 1
# Feature differences
e_1 0
# Allowed difference for each feature
# Values have to be in [0,2]
e_1 0.0
e_2 0.5
e_3 25
e_4 10
e_3 0.096
e_4 0.02
e_5 0.18
e_6 0.18
```
Description of evidence sources in default BRAKER1 and BRAKER2 outputs:
```
E = RNA-seq hints
M = manual hints, these are hints that are enforced during the prediction step of BRAKER,
C = protein hints from proteins with a 'high' spliced alignment score.
P = protein hints from proteins that have a 'good' spliced alignment score,
but that is lower than the score from the ones in 'C'.
```
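As a rough illustration of the file format, the following Python sketch reads a configuration in the layout shown above: one parameter per line, name and value separated by a space, and lines starting with `#` treated as comments. The helper names `parse_config` and `source_weight` are hypothetical and not part of TSEBRA; only the default weight of 1 for unspecified hint sources is taken from the description above.

```python
def parse_config(lines):
    """Parse 'name value' lines into a dict, skipping comments and blank lines."""
    params = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split(None, 1)
        params[name] = float(value)
    return params


def source_weight(params, src):
    """Weight of a hint source; sources missing from the config get weight 1."""
    return params.get(src, 1.0)


example = [
    "# Weight for each hint source",
    "P 1",
    "E 1",
    "# Required fraction of supported introns",
    "intron_support 0.8",
    "e_3 0.096",
]
params = parse_config(example)
print(params["intron_support"], source_weight(params, "M"))
```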


## Use Case
The recommended and most common usage for TSEBRA is to combine the resulting ```braker.gtf``` files of a BRAKER1 and a BRAKER2 run using the hintsfile.gff from both working directories. However, TSEBRA can be applied to any number (>1) of gene predictions and hint files as long as they are in the correct format.
The recommended and most common usage for TSEBRA is to combine the resulting ```augustus.hints.gtf``` files of a BRAKER1 and a BRAKER2 run using the hintsfile.gff from both working directories. However, TSEBRA can be applied to any number (>1) of gene predictions and hint files as long as they are in the correct format.

A common case might be that a user wants to annotate a novel genome with BRAKER and has:
* a novel genome with repeats masked: ```genome.fasta.masked```,
@@ -80,32 +109,75 @@ A common case might be that a user wants to annotate a novel genome with BRAKER
1. Run BRAKER1 and BRAKER2 for example with
```console
### BRAKER1
braker.pl --genome=genome.fasta.masked --hints=rna_seq_hints.gff \
braker.pl --genome=genome.fasta.masked --hints=rna_seq_hints.gff \
--softmasking --species=species_name --workingdir=braker1_out

### BRAKER2
braker.pl --genome=genome.fasta.masked --prot_seq=proteins.fa \
--softmasking --species=species_name --epmode --prg=ph \
braker.pl --genome=genome.fasta.masked --prot_seq=proteins.fa \
--softmasking --species=species_name --epmode \
--workingdir=braker2_out
```
2. Make sure that the gene and transcript IDs of the gene prediction files are in order (this step is optional)
```console
./bin/fix_gtf_ids.py --gtf braker1_out/braker.gtf --out braker1_fixed.gtf
./bin/fix_gtf_ids.py --gtf braker2_out/braker.gtf --out braker2_fixed.gtf
```
3. Combine predictions with TSEBRA
2. Combine predictions with TSEBRA
```console
./bin/tsebra.py -g braker1_fixed.gtf,braker2_fixed.gtf -c default.cfg \
./bin/tsebra.py -g braker1_out/augustus.hints.gtf,braker2_out/augustus.hints.gtf -c default.cfg \
-e braker1_out/hintsfile.gff,braker2_out/hintsfile.gff \
-o braker1+2_combined.gtf
```
The combined gene prediction is ```braker1+2_combined.gtf```.

## Example
A small example is located at ```example/```. Run ```./example/run_prevco_example.sh``` to execute the example and to check if TSEBRA runs properly.
A small example is located at ```example/```. Run ```./example/run_prevco_example.sh``` to execute the example and to check if TSEBRA runs properly.

## Enforcing a gene set
A gene set can be enforced in the TSEBRA output with the `--keep_gtf` option, i.e. all of its transcripts are guaranteed to be included in the output. The transcripts of enforced gene sets are still compared to all gene sets and used to evaluate them.
Example:
```console
./bin/tsebra.py -g gene_set1,gene_set2 -c default.cfg \
-k enforced_set1,enforced_set2 -e hintsfile1.gff,braker2_out/hintsfile2.gff \
-o tsebra.gtf
```

## Filter single-exon genes out
In default mode, TSEBRA is conservative in filtering out single-exon genes. In some cases BRAKER predicts a lot of false positive single-exon genes. In these cases, it is recommended to run TSEBRA with the `--filter_single_exon_genes` option. In this mode, TSEBRA additionally filters out all single-exon genes that have no support by a start or stop codon hint.

## Print transcript scores
The transcript scores play a very important role in TSEBRA. They are used for the pairwise comparison of all transcript isoforms that have overlapping coding regions. You can print the scores as a table to a file with the option `--score_tab /path/to/output/file.tab`.

## Ignore Frame
By default, TSEBRA groups all transcript isoforms that have overlapping coding regions in the same open reading frame (phase column in gtf) as candidates of the same gene. However, in some cases, it might be desired to consider all transcripts with overlapping coding regions (regardless of the reading frame) as candidates for a gene. In this case, add the `--ignore_tx_phase` option to the TSEBRA command.

## Other scripts in the TSEBRA repository

### Renaming transcripts from a TSEBRA output
The IDs of the transcripts and genes in the TSEBRA output can be renamed such that the gene and transcript IDs match.
Genes and transcripts are numbered consecutively; for example, the second transcript of gene "g12" has the ID "g12.t2".
If a prefix is set then it will be added before all IDs, for example, the transcript ID is "dmel_g12.t2" if the prefix is set to "dmel".
Additionally, a translation table can be produced that provides the mapping from old to new transcript IDs.

Example for renaming ```tsebra_result.gtf```:
```console
./bin/rename_gtf.py --gtf tsebra_result.gtf --prefix dmel --translation_tab translation.tab --out tsebra_result_renamed.gtf
```
The arguments ```--prefix``` and ```--translation_tab``` are optional.
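
The naming scheme described above can be written down as a small Python function. This is a sketch of the ID pattern only, not the code of `rename_gtf.py`; the function name `new_transcript_id` is hypothetical.

```python
def new_transcript_id(gene_number, transcript_number, prefix=None):
    """Build an ID such as 'g12.t2', or 'dmel_g12.t2' if a prefix is given."""
    tx_id = "g{}.t{}".format(gene_number, transcript_number)
    if prefix:
        tx_id = "{}_{}".format(prefix, tx_id)
    return tx_id


print(new_transcript_id(12, 2))                 # g12.t2
print(new_transcript_id(12, 2, prefix="dmel"))  # dmel_g12.t2
```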

### Fixing the formatting issue of `braker.gtf`
A BRAKER run produces a second complete gene set named `braker.gtf`, besides the official output `augustus.hints.gtf`. The `braker.gtf` is the result of merging `augustus.hints.gtf` with some 'high-confidence' genes from the GeneMark prediction. However, the merging process leads to a formatting issue in `braker.gtf`.
A quick fix for this formatting issue is the script `fix_gtf_ids.py`, e.g.:
```console
./bin/fix_gtf_ids.py --gtf braker_out/braker.gtf --out braker1_fixed.gtf
```
Take note that `braker.gtf` and `fix_gtf_ids.py` haven't been tested sufficiently and there is no guarantee that this gene set is superior to `augustus.hints.gtf`.

### Getting the longest isoform of each gene locus from different gene sets
Combines multiple gene sets and reports the transcript with the longest coding region for each cluster of overlapping transcripts (one transcript per gene locus), e.g.
```console
./bin/get_longest_isoform.py --gtf gene_set1.gtf,gene_set2.gtf --out longest_isoforms.gtf
```
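The selection step can be sketched in Python as follows. This is an illustration under assumptions, not the code of `get_longest_isoform.py`: the cluster of overlapping transcripts is assumed to be already known, each transcript is represented by a hypothetical list of CDS intervals with 1-based inclusive coordinates as in gtf, and the transcript IDs are made up.

```python
def cds_length(cds_intervals):
    """Total coding length of a transcript; gtf coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in cds_intervals)


def longest_isoform(cluster):
    """Return the ID of the transcript with the longest coding region in a cluster.

    cluster maps transcript IDs to lists of (start, end) CDS intervals.
    """
    return max(cluster, key=lambda tx_id: cds_length(cluster[tx_id]))


# Made-up cluster of two overlapping transcripts from different gene sets.
cluster = {
    "set1_g1.t1": [(100, 199), (300, 349)],  # 100 + 50 = 150 bp
    "set2_g7.t1": [(100, 199), (300, 399)],  # 100 + 100 = 200 bp
}
print(longest_isoform(cluster))
```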

## Licence
All source code, i.e. `bin/*.py`, is under the Artistic License (see <https://opensource.org/licenses/Artistic-2.0>).
All source code, i.e. `bin/*.py`, is under the [Artistic License](bin/LICENSE.txt) (see <https://opensource.org/licenses/Artistic-2.0>).

## Citing TSEBRA
Gabriel, L., Hoff, K.J., Brůna, T. *et al.* TSEBRA: transcript selector for BRAKER. *BMC Bioinformatics* **22**, 566 (2021). https://doi.org/10.1186/s12859-021-04482-0

## References
<b id="ref1">[1]</b> Hoff, Katharina J, Simone Lange, Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke. 2015. “BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS.” *Bioinformatics* 32 (5). Oxford University Press: 767--69.[↑](#a1)