Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
Add to GFF explanation
  • Loading branch information
elileka authored Jun 11, 2024
1 parent 5f39b6a commit de5a267
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -175,7 +175,7 @@ Of note, for simplicity, MetaEuk considers only ATG as a start for this scan.

##### The MetaEuk GFF:

In addition to writing a Fasta file, MetaEuk writes a GFF file. Please note that GFF is not perfectly suitable for MetaEuk because MetaEuk doesn't predict non-coding regions. This means that the MetaEuk gene starts and ends where the first and last codons could be matched. The gene and mRNA categories are the same in the MetaEuk GFF. The exon and CDS coordinates will be the same unless a small target overlap was allowed, due to which, the MetaEuk exon was shortened (see above). In this case, the CDS will report the shortening. In the sixth column you can find their individual bitsocres. The contig index starts at 1 and the start coordinate is always smaller than the end coordinate, as required by GFF. The last column contains the **TCS** identifier, followed by the low_coord of the prediction to support searching for sub-optimal exon sets (see section). Here is an example where a MetaEuk header of two exons is reported in GFF format:
In addition to writing a Fasta file, MetaEuk writes a GFF file. Please note that GFF is not perfectly suitable for MetaEuk because MetaEuk doesn't predict non-coding regions. This means that by default the MetaEuk gene starts and ends where the first and last codons could be matched (or slightly padded if `--len-scan-for-start` is set to be positive, see section). The gene and mRNA categories are the same in the MetaEuk GFF (if `--len-scan-for-start` is set to be positive, these fields will reflect the padding, as explained). The exon and CDS coordinates will be the same unless a small target overlap was allowed, due to which, the MetaEuk exon was shortened (see above). In this case, the CDS will report the shortening. In the sixth column you can find their individual bitsocres. Unlike MetaEuk's native report in the Fasta header, the contig index starts at 1 and the start coordinate is always smaller than the end coordinate, as required by GFF. The last column contains the **TCS** identifier, followed by the low_coord of the prediction to support searching for sub-optimal exon sets (see section). Here is an example where a MetaEuk header of two exons is reported in GFF format:

*>protein_acc|contig_acc|-|508|1.15e-150|2|100|911|911[911]:582[582]:330[330]|501[501]:100[100]:402[402]*

Expand Down

0 comments on commit de5a267

Please sign in to comment.