Exon coordinates on negative strand #13

Closed
bio-mmanni opened this issue Nov 3, 2020 · 3 comments

bio-mmanni commented Nov 3, 2020

Hi,

We're integrating MetaEuk into BUSCO. While screening the coordinates of the resulting genes, I noticed a possible issue with how the coordinates are reported when there are overlapping exons on the negative strand.

Here is an example of a predicted protein from MetaEuk on the - strand:

72245at7147_8|CH478315.1|-|262|6.275e-72|3|49869|50373|50373[50373]:50308[50308]:66[66]|50242[50242]:50054[50054]:189[189]|50027[50063]:49869[49869]:159[123]
MANKKVDFDSLVPIEPDSAPNKGIVLFGKDLSQIPCFRNSFLYGISIGIGVGFLAFMKTSRPQLSSHIGFGTFCGTVFCYWFPCRIRYKWSKDEKEAEVLKRLMQQQVMYEGTEKERELDRKAESA

From the doc: “The exon_coords are of the structure: low[taken_low]:high[taken_high]:nucleotide_length[taken_nucleotide_length]
Since MetaEuk allows for a very short overlap on T of two putative exons (see P2 and P3 in the illustration below), when joining the sequences of the exons, one of them is shortened. The coordinates of the codons taken from this exon will be in the square brackets ([taken_low], [taken_high] and [taken_nucleotide_length]).”
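
For anyone who wants to inspect these fields programmatically, here is a minimal Python sketch that pulls the exon_coords entries out of a header like the one above. The names `Exon`, `EXON_RE` and `parse_exons` are made up for illustration and are not part of MetaEuk. The tuple fields are called `first`/`second` rather than low/high because, in this minus-strand example, the first number of each entry is the larger coordinate.

```python
import re
from typing import NamedTuple


class Exon(NamedTuple):
    # Hypothetical field names; values in brackets in the header are the "taken" ones.
    first: int
    taken_first: int
    second: int
    taken_second: int
    length: int
    taken_length: int


# Matches one exon_coords entry, e.g. 50027[50063]:49869[49869]:159[123]
EXON_RE = re.compile(r"(\d+)\[(\d+)\]:(\d+)\[(\d+)\]:(\d+)\[(\d+)\]")


def parse_exons(header: str) -> list[Exon]:
    """Return one Exon per '|'-separated field that looks like exon_coords."""
    exons = []
    for field in header.lstrip(">").split("|"):
        match = EXON_RE.fullmatch(field)
        if match:
            exons.append(Exon(*map(int, match.groups())))
    return exons


header = (
    "72245at7147_8|CH478315.1|-|262|6.275e-72|3|49869|50373|"
    "50373[50373]:50308[50308]:66[66]|"
    "50242[50242]:50054[50054]:189[189]|"
    "50027[50063]:49869[49869]:159[123]"
)

for exon in parse_exons(header):
    print(exon)
```

Fields that do not match the exon pattern (identifiers, strand, score and so on) are simply skipped, so the sketch does not depend on their exact order.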

But according to the coordinates in the brackets, the last two exons overlap: exon 2 ends at 50054 but exon 3 starts at 50063. The length of the shortened exon, “[123]” in the header, does not match the value you get from the reported coordinates: 50063 - 49869 = 194.
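
The same check can be written as a small Python sketch, using the bracketed values copied from the header above. It assumes the coordinates are inclusive, which the first two exons support (50373 - 50308 + 1 = 66 and 50242 - 50054 + 1 = 189). Under that assumption the span of exon 3 is 195 rather than the plain difference of 194; either way it is far from 123. The helper name `span` is hypothetical.

```python
def span(a: int, b: int) -> int:
    """Inclusive length of the interval between two coordinates."""
    return abs(a - b) + 1


# (taken_first, taken_second, taken_length) for each exon, from the header
taken = [
    (50373, 50308, 66),
    (50242, 50054, 189),
    (50063, 49869, 123),
]

# 1) Does the reported length agree with the reported coordinates?
for i, (first, second, length) in enumerate(taken, start=1):
    status = "ok" if span(first, second) == length else "MISMATCH"
    print(f"exon {i}: span={span(first, second)} reported={length} {status}")

# 2) On the minus strand each exon should begin below the end of the previous one.
overlaps = []
for i in range(1, len(taken)):
    prev_end = taken[i - 1][1]
    this_start = taken[i][0]
    if this_start >= prev_end:
        overlaps.append((i, i + 1, this_start - prev_end + 1))

print(overlaps)  # [(2, 3, 10)]: exons 2 and 3 share 10 positions
```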

Nevertheless, the protein sequence and CDS seem to be correct when I compare them to the reference from which they were predicted. The length of the third exon is 123 bp, which corresponds to the length reported in the header.

Indeed, when I search the original scaffold with the predicted CDS using BLAST, the start coordinate of the third exon appears to be shifted by several bases with respect to what is reported in the header, e.g.:

                                          len   exon_end   exon_start
Exon3   CH478315.1   Query_52157   100.000   123   49869      49991
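
As a quick arithmetic cross-check of the BLAST hit against the header, here is a short Python sketch. It again assumes inclusive coordinates; the variable names are made up for illustration. It only looks at the numbers and says nothing about how MetaEuk computes them internally.

```python
# Exon 3 values from the header: 50027[50063]:49869[49869]:159[123]
untaken_first = 50027
reported_taken_first = 50063
second = 49869
taken_length = 123

# Start implied by the taken length on the minus strand (inclusive coordinates)
expected_taken_first = second + taken_length - 1
print(expected_taken_first)  # 49991, the exon_start found by BLAST

# Number of bases trimmed from the exon: 159 - 123
trim = untaken_first - expected_taken_first
print(trim)  # 36

# The header value lies the same distance away, but in the opposite direction
print(reported_taken_first - untaken_first)  # 36
```

So the BLAST start of 49991 is exactly what the reported length of 123 implies, while the header value 50063 sits 36 bases on the other side of the untaken start 50027.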

So it’s likely that the problem is only affecting the coordinates in the header and not the predicted sequences.

Could you have a look into this?
Many thanks!

(I’m using metaeuk Version: e7e2d95)

elileka (Member) commented Dec 3, 2020

Hello,

This looks like a problem in how MetaEuk produces the header for this case.
Could you please send me the sequences of your contig (CH478315.1) and of the reference protein (72245at7147_8)?
I apologize in advance - it may take a while - I am on maternity leave.

elileka (Member) commented Dec 3, 2020

I managed to reproduce this on a dummy example and fix it (as of commit f32e8d). Please let me know if it still gives you trouble. Thank you for the feedback!

bio-mmanni (Author) commented

Hi,
Yes, now it works as expected.
Thanks!
