Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different expected and conceptual translations for some partial CDS #340

Closed
tpillone opened this issue Oct 17, 2024 · 4 comments
Closed
Assignees
Labels
bug Something isn't working
Milestone

Comments

@tpillone
Copy link

Hi,

I'm trying to submit embl flatfiles generated by bakta to the EBI but I am confronted with multiple errors related to the translation of some partial CDS. It seems that some partial CDS have unexpected translated sequences. The message raise by the webin validation tool is:

ERROR: Expected and conceptual translations are different. [ line: 167841 of 4_GEN12896.embl.gz]

If I translate those CDS with biopython, I systematically get one extra amino acid as compared to the translation reported by bakta. Here are few examples (from this embl file):

contig_113 <0:1356
biopython LRNVESDVDALFAKTDITIGQSLTLLNNEITKFVGEAGKGSGAAQVLAGSVQTLASNLDLIADGALVVGIGYITRAILMKSAAIKEGMASTLASRQASVLNAQAEYAEATAALNAAKAHLANVRATNAETQAKFGATAAATRYAQAQAAVTAATNAQTAAQIKLNTATSIAGRLAKGAFGLIGGWAGVATLGVMGLAAAYSYFNNKAEGAKQKLAEQAKVAEKADEELKKLTGNDKAKAVNDLTTAFNAQNKALEKSSRAVGSALIDIENYARGNREVEKISQEARTGTISYTEAIERLNKIKLPTDLYENLKKQAAQYDDNASKASLSAEKLKLLRVEVKLGGNEAQNAAIQHQKQADALGNTATEAEKATKALQDYQAKQKDSVIDSIYKSGWLDKGYTVAQANAILELQKAKGMSAILSKDEIDSALRNLKIIEEQQEREDKLTEAKRK
bakta LRNVESDVDALFAKTDITIGQSLTLLNNEITKFVGEAGKGSGAAQVLAGSVQTLASNLDLIADGALVVGIGYITRAILMKSAAIKEGMASTLASRQASVLNAQAEYAEATAALNAAKAHLANVRATNAETQAKFGATAAATRYAQAQAAVTAATNAQTAAQIKLNTATSIAGRLAKGAFGLIGGWAGVATLGVMGLAAAYSYFNNKAEGAKQKLAEQAKVAEKADEELKKLTGNDKAKAVNDLTTAFNAQNKALEKSSRAVGSALIDIENYARGNREVEKISQEARTGTISYTEAIERLNKIKLPTDLYENLKKQAAQYDDNASKASLSAEKLKLLRVEVKLGGNEAQNAAIQHQKQADALGNTATEAEKATKALQDYQAKQKDSVIDSIYKSGWLDKGYTVAQANAILELQKAKGMSAILSKDEIDSALRNLKIIEEQQEREDKLTEAKR
contig_127 <0:831
biopython LQKQQELGLLKLAQEQRLFQAEQFMLGEMERIKKRYALEYDEISKITDLEERRRKMSAFQADFIRNGVGNPTIDQYDTSSQFLKSTNYTKPKQTNMQVLDEDYAQTYQKLKDNLAAVLESEKASYQERLEAERVFKEARQQMDNEYHLKAIDARKADHDSQLQLYSQMISSASSTWGGLTQIVKDARGENSRSFKAMFIAQQSFAIASAIISAHLAATQVAADATIPFFGAKIAASTAMLAMGYANAGLIAGQTIAGFSDGGFTGSGGKYQPAGIVH
bakta LQKQQELGLLKLAQEQRLFQAEQFMLGEMERIKKRYALEYDEISKITDLEERRRKMSAFQADFIRNGVGNPTIDQYDTSSQFLKSTNYTKPKQTNMQVLDEDYAQTYQKLKDNLAAVLESEKASYQERLEAERVFKEARQQMDNEYHLKAIDARKADHDSQLQLYSQMISSASSTWGGLTQIVKDARGENSRSFKAMFIAQQSFAIASAIISAHLAATQVAADATIPFFGAKIAASTAMLAMGYANAGLIAGQTIAGFSDGGFTGSGGKYQPAGIV
contig_129 <1:808
biopython TQEIEKQAKLTKRLVGISGQSGIGTGPHLDVRYGGSLSGQKVSNEHLARLQAGGKPLTSYKISSNYGPRKAPTKGASSFHKGIDFSMPEGTPITTNVAVKDIKTWYDSKGGGYVSEVIFEDGVSLKLLHQSPKMQSKVKGGASKGSDKAAGDIQSQLERQQDLQRSLENEVASEVGRINNNRKARLEDVDKANFSPERTAEIKAEINRRADNDIAIAKQALRTKLEDYKEFQKTEEQLLEESFNRKKFNAAHDLELSKFEQKQAVELLE
bakta TQEIEKQAKLTKRLVGISGQSGIGTGPHLDVRYGGSLSGQKVSNEHLARLQAGGKPLTSYKISSNYGPRKAPTKGASSFHKGIDFSMPEGTPITTNVAVKDIKTWYDSKGGGYVSEVIFEDGVSLKLLHQSPKMQSKVKGGASKGSDKAAGDIQSQLERQQDLQRSLENEVASEVGRINNNRKARLEDVDKANFSPERTAEIKAEINRRADNDIAIAKQALRTKLEDYKEFQKTEEQLLEESFNRKKFNAAHDLELSKFEQKQAVELL
contig_133 <2:638
biopython QNWGGIQADMNGTGEFFRQDQERFSRLNAANDLADSQFAATDLNEQNSLDGLNAQFEAGLIKQQDYENQKTAIIQAAQDQRNQIAAEYAQNAQDIEDKYQQDRLNTIIAFGGNMMGSLTSMFGSMFGEQSKAYKIMFAADKAYAIAAAGIAIQQNIAAASKVGFPLNLPLIAGAVAQGASIIANIRAIKDQGFAEGGYTGRGGKYEVAGAVH
bakta QNWGGIQADMNGTGEFFRQDQERFSRLNAANDLADSQFAATDLNEQNSLDGLNAQFEAGLIKQQDYENQKTAIIQAAQDQRNQIAAEYAQNAQDIEDKYQQDRLNTIIAFGGNMMGSLTSMFGSMFGEQSKAYKIMFAADKAYAIAAAGIAIQQNIAAASKVGFPLNLPLIAGAVAQGASIIANIRAIKDQGFAEGGYTGRGGKYEVAGAV

Biopython code to get the translation:

trans = feature.translate(record.seq, cds=False, stop_symbol="")

  • This genome was annotated with bakta 1.9.4 with but I had similar issues with older versions.
  • They are many other partial CDS that do not exhibit this problem

Any idea what might be causing this problem?

Best,
Trestan

@tpillone tpillone added the bug Something isn't working label Oct 17, 2024
@oschwengers
Copy link
Owner

oschwengers commented Oct 18, 2024

Hi Trestan, thanks for reporting.
This looks both odd and interesting, and I'd like to take a deeper look at this. Could you please run this setting the --debug flag (if not set already) and provide also the log and json file?

@tpillone
Copy link
Author

tpillone commented Oct 24, 2024

Here are the log and json files in debug mode:

GEN12896.json
GEN12896.log

Thank you,
Trestan

@oschwengers oschwengers added this to the v1.10.0 milestone Oct 29, 2024
@oschwengers oschwengers self-assigned this Oct 29, 2024
@oschwengers
Copy link
Owner

OK, I took a deeper look into this. Indeed, for partial CDS that are truncated at both sides, the trailing amino acids is accidentally clipped as this is perceived as a trailing * which normally encodes a stop codon. Of course, in these cases, there is no such stop codon which causes the too short aa sequences.

I think this is a crucial bug, but occurs only in such rare edge cases that for most applications might not be that relevant (at least I hope so), as most analyses probably do not take into account annotated protein sequences that are truncated at both sides, anyway.

I committed a fix which will be available soon with the upcoming version 1.10.0.

Again, thank you very much for reporting this! I'll close this for now, but please do not hesitate to re-open it if you have any further questions or feedback.

@tpillone
Copy link
Author

tpillone commented Nov 5, 2024

Hi @oschwengers
Thank you for the fix!
Looking more closely at those genomes, I finally understood why this rare edge case happened several times. It turns out that I re-annotated fasta files previously annotated with an older version of Bakta with a wrong --complete flag. Bakta fasta headers had the following format:
>contig_1 [completeness=complete] [topology=linear] [gcode=11]
As Bakta considers sequences as circular when the keyword "complete" is detected in the fasta header (

bakta/bakta/utils.py

Lines 406 to 408 in 518253a

elif('complete' in sequence_description and 'complete=false' not in sequence_description): # detection of public/described sequences
seq['complete'] = True
seq['topology'] = bc.TOPOLOGY_CIRCULAR
), all contigs were considered as circular, leading to the identification or a lot of CDS spanning contigs ends, including CDS truncated on both ends.
The origin of that specific problem was the wrong "`--complete" flag, but chromosomes and plasmids can be linear, is it really safe to try to guess the topology from the content of the header?
Best

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants