Different expected and conceptual translations for some partial CDS #340

tpillone · 2024-10-17T09:41:12Z

Hi,

I'm trying to submit embl flatfiles generated by bakta to the EBI but I am confronted with multiple errors related to the translation of some partial CDS. It seems that some partial CDS have unexpected translated sequences. The message raise by the webin validation tool is:

ERROR: Expected and conceptual translations are different. [ line: 167841 of 4_GEN12896.embl.gz]

If I translate those CDS with biopython, I systematically get one extra amino acid as compared to the translation reported by bakta. Here are few examples (from this embl file):

contig_113 <0:1356
biopython LRNVESDVDALFAKTDITIGQSLTLLNNEITKFVGEAGKGSGAAQVLAGSVQTLASNLDLIADGALVVGIGYITRAILMKSAAIKEGMASTLASRQASVLNAQAEYAEATAALNAAKAHLANVRATNAETQAKFGATAAATRYAQAQAAVTAATNAQTAAQIKLNTATSIAGRLAKGAFGLIGGWAGVATLGVMGLAAAYSYFNNKAEGAKQKLAEQAKVAEKADEELKKLTGNDKAKAVNDLTTAFNAQNKALEKSSRAVGSALIDIENYARGNREVEKISQEARTGTISYTEAIERLNKIKLPTDLYENLKKQAAQYDDNASKASLSAEKLKLLRVEVKLGGNEAQNAAIQHQKQADALGNTATEAEKATKALQDYQAKQKDSVIDSIYKSGWLDKGYTVAQANAILELQKAKGMSAILSKDEIDSALRNLKIIEEQQEREDKLTEAKRK
bakta LRNVESDVDALFAKTDITIGQSLTLLNNEITKFVGEAGKGSGAAQVLAGSVQTLASNLDLIADGALVVGIGYITRAILMKSAAIKEGMASTLASRQASVLNAQAEYAEATAALNAAKAHLANVRATNAETQAKFGATAAATRYAQAQAAVTAATNAQTAAQIKLNTATSIAGRLAKGAFGLIGGWAGVATLGVMGLAAAYSYFNNKAEGAKQKLAEQAKVAEKADEELKKLTGNDKAKAVNDLTTAFNAQNKALEKSSRAVGSALIDIENYARGNREVEKISQEARTGTISYTEAIERLNKIKLPTDLYENLKKQAAQYDDNASKASLSAEKLKLLRVEVKLGGNEAQNAAIQHQKQADALGNTATEAEKATKALQDYQAKQKDSVIDSIYKSGWLDKGYTVAQANAILELQKAKGMSAILSKDEIDSALRNLKIIEEQQEREDKLTEAKR
contig_127 <0:831
biopython LQKQQELGLLKLAQEQRLFQAEQFMLGEMERIKKRYALEYDEISKITDLEERRRKMSAFQADFIRNGVGNPTIDQYDTSSQFLKSTNYTKPKQTNMQVLDEDYAQTYQKLKDNLAAVLESEKASYQERLEAERVFKEARQQMDNEYHLKAIDARKADHDSQLQLYSQMISSASSTWGGLTQIVKDARGENSRSFKAMFIAQQSFAIASAIISAHLAATQVAADATIPFFGAKIAASTAMLAMGYANAGLIAGQTIAGFSDGGFTGSGGKYQPAGIVH
bakta LQKQQELGLLKLAQEQRLFQAEQFMLGEMERIKKRYALEYDEISKITDLEERRRKMSAFQADFIRNGVGNPTIDQYDTSSQFLKSTNYTKPKQTNMQVLDEDYAQTYQKLKDNLAAVLESEKASYQERLEAERVFKEARQQMDNEYHLKAIDARKADHDSQLQLYSQMISSASSTWGGLTQIVKDARGENSRSFKAMFIAQQSFAIASAIISAHLAATQVAADATIPFFGAKIAASTAMLAMGYANAGLIAGQTIAGFSDGGFTGSGGKYQPAGIV
contig_129 <1:808
biopython TQEIEKQAKLTKRLVGISGQSGIGTGPHLDVRYGGSLSGQKVSNEHLARLQAGGKPLTSYKISSNYGPRKAPTKGASSFHKGIDFSMPEGTPITTNVAVKDIKTWYDSKGGGYVSEVIFEDGVSLKLLHQSPKMQSKVKGGASKGSDKAAGDIQSQLERQQDLQRSLENEVASEVGRINNNRKARLEDVDKANFSPERTAEIKAEINRRADNDIAIAKQALRTKLEDYKEFQKTEEQLLEESFNRKKFNAAHDLELSKFEQKQAVELLE
bakta TQEIEKQAKLTKRLVGISGQSGIGTGPHLDVRYGGSLSGQKVSNEHLARLQAGGKPLTSYKISSNYGPRKAPTKGASSFHKGIDFSMPEGTPITTNVAVKDIKTWYDSKGGGYVSEVIFEDGVSLKLLHQSPKMQSKVKGGASKGSDKAAGDIQSQLERQQDLQRSLENEVASEVGRINNNRKARLEDVDKANFSPERTAEIKAEINRRADNDIAIAKQALRTKLEDYKEFQKTEEQLLEESFNRKKFNAAHDLELSKFEQKQAVELL
contig_133 <2:638
biopython QNWGGIQADMNGTGEFFRQDQERFSRLNAANDLADSQFAATDLNEQNSLDGLNAQFEAGLIKQQDYENQKTAIIQAAQDQRNQIAAEYAQNAQDIEDKYQQDRLNTIIAFGGNMMGSLTSMFGSMFGEQSKAYKIMFAADKAYAIAAAGIAIQQNIAAASKVGFPLNLPLIAGAVAQGASIIANIRAIKDQGFAEGGYTGRGGKYEVAGAVH
bakta QNWGGIQADMNGTGEFFRQDQERFSRLNAANDLADSQFAATDLNEQNSLDGLNAQFEAGLIKQQDYENQKTAIIQAAQDQRNQIAAEYAQNAQDIEDKYQQDRLNTIIAFGGNMMGSLTSMFGSMFGEQSKAYKIMFAADKAYAIAAAGIAIQQNIAAASKVGFPLNLPLIAGAVAQGASIIANIRAIKDQGFAEGGYTGRGGKYEVAGAV

Biopython code to get the translation:

trans = feature.translate(record.seq, cds=False, stop_symbol="")

This genome was annotated with bakta 1.9.4 with but I had similar issues with older versions.
They are many other partial CDS that do not exhibit this problem

Any idea what might be causing this problem?

Best,
Trestan

The text was updated successfully, but these errors were encountered:

oschwengers · 2024-10-18T09:23:03Z

Hi Trestan, thanks for reporting.
This looks both odd and interesting, and I'd like to take a deeper look at this. Could you please run this setting the --debug flag (if not set already) and provide also the log and json file?

tpillone · 2024-10-24T12:47:25Z

Here are the log and json files in debug mode:

GEN12896.json
GEN12896.log

Thank you,
Trestan

oschwengers · 2024-10-29T11:38:58Z

OK, I took a deeper look into this. Indeed, for partial CDS that are truncated at both sides, the trailing amino acids is accidentally clipped as this is perceived as a trailing * which normally encodes a stop codon. Of course, in these cases, there is no such stop codon which causes the too short aa sequences.

I think this is a crucial bug, but occurs only in such rare edge cases that for most applications might not be that relevant (at least I hope so), as most analyses probably do not take into account annotated protein sequences that are truncated at both sides, anyway.

I committed a fix which will be available soon with the upcoming version 1.10.0.

Again, thank you very much for reporting this! I'll close this for now, but please do not hesitate to re-open it if you have any further questions or feedback.

tpillone · 2024-11-05T22:00:30Z

Hi @oschwengers
Thank you for the fix!
Looking more closely at those genomes, I finally understood why this rare edge case happened several times. It turns out that I re-annotated fasta files previously annotated with an older version of Bakta with a wrong --complete flag. Bakta fasta headers had the following format:
>contig_1 [completeness=complete] [topology=linear] [gcode=11]
As Bakta considers sequences as circular when the keyword "complete" is detected in the fasta header (

bakta/bakta/utils.py

Lines 406 to 408 in 518253a

    
           elif('complete' in sequence_description and 'complete=false' not in sequence_description):  # detection of public/described sequences 
        
               seq['complete'] = True 
        
               seq['topology'] = bc.TOPOLOGY_CIRCULAR

), all contigs were considered as circular, leading to the identification or a lot of CDS spanning contigs ends, including CDS truncated on both ends.
The origin of that specific problem was the wrong "`--complete" flag, but chromosomes and plasmids can be linear, is it really safe to try to guess the topology from the content of the header?
Best

tpillone added the bug Something isn't working label Oct 17, 2024

oschwengers added a commit that referenced this issue Oct 29, 2024

fix wrong stop codon clipping in partial CDS #340

3884268

oschwengers added this to the v1.10.0 milestone Oct 29, 2024

oschwengers self-assigned this Oct 29, 2024

oschwengers closed this as completed Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Different expected and conceptual translations for some partial CDS #340

Different expected and conceptual translations for some partial CDS #340

tpillone commented Oct 17, 2024

oschwengers commented Oct 18, 2024 •

edited

Loading

tpillone commented Oct 24, 2024 •

edited

Loading

oschwengers commented Oct 29, 2024

tpillone commented Nov 5, 2024

Different expected and conceptual translations for some partial CDS #340

Different expected and conceptual translations for some partial CDS #340

Comments

tpillone commented Oct 17, 2024

oschwengers commented Oct 18, 2024 • edited Loading

tpillone commented Oct 24, 2024 • edited Loading

oschwengers commented Oct 29, 2024

tpillone commented Nov 5, 2024

oschwengers commented Oct 18, 2024 •

edited

Loading

tpillone commented Oct 24, 2024 •

edited

Loading