Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error on GenBank file with CompoundLocation overlapping sequence origin #324

Closed
aekazakov opened this issue Sep 19, 2024 · 4 comments · Fixed by #332
Closed

Error on GenBank file with CompoundLocation overlapping sequence origin #324

aekazakov opened this issue Sep 19, 2024 · 4 comments · Fixed by #332
Assignees
Labels
bug Something isn't working
Milestone

Comments

@aekazakov
Copy link

Hi! Thank you for the great tool!

On a genome that has a coding gene overlapping circular sequence origin, Bakta terminates with error.

I run Bakta version 1.9.4 installed with conda, database version 5.1.

The command executed is:
bakta --debug --db /mnt/data/ref/Bakta/v5.1/db --output test_bakta --prefix NZ_CP0128315.bakta --threads 8 --regions sequence.gb sequence.fasta

Input files sequence.gb and sequence.fasta for the NCBI sequence NZ_CP012831.1 were downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP012831.1 and contain full Genbank record and FASTA-formatted nucleotide sequence.

This sequence has a gene and CDS features overlapping sequence origin:

     gene            complement(join(7118734..7119102,1..405))
                     /locus_tag="AO356_RS00005"
                     /old_locus_tag="AO356_00005"
     CDS             complement(join(7118734..7119102,1..405))
                     /locus_tag="AO356_RS00005"
                     /old_locus_tag="AO356_00005"
                     /EC_number="[1.1.1.14](https://enzyme.expasy.org/EC/1.1.1.14)"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_004415564.1"
                     /GO_function="[GO:0016491 - oxidoreductase activity](http://amigo.geneontology.org/amigo/term/GO:0016491)
                     [Evidence IEA]"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=[11](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG11)
                     /product="L-iditol 2-dehydrogenase"
                     /protein_id="[WP_013692788.1](https://www.ncbi.nlm.nih.gov/protein/503458127)"
                     /translation="MKRLEGKSALITGSARGIGRSFAQAYIGEGATVAIADINLERAQ
                     ATASELGPQAYAVEMDVTRQASIDAAIAEVVAHAGKLDILVNNAALFDLAPITEITRE
                     SYDKLFAINVSGTLFTLQAAAKQMISQGHGGKIINMASQAGRRGEALVGVYCATKAAV
                     ISLTQSAGLNLIKHKINVNAIAPGVVDGEHWDGVDALFAKHENRPLGEKKRLVGLEVP
                     YGRMGTAEDLVGMAIFLAGSESDYIVAQTYNVDGGNWMS"

Error message from NZ_CP0128315.bakta.log is:


11:32:33.190 - DEBUG - MAIN - revise translational exceptions
11:32:33.202 - DEBUG - MAIN - import user-provided CDS regions
11:32:34.399 - ERROR - CDS - user-provided CDS: CDS could not be translated into a valid amino acid sequence! contig=contig_1, start=1, stop=7119102, cds=ATCAACATGG...[7119092 symbols deleted]
11:32:34.428 - ERROR - CDS - user-provided CDS: regions/features file GenBank format not valid!
Traceback (most recent call last):
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/bakta/features/cds.py", line 290, in import_user_cdss
    aa = str(Seq(nt).translate(table=cfg.translation_table, cds=True))
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/Bio/Seq.py", line 1620, in translate
    _translate_str(str(self), table, stop_symbol, to_stop, cds, gap=gap)
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/Bio/Seq.py", line 2871, in _translate_str
    raise CodonTable.TranslationError(
Bio.Data.CodonTable.TranslationError: Final codon 'ATC' is not a stop codon

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/bakta/features/cds.py", line 295, in import_user_cdss
    raise ValueError(f"User-provided CDS could not be translated into a valid amino acid sequence! contig={user_cds['contig']}, start={user_cds['start']}, stop={user_cds['stop']}, cds={nt}")
ValueError: User-provided CDS could not be translated into a valid amino acid sequence! contig=contig_1, start=1, stop=7119102, cds=ATCAACATGG...[7119092 symbols deleted]

This error occurs because the CDS sequence extracted by extract_feature_sequence (bakta/utils.py) is wrong. It is 7.1 Mbp long instead of 774 bp.

@aekazakov aekazakov added the bug Something isn't working label Sep 19, 2024
@oschwengers oschwengers self-assigned this Sep 20, 2024
@oschwengers oschwengers added this to the v1.10.0 milestone Sep 20, 2024
@oschwengers
Copy link
Owner

Hi @aekazakov and thanks for the detailed report!
I'll have a look at it and try to include this into the upcoming v1.10.0 release.

@oschwengers
Copy link
Owner

OK, this should be fixed by #332.

@aekazakov If you like, and feel comfortable with being a very-early-tester, you could checkout the fix-edge-user-proteins branch at https://github.com/oschwengers/bakta/tree/fix-edge-user-proteins and give it a try.

@oschwengers oschwengers linked a pull request Oct 2, 2024 that will close this issue
@aekazakov
Copy link
Author

I tested the fix-edge-user-proteins branch with two RefSeq genomes (NZ_CP012831.1 and NZ_CP015511.1) that have gene and CDS features overlapping sequence origin on - and + strand, respectively.
For both genomes, the GenBank files were imported without errors, and the genes overlapping origin were reported correctly in Bakta output files.
Thank you for fixing this bug!

@oschwengers
Copy link
Owner

Thanks for testing and reporting back. With that, I'll close this. If there are any further things to discuss or issues arising from this, please do not hesitate to re-open this or a new one. Thanks.

oschwengers added a commit that referenced this issue Oct 3, 2024
* fix user-provided edge cds #324
* adopt overlap filter for user-provided edge cds #324
* refactor edge cds creation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants