When using --proteins, are the sequences only used for annotation? #216

Closed
marade opened this issue Jun 9, 2023 · 10 comments
Labels: feature, question (Further information is requested)

Comments

@marade

marade commented Jun 9, 2023

Greetings. When running Bakta, if Prodigal scores an ORF too low and rejects it, but the correct sequence has been included with --proteins, will Bakta find and annotate the sequence? It appears the answer is no, but perhaps I'm missing something. I'd like to be able to find and annotate some ORFs that Prodigal is rejecting with a score that's too low.

@oschwengers
Owner

Hi @marade,
no, Bakta works on the genes predicted de novo by Prodigal/Pyrodigal and then applies several structural filters (overlaps with other feature types, AntiFam, etc.) to discard some false positives. It does not consider lower-scoring candidates from Prodigal, although I see the point, and this might be interesting in some cases.

What Bakta does do is extract short ORFs (<=30 aa) itself. However, a search against user-provided proteins is not yet implemented for sORFs. So if you're looking for those, I could add this to my list.

@marade
Author

marade commented Jun 12, 2023

Unfortunately my protein, which is in GenBank in two different forms, is 172 aa or 184 aa long, so it wouldn't qualify as a sORF. When I run Prodigal with '-s' the Total score for the ORF is negative, and judging by the accepted ORFs the Total needs to be positive. Any other ideas on how I might work around this issue?

@marade
Author

marade commented Jun 12, 2023

Somewhat related, I am noticing that Pyrodigal looks like a nice drop-in replacement for Prodigal. You might consider using it for additional flexibility and efficiency.

https://pypi.org/project/pyrodigal/

@marade
Author

marade commented Jun 12, 2023

Oh sorry, looks like you're already using Pyrodigal. So much for that idea!

@oschwengers
Owner

If you're just interested in these special protein sequences and not in the entire genome, you could predict them using Py/Prodigal and then annotate them using bakta_proteins: https://github.com/oschwengers/bakta#protein-bulk-annotation

@marade
Author

marade commented Jun 14, 2023

I'm focused on whole genomes, so would need something for those. I suppose this is turning into a feature request? Ideally what I'd like is to be able to add some genes to the cdss variable after this line in main.py:

cdss = feat_cds.predict(genome, contigs_path)

Another, perhaps more difficult, approach would be to try to get Pyrodigal to accept lower scoring putative ORFs. I've already tried training my own Prodigal model, but that didn't get my genes recognized and there are no easy ways to adjust the training other than by feeding it different sequences.
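
The first idea above (adding genes to `cdss` right after prediction) might look roughly like the sketch below. It is only an illustration: the function name `merge_trusted_cdss` is hypothetical, and the dictionary keys (`contig`, `start`, `stop`, `strand`) are assumed for the example rather than taken from Bakta's actual internal data model.

```python
def merge_trusted_cdss(predicted, trusted):
    """Return predicted CDSs plus any trusted CDSs not already among them.

    Both arguments are lists of dicts; the keys used here are assumptions
    made for this sketch, not Bakta's confirmed feature layout.
    """
    def key(cds):
        return (cds["contig"], cds["start"], cds["stop"], cds["strand"])

    seen = {key(cds) for cds in predicted}
    merged = list(predicted)
    for cds in trusted:
        if key(cds) not in seen:  # skip genes the predictor already found
            merged.append(cds)
            seen.add(key(cds))
    # Keep features ordered by contig and start coordinate.
    merged.sort(key=lambda c: (c["contig"], c["start"]))
    return merged


# Illustrative, made-up coordinates.
predicted = [
    {"contig": "contig_1", "start": 100, "stop": 400, "strand": "+"},
    {"contig": "contig_1", "start": 900, "stop": 1500, "strand": "-"},
]
trusted = [
    {"contig": "contig_1", "start": 450, "stop": 968, "strand": "+"},  # rejected ORF
    {"contig": "contig_1", "start": 100, "stop": 400, "strand": "+"},  # already predicted
]
cdss = merge_trusted_cdss(predicted, trusted)
```

A real implementation would also have to decide what happens when a trusted gene overlaps a predicted one without matching it exactly; this sketch only removes exact duplicates.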

@oschwengers
Owner

I totally see your point. However, I think this would require some fairly complex changes to both the UI and the internal logic. Given that this is a rather rare use case, I don't think we have sufficient resources right now to implement it. Maybe it would be easier to add this specific annotation manually to your genome?

@marade
Author

marade commented Jun 15, 2023

I did consider something like that. For example, tools like GFF-Adder exist:

https://github.com/NickJD/ORForise#gff-adder

But I really need the genes added to all the annotation files: JSON, FNA, FFN, GFF3, etc. What if I coded it and submitted a pull request? The design might be as follows:

  • Add --trusted-proteins flag.
  • The new flag would accept a tab-delimited input file (or perhaps JSON?) that can be read into a data structure that can be run through create_cdss(genes, contig) in bakta/features/cds.py.
  • Then we simply add a line to inject the trusted genes in main.py (as mentioned above) if the flag has been used.

I feel I could code this kind of design fairly quickly. What do you think?
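
As a rough sketch of the input-parsing step in this design, the tab-delimited file could be read as shown below. The column layout (contig, start, stop, strand, product) and the function name `read_trusted_genes` are assumptions made for illustration; the thread does not define a file format, and the structure that `create_cdss(genes, contig)` actually expects is not shown here.

```python
import csv
import io


def read_trusted_genes(handle):
    """Parse tab-delimited lines of: contig, start, stop, strand, product.

    The column order is a hypothetical layout chosen for this sketch.
    """
    genes = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        contig, start, stop, strand, product = row[:5]
        start, stop = int(start), int(stop)
        if strand not in ("+", "-"):
            raise ValueError(f"invalid strand {strand!r} for {contig}:{start}-{stop}")
        if start < 1 or stop < start:
            raise ValueError(f"invalid coordinates for {contig}: {start}-{stop}")
        genes.append({
            "contig": contig,
            "start": start,
            "stop": stop,
            "strand": strand,
            "product": product,
        })
    return genes


# Illustrative input with made-up values; a real file would be opened from disk.
example = io.StringIO(
    "#contig\tstart\tstop\tstrand\tproduct\n"
    "contig_1\t450\t968\t+\thypothetical trusted protein\n"
)
genes = read_trusted_genes(example)
```

Validating coordinates and strand at load time means a malformed line fails early with a clear message instead of surfacing later in the annotation run. Note that this simple check would reject a gene spanning the origin of a circular replicon, which a real implementation would need to handle.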

@oschwengers
Owner

OK, since more and more users are asking for a feature like this, I've started a new issue, #250, to collect ideas and requirements. Feedback and input are highly welcome!

@oschwengers oschwengers modified the milestones: Backlog, v1.9.0 Nov 27, 2023
@oschwengers
Owner

OK, I think #250 addresses the 1st use case described above: you can now provide CDS coordinates as GFF/GenBank via --regions and, if required or desired, provide accompanying functional information via --proteins.
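
For anyone preparing such a coordinates file by script, here is a minimal sketch that writes CDS coordinates as GFF3 lines. The nine-column layout follows the general GFF3 convention, but which columns and attributes Bakta reads from a --regions file is not stated in this thread, so the `ID` attribute, the `user` source label, and the function name `to_gff3` are all assumptions.

```python
def to_gff3(cdss):
    """Render a list of CDS dicts as GFF3 text (nine tab-separated columns)."""
    lines = ["##gff-version 3"]
    for i, cds in enumerate(cdss, start=1):
        columns = [
            cds["contig"],       # seqid: must match the contig name in the genome
            "user",              # source: arbitrary label (assumption)
            "CDS",               # feature type
            str(cds["start"]),   # 1-based start
            str(cds["stop"]),    # 1-based inclusive end
            ".",                 # score: not applicable
            cds["strand"],       # "+" or "-"
            "0",                 # phase: CDS begins at the first codon base
            f"ID=user_cds_{i}",  # attributes: hypothetical ID scheme
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


# Made-up coordinates for illustration.
gff3 = to_gff3([{"contig": "contig_1", "start": 450, "stop": 968, "strand": "+"}])
```

The resulting text could be saved to a file and passed via --regions, with a protein FASTA passed via --proteins if functional information should accompany the coordinates.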

The 2nd use case (finding CDS by aa-seq homology) is much more complex, as discussed here: #250 (comment).

To keep things lean with regard to these distinct use cases/issues, I'd close this for now. But please do not hesitate to re-open this or open another issue. Thanks for #260 - I'll come back to it later.
