When using --proteins, are the sequences only used for annotation? #216
Hi @marade, Bakta extracts short ORFs (<=30 aa) itself. However, a search against user-provided proteins is not yet implemented for sORFs. So if you're looking for those, I could add this to my list.
Unfortunately my protein, which is in GenBank in two different forms, is 172 aa or 184 aa long, so it wouldn't qualify as an sORF. When I run Prodigal with '-s', the Total score for the ORF is negative, and judging by the accepted ORFs, the Total needs to be positive. Any other ideas on how I might work around this issue?
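For anyone wanting to check this on their own genome, here is a minimal sketch of filtering a Prodigal '-s' score dump for negative-scoring candidates inside a region of interest. It assumes the score file is tab-separated, with '#' comment lines and a header row beginning with `Beg`, `End`, `Std`, `Total`; the function names and the sample rows are made up for illustration and are not real Prodigal output.

```python
def parse_prodigal_scores(text):
    """Parse a Prodigal '-s' score dump into a list of dicts.

    Assumed layout: '#' comment lines, then a tab-separated header row
    starting with 'Beg', followed by one row per candidate start site.
    """
    rows = []
    header = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and per-sequence comment lines
        fields = line.split("\t")
        if fields[0] == "Beg":
            header = fields  # the header may repeat for each sequence
            continue
        if header is None:
            continue  # ignore anything before the first header row
        row = dict(zip(header, fields))
        row["Beg"] = int(row["Beg"])
        row["End"] = int(row["End"])
        row["Total"] = float(row["Total"])
        rows.append(row)
    return rows


def rejected_candidates(rows, start, end, strand):
    """Return candidates on `strand` within [start, end] with a negative Total."""
    return [
        r for r in rows
        if r["Std"] == strand
        and start <= r["Beg"]
        and r["End"] <= end
        and r["Total"] < 0
    ]


# Illustrative data only (hypothetical coordinates and scores).
SAMPLE = (
    "# Sequence Data: seqnum=1;seqlen=5000\n"
    "# Run Data: illustrative\n"
    "Beg\tEnd\tStd\tTotal\n"
    "100\t618\t+\t-3.21\n"
    "700\t1500\t+\t25.40\n"
)

if __name__ == "__main__":
    for cand in rejected_candidates(parse_prodigal_scores(SAMPLE), 1, 1000, "+"):
        print(cand["Beg"], cand["End"], cand["Std"], cand["Total"])
```

This only confirms that a candidate exists and scores below zero; it does not change which ORFs Prodigal accepts.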
Somewhat related, I am noticing that Pyrodigal looks like a nice drop-in replacement for Prodigal. You might consider using it for additional flexibility and efficiency.
Oh sorry, looks like you're already using Pyrodigal. So much for that idea!
If you're just interested in these special protein sequences and not in the entire genome, you could predict them using Py/Prodigal and then annotate them using
I'm focused on whole genomes, so would need something for those. I suppose this is turning into a feature request? Ideally what I'd like is to be able to add some genes to the cdss variable after this line in main.py:
Another, perhaps more difficult, approach would be to try to get Pyrodigal to accept lower-scoring putative ORFs. I've already tried training my own Prodigal model, but that didn't get my genes recognized, and there is no easy way to adjust the training other than by feeding it different sequences.
I totally see your point. However, I think this would require some fairly complex changes to both the UI and the internal logic. Given that this is a rather rare use case, I don't think we have sufficient resources right now to implement it. Might it be easier to add this specific annotation manually to your genome?
I did consider something like that. For example, tools like GFF-Adder exist: https://github.com/NickJD/ORForise#gff-adder But I really need the genes added to all the annotation files: JSON, FNA, FFN, GFF3, etc. What if I coded it and submitted a pull request? The design might be as follows:
I feel I could code this kind of design fairly quickly. What do you think?
Ok, since more and more users are asking for a feature like this, I've started a new issue #250 collecting ideas and requirements. Feedback and input are highly welcome!
OK, I think #250 addresses the 1st use case described above: you can now provide CDS coordinates as GFF/GenBank via The 2nd use case (finding CDS by aa-seq homology) is much more complex, as discussed here: #250 (comment). To keep things lean with regard to these distinct use cases/issues, I'll close this for now. But please do not hesitate to re-open this or open another issue. Thanks for #260 - I'll come back to it later.
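For the 1st use case, the coordinates have to come from somewhere. Below is a rough sketch of one way to get them when the exact protein sequence is known: a naive six-frame translation with an exact string match, followed by writing a GFF3-style CDS line. This is not how Bakta or Pyrodigal work internally, all function names are hypothetical, and whether Bakta expects additional attributes in the supplied GFF is not something this sketch knows. It also assumes the gene starts with ATG; an alternative start codon such as GTG or TTG would translate to V or L here and the match would fail.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codon table built in TCAG order (same amino acid assignments as table 11).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, frame):
    """Translate `seq` from offset `frame`; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def find_cds(genome, protein):
    """Return (start, end, strand) tuples, 1-based inclusive, forward coordinates.

    The stop codon is included in the range when it directly follows the match.
    """
    hits = []
    n = len(genome)
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            aa = translate(seq, frame)
            idx = aa.find(protein)
            while idx != -1:
                nt_start = frame + 3 * idx              # 0-based, on `seq`
                nt_end = nt_start + 3 * len(protein)    # exclusive
                stop_pos = idx + len(protein)
                if stop_pos < len(aa) and aa[stop_pos] == "*":
                    nt_end += 3                         # include the stop codon
                if strand == "+":
                    hits.append((nt_start + 1, nt_end, strand))
                else:
                    hits.append((n - nt_end + 1, n - nt_start, strand))
                idx = aa.find(protein, idx + 1)
    return hits


def gff3_cds_line(seqid, start, end, strand, feature_id):
    """Format one tab-separated GFF3 CDS line with phase 0."""
    cols = [seqid, "user", "CDS", str(start), str(end), ".", strand, "0",
            "ID=" + feature_id]
    return "\t".join(cols)


if __name__ == "__main__":
    toy_genome = "CCATGAAATGGTAAGG"  # toy sequence, not real data
    for start, end, strand in find_cds(toy_genome, "MKW"):
        print(gff3_cds_line("contig_1", start, end, strand, "user_cds_1"))
```

An exact match is the simplest case. Anything involving divergent sequences falls under the 2nd use case and would need a proper homology search instead.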
Greetings. When running Bakta, if Prodigal scores an ORF too low and rejects it, but the correct sequence has been included with --proteins, will Bakta find and annotate the sequence? It appears the answer is no, but perhaps I'm missing something. I'd like to be able to find and annotate some ORFs that Prodigal is rejecting with a score that's too low.