When using --proteins, are the sequences only used for annotation? #216

marade · 2023-06-09T22:46:52Z

Greetings. When running Bakta, if Prodigal scores an ORF too low and rejects it, but the correct sequence has been included with --proteins, will Bakta find and annotate the sequence? It appears the answer is no, but perhaps I'm missing something. I'd like to be able to find and annotate some ORFs that Prodigal is rejecting with a score that's too low.

oschwengers · 2023-06-12T09:39:00Z

Hi @marade,
no, Bakta works on the de novo predicted genes from Prodigal/Pyrodigal and then applies several structural filters (overlaps with other feature types, Antifam, etc.) to discard some false positives. It does not work on lower-scoring sequences from Prodigal, although I see the point and this might be interesting in some cases.

What Bakta does do is extract short ORFs (<=30 aa) itself. However, a search against user proteins is not yet implemented for sORFs. So if you're looking for those, I could add this to my list.

marade · 2023-06-12T18:24:09Z

Unfortunately my protein, which is in GenBank in two different forms, is 172 aa or 184 aa long, so it wouldn't qualify as a sORF. When I run Prodigal with '-s' the Total score for the ORF is negative, and judging by the accepted ORFs the Total needs to be positive. Any other ideas on how I might work around this issue?
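One way to see how far a rejected ORF falls below the cutoff is to pull its candidate starts out of the `-s` output. This is a minimal sketch, assuming the file is tab-separated with a header row that names at least the Beg, End, Std and Total columns (check this against your own file). The helper names are hypothetical and the coordinates and scores in the example are made up.

```python
def load_candidates(text):
    """Parse tab-separated candidate rows into dicts.

    Assumes comment lines start with '#' and that a header row beginning
    with 'Beg' precedes the data rows. Rows from all sequences in the
    file are pooled, so split multi-contig files per sequence first.
    """
    rows, header = [], None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[0] == "Beg":
            header = fields          # (re)read the column names
            continue
        if header is None:
            continue                 # data before any header: skip
        row = dict(zip(header, fields))
        rows.append({
            "beg": int(row["Beg"]),
            "end": int(row["End"]),
            "strand": row["Std"],
            "total": float(row["Total"]),
        })
    return rows


def candidates_in_window(rows, start, stop, strand):
    """Return candidates fully inside [start, stop] on one strand, best Total first."""
    hits = [r for r in rows
            if r["strand"] == strand and r["beg"] >= start and r["end"] <= stop]
    return sorted(hits, key=lambda r: r["total"], reverse=True)


# Made-up example data, not real Prodigal output.
EXAMPLE = (
    "# Sequence Data: seqnum=1\n"
    "Beg\tEnd\tStd\tTotal\n"
    "97\t651\t+\t-3.25\n"
    "133\t651\t+\t-1.10\n"
    "900\t1200\t-\t12.40\n"
)

hits = candidates_in_window(load_candidates(EXAMPLE), 50, 700, "+")
for hit in hits:
    print(hit["beg"], hit["end"], hit["total"])
```

Listing every candidate start inside the expected locus shows whether both protein forms score negatively or only one of them does.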

marade · 2023-06-12T18:27:24Z

Somewhat related, I am noticing that Pyrodigal looks like a nice drop-in replacement for Prodigal. You might consider using it for additional flexibility and efficiency.

https://pypi.org/project/pyrodigal/

marade · 2023-06-12T18:43:35Z

Oh sorry, looks like you're already using Pyrodigal. So much for that idea!

oschwengers · 2023-06-14T11:38:51Z

If you're just interested in these special protein sequences and not in the entire genome, you could predict them using Py/Prodigal and then annotate them using bakta_proteins: https://github.com/oschwengers/bakta#protein-bulk-annotation

marade · 2023-06-14T16:48:51Z

I'm focused on whole genomes, so would need something for those. I suppose this is turning into a feature request? Ideally what I'd like is to be able to add some genes to the cdss variable after this line in main.py:

cdss = feat_cds.predict(genome, contigs_path)

Another, perhaps more difficult, approach would be to try to get Pyrodigal to accept lower scoring putative ORFs. I've already tried training my own Prodigal model, but that didn't get my genes recognized and there are no easy ways to adjust the training other than by feeding it different sequences.

oschwengers · 2023-06-15T08:59:51Z

I totally see your point. However, I think this would require some fairly complex changes to both the UI and the internal logic. Given that this is a rather rare use case, I don't think we have sufficient resources right now to implement it. Might it be easier to add this specific annotation manually to your genome?

marade · 2023-06-15T17:06:49Z

I did consider something like that. For example, tools like GFF-Adder exist:

https://github.com/NickJD/ORForise#gff-adder

But I really need the genes added to all the annotation files: JSON, FNA, FFN, GFF3, etc. What if I coded it and submitted a pull request? The design might be as follows:

Add --trusted-proteins flag.
The new flag would accept a tab-delimited input file (or perhaps JSON?) that can be read into a data structure that can be run through create_cdss(genes, contig) in bakta/features/cds.py.
Then we simply add a line to inject the trusted genes in main.py (as mentioned above) if the flag has been used.
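The input-reading step of this design could be sketched roughly as follows. Everything here is an assumption for illustration: the four-column layout (contig, start, stop, strand), the function name `read_trusted_genes`, and the returned structure are hypothetical, and nothing in this thread says what shape create_cdss actually expects.

```python
import csv
import io
from collections import defaultdict


def read_trusted_genes(handle):
    """Group trusted gene coordinates by contig from a tab-delimited file.

    Hypothetical format: contig <TAB> start <TAB> stop <TAB> strand,
    1-based inclusive coordinates, '#' for comment lines.
    """
    genes_by_contig = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t")
    for line_no, fields in enumerate(reader, start=1):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank and comment lines
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(fields)}")
        contig, start, stop, strand = fields
        start, stop = int(start), int(stop)
        if strand not in ("+", "-") or start < 1 or stop < start:
            raise ValueError(f"line {line_no}: invalid coordinates or strand")
        genes_by_contig[contig].append(
            {"start": start, "stop": stop, "strand": strand}
        )
    return dict(genes_by_contig)


# Made-up example input.
example = (
    "#contig\tstart\tstop\tstrand\n"
    "contig_1\t97\t651\t+\n"
    "contig_2\t10\t528\t-\n"
)
trusted = read_trusted_genes(io.StringIO(example))
print(trusted)
```

Grouping by contig up front would make it straightforward to hand each contig's genes to a per-contig function, and validating early keeps malformed rows from reaching the annotation steps.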

I feel I could code this kind of design fairly quickly. What do you think?

oschwengers · 2023-10-24T07:44:11Z

OK, since more and more users are asking for a feature like this, I've started a new issue, #250, to collect ideas and requirements. Feedback and input are highly welcome!

oschwengers · 2023-11-27T15:30:05Z

OK, I think #250 addresses the 1st use case described above: you can now provide CDS coordinates as GFF/GenBank via --regions and, if required or desired, also provide accompanying functional information via --proteins.
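For the GFF route, a coordinates file can be written by hand or generated. This is a minimal sketch that emits standard nine-column GFF3 CDS lines. The helper names, the `manual` source tag and the example IDs are hypothetical, and this thread does not state which feature types or attributes Bakta reads from a --regions file, so check the Bakta documentation before relying on it.

```python
def gff3_cds_line(seqid, start, end, strand, feature_id, source="manual"):
    """Format one CDS as a GFF3 line (1-based, inclusive coordinates)."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 1 <= start <= end:
        raise ValueError("coordinates must satisfy 1 <= start <= end")
    columns = [
        seqid,               # sequence (contig) identifier
        source,              # free-text source tag
        "CDS",               # feature type
        str(start),
        str(end),
        ".",                 # score: not available
        strand,
        "0",                 # phase: the CDS begins on a full codon
        f"ID={feature_id}",  # attributes
    ]
    return "\t".join(columns)


def gff3_document(records):
    """Build a GFF3 text with the version pragma followed by one line per CDS."""
    lines = ["##gff-version 3"]
    lines.extend(gff3_cds_line(**record) for record in records)
    return "\n".join(lines) + "\n"


# Made-up example coordinates.
records = [
    {"seqid": "contig_1", "start": 97, "end": 651, "strand": "+", "feature_id": "trusted_1"},
]
print(gff3_document(records), end="")
```

The resulting text could be saved to a file and passed via --regions, with matching protein sequences supplied via --proteins if functional information is wanted.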

The 2nd use case (finding CDS by aa-seq homology) is much more complex, as discussed here: #250 (comment).

To keep things lean with regard to these distinct use cases/issues, I'll close this for now. But please do not hesitate to re-open this or open another issue. Thanks for #260 - I'll come back to it later.

oschwengers added the question and feature labels Jul 27, 2023

oschwengers added this to the Backlog milestone Aug 22, 2023

simone-pignotti mentioned this issue Oct 9, 2023

Transfer annotations from similar genome #247

Open

oschwengers mentioned this issue Oct 24, 2023

Add import feature for user-provided regions and/or features #250

Closed

oschwengers modified the milestones: Backlog, v1.9.0 Nov 27, 2023

oschwengers closed this as completed Nov 27, 2023
