-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JOSS Review] Clarifying the fallback behaviour from BLAST to pyrodigal #52
Comments
Re openjournals/joss-reviews#5968 The run worked as expected for the test phage. To explain more fully: Point 1. Everytime blastx is run in dnaapler, it uses the class ExternalTool class defined in external_tools.py (https://github.com/gbouras13/dnaapler/blob/main/src/dnaapler/utils/external_tools.py). This is taken (with some modifications) from tbpore https://github.com/mbhall88/tbpore/blob/main/tbpore/external_tools.py. For any external tool (in dnaapler, this will only be blastx), this will automatically write stdout to the Because blastx specifies an output file using In terms of your concern about blastx crashing, the ExternalTool class has built in exception handling. If a subprocess error is detected, then it will trigger a system exit and tell the user of the error (in red text as well). Given the FASTA validation checks, blastx installation checks and unit tests, I anticipate this will be rare, but if it occurs, this error will be handled. Point 2. Prior to v0.4.0, dnaapler would look for a blastx hit alignment from the output file that started with a valid stop codon "M", "V", or "L". If it could not find a start codon, it would notify the user that a hit was found but without a valid start codon and exit, returning the input FASTA as the output. However, following this issue #44, some more testing by one of the co-authors on her data and a realisation this would exclude many sequences with distant homology to the databases (we anticipated this would be a particular problem for novel phages and plasmids where near perfect matches would be less likely to be in the databases), in v0.4.0 the logic has been changed. (see here for the current implementation): dnaapler/src/dnaapler/utils/all.py Line 238 in 7c67b3b
and dnaapler/src/dnaapler/utils/processing.py Line 292 in 7c67b3b
Now, dnaapler will still look at the top blastx hit as previously. If the alignment does begin with a valid start codon, dnaapler proceeds as usual. If the alignment doesn't begin with a valid start codon, then instead of warning and exiting, pyrodigal is run to predict all the CDS on the contig. Dnaapler then calculates the CDS that has the most overlap with the blastx top hit, and chooses that CDS to reorient the sequence with. This should be the CDS most likely to be the terL/repA/dnaA. In the attached case, that is what happened with the test phage - it chose a CDS starting at 4183 on the forward strand as terL, and choose that to reorient the phage with. |
Thank you @gbouras13 . In response to the comments above:
|
Hi @mkerin ,
When I remove that print statement, it results in a blank .stderr. Thanks for picking this up! I've made a commit to address this and remove the statement.
dnaapler/tests/test_dnaapler.py Line 75 in 637429c
and end-to-end tests here(all of them running on chromosome.fasta will cover the scenario where there is no fallback) dnaapler/tests/test_overall.py Line 64 in 637429c
George |
I accidentally changed check_call to call in the first commit - I reverted in the second (sorry for the confusion). Also I needed to change a unit test afterwards (it passes CI now). |
Hi @gbouras13, Thank you for the fix! I am happy that this has been resolved - closing the issue now. |
Sorry - I have one more question on this subject. Just for my own edification, why do you only consider the top blast hit before falling back to pyrodigal. Is it possible that blast returns >1 hit, where the second hit has a valid start codon? |
Thanks for this @mkerin it's a very good question! In almost all cases where there is a lower BLAST hit with a valid start codon, the same CDS will be covered by the top hit (based on our tests). As a toy example, if a terL was 600 AA long, you could have a top BLAST hit from 200-600AA, and lower hit from 1-100AA, while 100-200AA has low homology to the dnaapler database and so returns no hit. In this case, the correct CDS would be found using the pyrodigal fallback method, as the correct CDS will have the most overlap with the best hit anyway (from 200-600AA). I anticipate this case will cover 90%+ of the cases where a lower hit begins with a valid start codon. Chromosomes, phages and plasmids should have only one of dnaA/repA/terL (of course there's exceptions to every rule because this is biology but as a general rule this holds). So in the case where a replicon has more than 1 dnaA/terL/repA CDS in the genome, then presumably both should have BLAST hits (whether or not they start with a valid start codon). Because the top hit will be better (usually both in terms of length and homology), the pyrodigal fallback would choose that CDS with the best match regardless of start codon which I think is desired behaviour - to be honest, in this case it doesn't really matter which one is chosen, as it is unclear which one is 'correct'. The other edge case I can conceive of is where the original dnaA/terL/repA CDS is disrupted and is annotated as 2 different CDS - in fact that is what is occurring in the SAOMS1 phage test that you ran in the original comment in this thread (which I presume was a reason behind this question). This is a tricky edge case example hence why I included it in the tests. In this case, the terL is disrupted by a mobile genetic element (HNH endonuclease). I've attached sample pharokka output below from the .gff file, each row is a CDS. The start coordinate is the column after the 'CDS', the stop is the column after that. 'product' = the annotated function according to pharokka.
In this case, there are 2 CDS annotated as the terL - and in your attached BLAST output, the top hit is the second CDS beginning 4183 of length 540AA, and the second hit (starting with a valid start codon ) begins with 2580 length 65AA. With the current logic, dnaapler will choose the second CDS using the fallback. I think this is actually the correct approach for this phage (knowing a bit (not much!) about phage biology suggests that the 4183 CDS is much more likely to be the 'real' truncated terL in this phage due to its length and the domains preserved). Generally speaking if the disruption is early in the CDS, then the longer second hit (with no valid start codon in the BLAST hit) will likely be better. In other cases where disruption happens, this may not be the case. If the disruption happens later in the original CDS, then the top hit (likely with a valid start codon) will be chosen as the longer/best hit of the 2 CDS. To summarise, I think the fallback implemented is the best approach in most cases - but not perfect of course! George |
Thank you for the comprehensive response to this. I am happy to mark the issue as resolved. |
Re openjournals/joss-reviews#5968
Hello,
I have tried to common sense check the output of
dnaapler
on the provided test data by running the followingIn case it is helpful I have attached a .zip folder of the output that I get from this command.
Two things are surprising to me:
output.all/logs/blastx_7d5612cc02ea76c324567a7b1bfeb1dfe9d64fbfc4b176959385cdc60514d626.err
)top blastx hit ... not begin with a valid start codon
and dnaapler falls back to reorientating the sequence using pyrodigalCould you explain why BLAST is producing errors in (1), and then the BLAST ouptut is being ignored in (2)? I worry that (1) implies that BLAST has hard-crashed, and then the error has been swallowed because
dnaapler
has fallen back to usingpyrodigal
.output.all.zip
The text was updated successfully, but these errors were encountered: