[JOSS Review] Clarifying the fallback behaviour from BLAST to pyrodigal #52

mkerin · 2023-11-03T15:56:48Z

Hello,

I have tried to common sense check the output of dnaapler on the provided test data by running the following

dnaapler all -i tests/test_data/SAOMS1.fasta -o output.all -f

In case it is helpful I have attached a .zip folder of the output that I get from this command.

Two things are surprising to me:

We get logs from running the BLAST external dependency, but only the log for the standard error is populated (output.all/logs/blastx_7d5612cc02ea76c324567a7b1bfeb1dfe9d64fbfc4b176959385cdc60514d626.err)
The primary log states that top blastx hit ... not begin with a valid start codon and dnaapler falls back to reorientating the sequence using pyrodigal

Could you explain why BLAST is producing errors in (1), and then the BLAST ouptut is being ignored in (2)? I worry that (1) implies that BLAST has hard-crashed, and then the error has been swallowed because dnaapler has fallen back to using pyrodigal.

output.all.zip

The text was updated successfully, but these errors were encountered:

gbouras13 · 2023-11-06T00:39:48Z

Re openjournals/joss-reviews#5968

The run worked as expected for the test phage.

To explain more fully:

Point 1.

Everytime blastx is run in dnaapler, it uses the class ExternalTool class defined in external_tools.py (https://github.com/gbouras13/dnaapler/blob/main/src/dnaapler/utils/external_tools.py). This is taken (with some modifications) from tbpore https://github.com/mbhall88/tbpore/blob/main/tbpore/external_tools.py.

For any external tool (in dnaapler, this will only be blastx), this will automatically write stdout to the .out file if it is not redirected otherwise by the tool, and the stderr to the .err file.

Because blastx specifies an output file using -out, all stdout is redirected to that file and so the out log file is empty. Therefore only the .err file is populated. In what you have attached, the blastx output file is dnaapler_blast_output.txt and is correct. This is quirk of blastx - what is in that file is the stderr (which is just the command line).

In terms of your concern about blastx crashing, the ExternalTool class has built in exception handling. If a subprocess error is detected, then it will trigger a system exit and tell the user of the error (in red text as well). Given the FASTA validation checks, blastx installation checks and unit tests, I anticipate this will be rare, but if it occurs, this error will be handled.

Point 2.

Prior to v0.4.0, dnaapler would look for a blastx hit alignment from the output file that started with a valid stop codon "M", "V", or "L". If it could not find a start codon, it would notify the user that a hit was found but without a valid start codon and exit, returning the input FASTA as the output.

However, following this issue #44, some more testing by one of the co-authors on her data and a realisation this would exclude many sequences with distant homology to the databases (we anticipated this would be a particular problem for novel phages and plasmids where near perfect matches would be less likely to be in the databases), in v0.4.0 the logic has been changed. (see here for the current implementation):

dnaapler/src/dnaapler/utils/all.py

Line 238 in 7c67b3b

if filtered_df["qseq"][0][0] in ["M", "V", "L"] and (

and

dnaapler/src/dnaapler/utils/processing.py

Line 292 in 7c67b3b

def reorient_single_record_bulk(

Now, dnaapler will still look at the top blastx hit as previously. If the alignment does begin with a valid start codon, dnaapler proceeds as usual.

If the alignment doesn't begin with a valid start codon, then instead of warning and exiting, pyrodigal is run to predict all the CDS on the contig.

Dnaapler then calculates the CDS that has the most overlap with the blastx top hit, and chooses that CDS to reorient the sequence with. This should be the CDS most likely to be the terL/repA/dnaA.

In the attached case, that is what happened with the test phage - it chose a CDS starting at 4183 on the forward strand as terL, and choose that to reorient the phage with.

mkerin · 2023-11-06T11:24:30Z

Thank you @gbouras13 . In response to the comments above:

Okay, I understand why the .out file is empty. But why is blastx producing any stderr output at all? If you run the same blastx command by hand (interative terminal outside of the ExternalTool class) does it produce stderr output?
Okay, I can see that there are cases when we want to fall back gracefully. Just to check - does the test suite include a case where blastx returns a valid hit and dnaapler doesn't have to fall back to prodigal?

gbouras13 · 2023-11-07T01:22:04Z

Hi @mkerin ,

I have looked into this and if you run blastx just in the terminal ( e.g. with 2> blast.err) it captures no stderr. This prompted me to look deeper, and I realised this was caused by a little bug, namely this print statement:

print(f"Command line: {self.command_as_str}", file=stderr_fh)

dnaapler/src/dnaapler/utils/external_tools.py

Line 49 in 637429c

print(f"Command line: {self.command_as_str}", file=stderr_fh)

When I remove that print statement, it results in a blank .stderr. Thanks for picking this up! I've made a commit to address this and remove the statement.

Yes there are tests for both scenarios - see unit tests here specifically to parse blast output and check that this is being captured

dnaapler/tests/test_dnaapler.py

Line 75 in 637429c

class TestReorientSequence(unittest.TestCase):

and end-to-end tests here(all of them running on chromosome.fasta will cover the scenario where there is no fallback)

dnaapler/tests/test_overall.py

Line 64 in 637429c

def test_chrom(tmp_dir):

George

gbouras13 · 2023-11-07T01:24:37Z

I accidentally changed check_call to call in the first commit - I reverted in the second (sorry for the confusion).

Also I needed to change a unit test afterwards (it passes CI now).

mkerin · 2023-11-08T11:28:41Z

Hi @gbouras13,

Thank you for the fix! I am happy that this has been resolved - closing the issue now.

mkerin · 2023-11-08T11:38:46Z

Sorry - I have one more question on this subject.

Just for my own edification, why do you only consider the top blast hit before falling back to pyrodigal. Is it possible that blast returns >1 hit, where the second hit has a valid start codon?

gbouras13 · 2023-11-08T22:54:15Z

Thanks for this @mkerin it's a very good question!

In almost all cases where there is a lower BLAST hit with a valid start codon, the same CDS will be covered by the top hit (based on our tests). As a toy example, if a terL was 600 AA long, you could have a top BLAST hit from 200-600AA, and lower hit from 1-100AA, while 100-200AA has low homology to the dnaapler database and so returns no hit. In this case, the correct CDS would be found using the pyrodigal fallback method, as the correct CDS will have the most overlap with the best hit anyway (from 200-600AA). I anticipate this case will cover 90%+ of the cases where a lower hit begins with a valid start codon.

Chromosomes, phages and plasmids should have only one of dnaA/repA/terL (of course there's exceptions to every rule because this is biology but as a general rule this holds).

So in the case where a replicon has more than 1 dnaA/terL/repA CDS in the genome, then presumably both should have BLAST hits (whether or not they start with a valid start codon). Because the top hit will be better (usually both in terms of length and homology), the pyrodigal fallback would choose that CDS with the best match regardless of start codon which I think is desired behaviour - to be honest, in this case it doesn't really matter which one is chosen, as it is unclear which one is 'correct'.

The other edge case I can conceive of is where the original dnaA/terL/repA CDS is disrupted and is annotated as 2 different CDS - in fact that is what is occurring in the SAOMS1 phage test that you ran in the original comment in this thread (which I presume was a reason behind this question). This is a tricky edge case example hence why I included it in the tests.

In this case, the terL is disrupted by a mobile genetic element (HNH endonuclease). I've attached sample pharokka output below from the .gff file, each row is a CDS. The start coordinate is the column after the 'CDS', the stop is the column after that. 'product' = the annotated function according to pharokka.

MW460250_1_subset	Pyrodigal_3.1.0	CDS	2580	2777	26.9	+	0	ID=ACPGARRA_CDS_0009;transl_table=11;phrog=5340;top_hit=No_MMseqs_PHROG_hit;locus_tag=ACPGARRA_CDS_0009;function=head and packaging;product=**terminase large subunit**
MW460250_1_subset	Pyrodigal_3.1.0	CDS	3071	4042	140.8	+	0	ID=ACPGARRA_CDS_0010;transl_table=11;phrog=559;top_hit=No_MMseqs_PHROG_hit;locus_tag=ACPGARRA_CDS_0010;function=DNA, RNA and nucleotide metabolism;product=**HNH endonuclease**
MW460250_1_subset	Pyrodigal_3.1.0	CDS	4183	5730	232.5	+	0	ID=ACPGARRA_CDS_0011;transl_table=11;phrog=675;top_hit=No_MMseqs_PHROG_hit;locus_tag=ACPGARRA_CDS_0011;function=head and packaging;product=**terminase large subunit**

In this case, there are 2 CDS annotated as the terL - and in your attached BLAST output, the top hit is the second CDS beginning 4183 of length 540AA, and the second hit (starting with a valid start codon ) begins with 2580 length 65AA.

With the current logic, dnaapler will choose the second CDS using the fallback. I think this is actually the correct approach for this phage (knowing a bit (not much!) about phage biology suggests that the 4183 CDS is much more likely to be the 'real' truncated terL in this phage due to its length and the domains preserved). Generally speaking if the disruption is early in the CDS, then the longer second hit (with no valid start codon in the BLAST hit) will likely be better.

In other cases where disruption happens, this may not be the case. If the disruption happens later in the original CDS, then the top hit (likely with a valid start codon) will be chosen as the longer/best hit of the 2 CDS.

To summarise, I think the fallback implemented is the best approach in most cases - but not perfect of course!

George

mkerin · 2023-11-14T22:34:45Z

Thank you for the comprehensive response to this. I am happy to mark the issue as resolved.

gbouras13 mentioned this issue Nov 6, 2023

[JOSS Review] - Text corrections and suggestions #47

Closed

3 tasks

gbouras13 added a commit that referenced this issue Nov 7, 2023

address #52 to remove stderr print bug

0c1ca2d

gbouras13 added a commit that referenced this issue Nov 7, 2023

address #52 revert to check_call

28d1576

mkerin closed this as completed Nov 8, 2023

mkerin reopened this Nov 8, 2023

mkerin closed this as completed Nov 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[JOSS Review] Clarifying the fallback behaviour from BLAST to pyrodigal #52

[JOSS Review] Clarifying the fallback behaviour from BLAST to pyrodigal #52

mkerin commented Nov 3, 2023

gbouras13 commented Nov 6, 2023 •

edited

Loading

mkerin commented Nov 6, 2023

gbouras13 commented Nov 7, 2023

gbouras13 commented Nov 7, 2023 •

edited

Loading

mkerin commented Nov 8, 2023

mkerin commented Nov 8, 2023

gbouras13 commented Nov 8, 2023 •

edited

Loading

mkerin commented Nov 14, 2023

[JOSS Review] Clarifying the fallback behaviour from BLAST to pyrodigal #52

[JOSS Review] Clarifying the fallback behaviour from BLAST to pyrodigal #52

Comments

mkerin commented Nov 3, 2023

gbouras13 commented Nov 6, 2023 • edited Loading

mkerin commented Nov 6, 2023

gbouras13 commented Nov 7, 2023

gbouras13 commented Nov 7, 2023 • edited Loading

mkerin commented Nov 8, 2023

mkerin commented Nov 8, 2023

gbouras13 commented Nov 8, 2023 • edited Loading

mkerin commented Nov 14, 2023

gbouras13 commented Nov 6, 2023 •

edited

Loading

gbouras13 commented Nov 7, 2023 •

edited

Loading

gbouras13 commented Nov 8, 2023 •

edited

Loading