Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JOSS Review] Clarifying the fallback behaviour from BLAST to pyrodigal #52

Closed
mkerin opened this issue Nov 3, 2023 · 8 comments
Closed

Comments

@mkerin
Copy link

mkerin commented Nov 3, 2023

Re openjournals/joss-reviews#5968

Hello,

I have tried to common sense check the output of dnaapler on the provided test data by running the following

dnaapler all -i tests/test_data/SAOMS1.fasta -o output.all -f

In case it is helpful I have attached a .zip folder of the output that I get from this command.

Two things are surprising to me:

  1. We get logs from running the BLAST external dependency, but only the log for the standard error is populated (output.all/logs/blastx_7d5612cc02ea76c324567a7b1bfeb1dfe9d64fbfc4b176959385cdc60514d626.err)
  2. The primary log states that top blastx hit ... not begin with a valid start codon and dnaapler falls back to reorientating the sequence using pyrodigal

Could you explain why BLAST is producing errors in (1), and then the BLAST ouptut is being ignored in (2)? I worry that (1) implies that BLAST has hard-crashed, and then the error has been swallowed because dnaapler has fallen back to using pyrodigal.

output.all.zip

@gbouras13
Copy link
Owner

gbouras13 commented Nov 6, 2023

Re openjournals/joss-reviews#5968

The run worked as expected for the test phage.

To explain more fully:

Point 1.

Everytime blastx is run in dnaapler, it uses the class ExternalTool class defined in external_tools.py (https://github.com/gbouras13/dnaapler/blob/main/src/dnaapler/utils/external_tools.py). This is taken (with some modifications) from tbpore https://github.com/mbhall88/tbpore/blob/main/tbpore/external_tools.py.

For any external tool (in dnaapler, this will only be blastx), this will automatically write stdout to the .out file if it is not redirected otherwise by the tool, and the stderr to the .err file.

Because blastx specifies an output file using -out, all stdout is redirected to that file and so the out log file is empty. Therefore only the .err file is populated. In what you have attached, the blastx output file is dnaapler_blast_output.txt and is correct. This is quirk of blastx - what is in that file is the stderr (which is just the command line).

In terms of your concern about blastx crashing, the ExternalTool class has built in exception handling. If a subprocess error is detected, then it will trigger a system exit and tell the user of the error (in red text as well). Given the FASTA validation checks, blastx installation checks and unit tests, I anticipate this will be rare, but if it occurs, this error will be handled.

Point 2.

Prior to v0.4.0, dnaapler would look for a blastx hit alignment from the output file that started with a valid stop codon "M", "V", or "L". If it could not find a start codon, it would notify the user that a hit was found but without a valid start codon and exit, returning the input FASTA as the output.

However, following this issue #44, some more testing by one of the co-authors on her data and a realisation this would exclude many sequences with distant homology to the databases (we anticipated this would be a particular problem for novel phages and plasmids where near perfect matches would be less likely to be in the databases), in v0.4.0 the logic has been changed. (see here for the current implementation):

if filtered_df["qseq"][0][0] in ["M", "V", "L"] and (

and

def reorient_single_record_bulk(

Now, dnaapler will still look at the top blastx hit as previously. If the alignment does begin with a valid start codon, dnaapler proceeds as usual.

If the alignment doesn't begin with a valid start codon, then instead of warning and exiting, pyrodigal is run to predict all the CDS on the contig.

Dnaapler then calculates the CDS that has the most overlap with the blastx top hit, and chooses that CDS to reorient the sequence with. This should be the CDS most likely to be the terL/repA/dnaA.

In the attached case, that is what happened with the test phage - it chose a CDS starting at 4183 on the forward strand as terL, and choose that to reorient the phage with.

@mkerin
Copy link
Author

mkerin commented Nov 6, 2023

Thank you @gbouras13 . In response to the comments above:

  1. Okay, I understand why the .out file is empty. But why is blastx producing any stderr output at all? If you run the same blastx command by hand (interative terminal outside of the ExternalTool class) does it produce stderr output?
  2. Okay, I can see that there are cases when we want to fall back gracefully. Just to check - does the test suite include a case where blastx returns a valid hit and dnaapler doesn't have to fall back to prodigal?

@gbouras13
Copy link
Owner

Hi @mkerin ,

  1. I have looked into this and if you run blastx just in the terminal ( e.g. with 2> blast.err) it captures no stderr. This prompted me to look deeper, and I realised this was caused by a little bug, namely this print statement:
print(f"Command line: {self.command_as_str}", file=stderr_fh)

print(f"Command line: {self.command_as_str}", file=stderr_fh)

When I remove that print statement, it results in a blank .stderr. Thanks for picking this up! I've made a commit to address this and remove the statement.

  1. Yes there are tests for both scenarios - see unit tests here specifically to parse blast output and check that this is being captured

class TestReorientSequence(unittest.TestCase):

and end-to-end tests here(all of them running on chromosome.fasta will cover the scenario where there is no fallback)

def test_chrom(tmp_dir):

George

gbouras13 added a commit that referenced this issue Nov 7, 2023
gbouras13 added a commit that referenced this issue Nov 7, 2023
@gbouras13
Copy link
Owner

gbouras13 commented Nov 7, 2023

I accidentally changed check_call to call in the first commit - I reverted in the second (sorry for the confusion).

Also I needed to change a unit test afterwards (it passes CI now).

@mkerin
Copy link
Author

mkerin commented Nov 8, 2023

Hi @gbouras13,

Thank you for the fix! I am happy that this has been resolved - closing the issue now.

@mkerin mkerin closed this as completed Nov 8, 2023
@mkerin mkerin reopened this Nov 8, 2023
@mkerin
Copy link
Author

mkerin commented Nov 8, 2023

Sorry - I have one more question on this subject.

Just for my own edification, why do you only consider the top blast hit before falling back to pyrodigal. Is it possible that blast returns >1 hit, where the second hit has a valid start codon?

@gbouras13
Copy link
Owner

gbouras13 commented Nov 8, 2023

Thanks for this @mkerin it's a very good question!

In almost all cases where there is a lower BLAST hit with a valid start codon, the same CDS will be covered by the top hit (based on our tests). As a toy example, if a terL was 600 AA long, you could have a top BLAST hit from 200-600AA, and lower hit from 1-100AA, while 100-200AA has low homology to the dnaapler database and so returns no hit. In this case, the correct CDS would be found using the pyrodigal fallback method, as the correct CDS will have the most overlap with the best hit anyway (from 200-600AA). I anticipate this case will cover 90%+ of the cases where a lower hit begins with a valid start codon.

Chromosomes, phages and plasmids should have only one of dnaA/repA/terL (of course there's exceptions to every rule because this is biology but as a general rule this holds).

So in the case where a replicon has more than 1 dnaA/terL/repA CDS in the genome, then presumably both should have BLAST hits (whether or not they start with a valid start codon). Because the top hit will be better (usually both in terms of length and homology), the pyrodigal fallback would choose that CDS with the best match regardless of start codon which I think is desired behaviour - to be honest, in this case it doesn't really matter which one is chosen, as it is unclear which one is 'correct'.

The other edge case I can conceive of is where the original dnaA/terL/repA CDS is disrupted and is annotated as 2 different CDS - in fact that is what is occurring in the SAOMS1 phage test that you ran in the original comment in this thread (which I presume was a reason behind this question). This is a tricky edge case example hence why I included it in the tests.

In this case, the terL is disrupted by a mobile genetic element (HNH endonuclease). I've attached sample pharokka output below from the .gff file, each row is a CDS. The start coordinate is the column after the 'CDS', the stop is the column after that. 'product' = the annotated function according to pharokka.

MW460250_1_subset	Pyrodigal_3.1.0	CDS	2580	2777	26.9	+	0	ID=ACPGARRA_CDS_0009;transl_table=11;phrog=5340;top_hit=No_MMseqs_PHROG_hit;locus_tag=ACPGARRA_CDS_0009;function=head and packaging;product=**terminase large subunit**
MW460250_1_subset	Pyrodigal_3.1.0	CDS	3071	4042	140.8	+	0	ID=ACPGARRA_CDS_0010;transl_table=11;phrog=559;top_hit=No_MMseqs_PHROG_hit;locus_tag=ACPGARRA_CDS_0010;function=DNA, RNA and nucleotide metabolism;product=**HNH endonuclease**
MW460250_1_subset	Pyrodigal_3.1.0	CDS	4183	5730	232.5	+	0	ID=ACPGARRA_CDS_0011;transl_table=11;phrog=675;top_hit=No_MMseqs_PHROG_hit;locus_tag=ACPGARRA_CDS_0011;function=head and packaging;product=**terminase large subunit**

In this case, there are 2 CDS annotated as the terL - and in your attached BLAST output, the top hit is the second CDS beginning 4183 of length 540AA, and the second hit (starting with a valid start codon ) begins with 2580 length 65AA.

With the current logic, dnaapler will choose the second CDS using the fallback. I think this is actually the correct approach for this phage (knowing a bit (not much!) about phage biology suggests that the 4183 CDS is much more likely to be the 'real' truncated terL in this phage due to its length and the domains preserved). Generally speaking if the disruption is early in the CDS, then the longer second hit (with no valid start codon in the BLAST hit) will likely be better.

In other cases where disruption happens, this may not be the case. If the disruption happens later in the original CDS, then the top hit (likely with a valid start codon) will be chosen as the longer/best hit of the 2 CDS.

To summarise, I think the fallback implemented is the best approach in most cases - but not perfect of course!

George

@mkerin
Copy link
Author

mkerin commented Nov 14, 2023

Thank you for the comprehensive response to this. I am happy to mark the issue as resolved.

@mkerin mkerin closed this as completed Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants