Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ancestral] reference seq may be reference or inferred tree root #1362

Open
jameshadfield opened this issue Dec 18, 2023 · 2 comments
Open
Labels
bug Something isn't working priority: low To be resolved after high and moderate priority issues

Comments

@jameshadfield
Copy link
Member

jameshadfield commented Dec 18, 2023

Current Behavior

augur ancestral exports a "reference" sequence via <JSON>.reference.nuc. Depending on the usage this is:

  1. For VCF inputs this is simply the FASTA input --vcf-reference
  2. For FASTA inputs, if --root-sequence (FASTA/GenBank) is provided then we read that (and error if we can't)
  3. For FASTA inputs without --root-sequence then the root sequence is the inferred sequence at the root node. Salient code:

augur/augur/ancestral.py

Lines 157 to 160 in d35f838

if root_sequence:
root_seq = root_sequence
else:
root_seq = tt.sequence(T.root, as_string=True)

Note that the relationship between <JSON>.reference.nuc and the sequence at the root-node is correct in each of those cases. This is important as it implies the (JSON nuc) reference sequence is appropriate to use in the context of a nextclade dataset. See the following tests, respectively:

  1. cram test
  2. cram test
  3. cram test

related issue See also #1361.

Expected behavior

We should clearly distinguish between reference & root-sequence in the JSON key names.

How to reproduce

See above cram tests

Possible solution

Only write <JSON>.reference in cases 1 & 2, where we know that it actually corresponds to a provided reference. For case (3) there can be no inferred mutations on the root node so export can just use the sequence attached to the root rather than simply node_data.reference as it does now.

Your environment: if running Nextstrain locally

augur 23.1.1

@jameshadfield jameshadfield added the bug Something isn't working label Dec 18, 2023
jameshadfield added a commit that referenced this issue Dec 21, 2023
This test was modified to confirm my own understanding of the code which
allowed the next set of commits (modifying `augur translate`). Rather
than drop it, it's helpful to include it in the repo for future
reference. This commit is related to
<#1362>.
jameshadfield added a commit that referenced this issue Dec 21, 2023
For JSON inputs we were previously incorrectly exporting the
root-sequence translations as the "reference". Instead, we now translate
the provided reference (nuc) sequence. (There is some subtlety here
because the provided nuc reference sequence may in fact be the
root-sequence rather than an actual reference, but this is a problem
with `augur ancestral`. See
<#1362> for more details.)

This allows us to compare the reference translation to the root-sequence
translation and thus detail any AA mutations at the root node. A
side-effect of this is that we now always export an array of mutations
for each gene/CDS at the root node, although this may often be empty.

This brings the behaviour of JSON inputs in-line with that of VCF
inputs.
jameshadfield added a commit that referenced this issue Dec 30, 2023
This test was modified to confirm my own understanding of the code which
allowed the next set of commits (modifying `augur translate`). Rather
than drop it, it's helpful to include it in the repo for future
reference. This commit is related to
<#1362>.
jameshadfield added a commit that referenced this issue Dec 30, 2023
For JSON inputs we were previously incorrectly exporting the
root-sequence translations as the "reference". Instead, we now translate
the provided reference (nuc) sequence. (There is some subtlety here
because the provided nuc reference sequence may in fact be the
root-sequence rather than an actual reference, but this is a problem
with `augur ancestral`. See
<#1362> for more details.)

This allows us to compare the reference translation to the root-sequence
translation and thus detail any AA mutations at the root node. A
side-effect of this is that we now always export an array of mutations
for each gene/CDS at the root node, although this may often be empty.

This brings the behaviour of JSON inputs in-line with that of VCF
inputs.
jameshadfield added a commit that referenced this issue Jan 22, 2024
This test was modified to confirm my own understanding of the code which
allowed the next set of commits (modifying `augur translate`). Rather
than drop it, it's helpful to include it in the repo for future
reference. This commit is related to
<#1362>.
jameshadfield added a commit that referenced this issue Jan 22, 2024
For JSON inputs we were previously incorrectly exporting the
root-sequence translations as the "reference". Instead, we now translate
the provided reference (nuc) sequence. (There is some subtlety here
because the provided nuc reference sequence may in fact be the
root-sequence rather than an actual reference, but this is a problem
with `augur ancestral`. See
<#1362> for more details.)

This allows us to compare the reference translation to the root-sequence
translation and thus detail any AA mutations at the root node. A
side-effect of this is that we now always export an array of mutations
for each gene/CDS at the root node, although this may often be empty.

This brings the behaviour of JSON inputs in-line with that of VCF
inputs.
jameshadfield added a commit that referenced this issue Jan 22, 2024
For JSON inputs we were previously incorrectly exporting the
root-sequence translations as the "reference". Instead, we now translate
the provided reference (nuc) sequence. (There is some subtlety here
because the provided nuc reference sequence may in fact be the
root-sequence rather than an actual reference, but this is a problem
with `augur ancestral`. See
<#1362> for more details.)

This allows us to compare the reference translation to the root-sequence
translation and thus detail any AA mutations at the root node. A
side-effect of this is that we now always export an array of mutations
for each gene/CDS at the root node, although this may often be empty.

This brings the behaviour of JSON inputs in-line with that of VCF
inputs.
@jameshadfield jameshadfield added the priority: low To be resolved after high and moderate priority issues label Jan 25, 2024
@jameshadfield
Copy link
Member Author

This bug was not fixed as part of #1355 however this test describes the current behaviour for FASTA inputs with/without --root-sequence, and includes output where the <JSON>.reference.nuc differs.

@jameshadfield
Copy link
Member Author

The above comments focus only on the nucleotide sequence(s). augur ancestral can also export translations for non-VCF inputs. The behaviour is identical, i.e. cases (2,3) hold.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority: low To be resolved after high and moderate priority issues
Projects
None yet
Development

No branches or pull requests

1 participant