Add reversions/back-mutations as within-Auspice-computed branch label #1444

Closed
corneliusroemer opened this issue Jan 17, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@corneliusroemer
Member

Context
Reference backfilling is a big problem in SARS-CoV-2 sequences. All the information one needs to identify reversions back to reference is included in the auspice.json. This would for example allow me to quickly check that a Nextclade reference tree doesn't contain any reversions.

Description
As a user, I would like nucleotide reversions (either only to reference, or to any previous state) to be highlightable on the tree, for example as a branch label, as we do with clades or sometimes Spike mutations.

Examples
UShER already implements this feature (they must do it in the backend), so there's clearly some interest in it beyond me.
image

Possible solution
I could write a custom Python script that post-processes an auspice.json to add this as a branch annotation. But it's silly to do this with a script when it could be implemented within Auspice for all trees, for all users.
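For illustration, here is a minimal sketch of what such a post-processing script might look like. It assumes the usual Auspice dataset layout (nested nodes with `children`, and nucleotide mutations stored as strings like `C241T` under `branch_attrs.mutations.nuc`). The label key `reversion` and the function name are hypothetical, and the root state at a site is inferred from the first mutation seen at that site on the path from the root.

```python
import re

# Simple substitutions such as "C241T"; "-" covers gaps.
MUT = re.compile(r"^([A-Z-])(\d+)([A-Z-])$")

def label_root_reversions(node, root_state=None, label_key="reversion"):
    """Recursively add a branch label listing mutations back to the root state.

    root_state maps position -> root base, inferred from the first mutation
    observed at that position on the path from the root (an assumption).
    """
    root_state = dict(root_state or {})  # copy so sibling subtrees don't share state
    reversions = []
    for mut in node.get("branch_attrs", {}).get("mutations", {}).get("nuc", []):
        match = MUT.match(mut)
        if not match:
            continue  # skip anything that is not a simple substitution
        frm, pos, to = match.group(1), int(match.group(2)), match.group(3)
        if pos not in root_state:
            root_state[pos] = frm  # first change at this site: frm is the root base
        elif to == root_state[pos]:
            reversions.append(mut)
    if reversions:
        labels = node.setdefault("branch_attrs", {}).setdefault("labels", {})
        labels[label_key] = ", ".join(reversions)
    for child in node.get("children", []):
        label_root_reversions(child, root_state, label_key)
    return node
```

Calling `label_root_reversions(dataset["tree"])` on a loaded dataset and writing the JSON back out would then make the label selectable like any other branch label.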

@corneliusroemer corneliusroemer added the enhancement New feature or request label Jan 17, 2022
@jameshadfield
Member

jameshadfield commented Jan 19, 2022

Thanks @corneliusroemer — I completely agree and think this feature will immensely help with interpreting trees, especially Omicron. I’m going to expand this issue slightly to encompass changes we've discussed regarding display of mutations more generally.

Current situation for branch labels
Branch labels must be defined within the dataset JSON, and we typically do this for clade and AA changes. Auspice only contains one piece of special behavior here - if the branch label key is aa then we selectively display the labels to avoid showing thousands of labels!

"branch_attrs": {
    "labels": {
        "aa": "ORF8: L84S",
        "clade": "19B",

Proposal for branch labels

Simplest (and most realistic short-term) would be a small augur script within nCoV. The better long-term solution would be to compute this within augur ancestral and augur translate and allow them to define branch labels which are subsequently exported. See nextstrain/augur#720 for a proposal of how to define branch labels in node_data JSONs.

Current situation for mutation display
Currently dataset JSONs report mutations on a branch per-nucleotide and per-gene. This data typically comes from augur ancestral and augur translate, respectively, although for nCoV we are using nextclade for the AA changes. Whether Ns are included is influenced by parameters to those augur commands. The JSON structure looks like so:

"branch_attrs": {
    "mutations": {
        "nuc": [ "T1N", "T2N", ...],
        "S": ["T716I"]
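As a small illustration of this structure, the sketch below parses each `<FROM><POS><TO>` string into its parts. The `Mutation` tuple and the function name are made up for this example, and the pattern assumes single-character states (bases, amino acids, `-` for gaps, `*` for stops).

```python
import re
from typing import NamedTuple

class Mutation(NamedTuple):
    gene: str  # "nuc" or a gene name such as "S"
    ref: str   # state before the change
    pos: int   # 1-based position
    alt: str   # state after the change ("-" is a gap, "N" is unknown)

_PATTERN = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")

def parse_branch_mutations(mutations):
    """Turn {"nuc": ["T1N", ...], "S": ["T716I"]} into a flat list of Mutation tuples."""
    parsed = []
    for gene, muts in mutations.items():
        for m in muts:
            match = _PATTERN.match(m)
            if match is None:
                raise ValueError(f"unrecognised mutation string: {m!r}")
            ref, pos, alt = match.groups()
            parsed.append(Mutation(gene, ref, int(pos), alt))
    return parsed
```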

The tooltips used in auspice behave as follows:

| tooltip | mutations shown | Ns | gaps (deletions) | insertions |
|---|---|---|---|---|
| branch / tip hover | subset listed, Ns ignored | count shown | treated as mut¹ | N/A² |
| branch shift+click | all shown, incl gaps | treated as muts | treated as mut¹ | N/A² |
| tip click | all listed w.r.t. root³ | click to copy list | treated as mut¹ | N/A² |

¹ No grouping is performed, e.g. if we have deletions of pos a, a+1, a+2 then we report three "mutations" (to -).
² We have no way for auspice to parse insertions!
³ Reversions are removed from mutations listed here.
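To make footnote ¹ concrete, here is a rough sketch of how per-site deletions on one branch could be grouped into runs. The function name is hypothetical and this is not how Auspice currently behaves; it only assumes deletions are written as `<FROM><POS>-`.

```python
def group_deletion_runs(mutations):
    """Collapse per-site deletions such as 'A10-', 'T11-', 'G12-' into (start, end) runs."""
    # Strip the leading base and trailing "-" to get the position.
    positions = sorted(int(m[1:-1]) for m in mutations if m.endswith("-"))
    runs = []
    for pos in positions:
        if runs and pos == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], pos)  # extend the current run
        else:
            runs.append((pos, pos))        # start a new run
    return runs
```

With this, deletions at positions a, a+1 and a+2 would be reported as one run `(a, a+2)` rather than three separate "mutations".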

Proposed Display of Mutations
We'd like to be able to display mutations, deletions and insertions grouped by certain categories, however where to draw the boundaries isn't clear:

  1. Homoplasies
  2. Reversions to parent state. These may also be homoplasies!
  3. Reversions to root state (which we assume is the reference used for basecalling).
  4. Novel (mutation has only happened once and is not a reversion to the root).

(Detecting runs of deletions/insertions which are homoplasic isn't trivial, but it is if we consider them as a series of individual events, as we currently do.)
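One possible way to compute these categories, treating every site change as an individual event, is sketched below. It is only an illustration under several assumptions: the tree uses the Auspice node layout, a "homoplasy" is the same mutation string appearing on more than one branch, a "reversion to parent state" undoes the most recent ancestral change at that site, and the root state is inferred from the first ancestral change at a site. All function and category names are hypothetical.

```python
import re
from collections import Counter

MUT = re.compile(r"^([A-Z-])(\d+)([A-Z-])$")

def nuc_mutations(node):
    """(raw string, from, position, to) for each simple substitution on this branch."""
    raw = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
    return [(m, *MUT.match(m).groups()) for m in raw if MUT.match(m)]

def walk(node, ancestral=()):
    """Yield (node, mutations on all ancestral branches, ordered root to parent)."""
    yield node, ancestral
    below = ancestral + tuple(nuc_mutations(node))
    for child in node.get("children", []):
        yield from walk(child, below)

def categorise(tree):
    """Map (node name, mutation) -> set of category names. Categories may overlap."""
    counts = Counter(m for node, _ in walk(tree) for m, *_ in nuc_mutations(node))
    result = {}
    for node, ancestral in walk(tree):
        root_state, previous_state = {}, {}
        for _, frm, pos, _to in ancestral:
            root_state.setdefault(pos, frm)  # state before the first change at pos
            previous_state[pos] = frm        # state before the most recent change
        for m, frm, pos, to in nuc_mutations(node):
            cats = set()
            if counts[m] > 1:
                cats.add("homoplasy")
            if previous_state.get(pos) == to:
                cats.add("reversion_to_parent")
            if root_state.get(pos) == to:
                cats.add("reversion_to_root")
            if counts[m] == 1 and "reversion_to_root" not in cats:
                cats.add("novel")
            result[(node.get("name"), m)] = cats
    return result
```

Returning a set per mutation keeps the overlap visible: a reversion can also be a homoplasy, so the categories are not forced to be disjoint.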

These could be computed within auspice itself, unless there is some reason to leverage nextclade for this?

Relatedly, we should definitely move towards the aesthetics employed by nextclade for displaying mutations with badges!

What about Insertions?

It'd be wise to consider how insertions could be provided here, but this may be worthy of a separate issue (and shouldn't hold up implementing the previous sections). VCF-like style would be <FROM><POS><TO> where <TO> is >1 characters and includes <FROM>. For instance, an insertion of TAG after base C at position 3 would be C3CTAG (this is example 5.2.2 from the VCF reference v4.2). However it's worth noting our deletion syntax doesn't follow VCF style. A style which follows our deletion syntax might be more along the lines of 3TAG. It's not clear to me how to reference subsequent changes in the insertion (e.g. if a later event modifies the inserted bases).
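The two notations can be compared with a short sketch. Both function names are hypothetical, and the compact `3TAG` form is only the possibility floated above, not an agreed format. Converting from the VCF-like form just drops the unchanged anchor base, while converting back needs the reference sequence to look that base up.

```python
import re

def vcf_insertion_to_compact(vcf_style):
    """'C3CTAG' -> '3TAG': drop the unchanged anchor base from a VCF-like insertion."""
    match = re.fullmatch(r"([ACGTN])(\d+)([ACGTN]+)", vcf_style)
    if match is None:
        raise ValueError(f"cannot parse {vcf_style!r}")
    anchor, pos, alt = match.groups()
    if len(alt) < 2 or not alt.startswith(anchor):
        raise ValueError("not an insertion: <TO> must start with <FROM> and be longer")
    return f"{pos}{alt[len(anchor):]}"

def compact_to_vcf_insertion(compact, reference):
    """'3TAG' -> 'C3CTAG', looking up the anchor base in the reference (1-based)."""
    match = re.fullmatch(r"(\d+)([ACGTN]+)", compact)
    if match is None:
        raise ValueError(f"cannot parse {compact!r}")
    pos, inserted = int(match.group(1)), match.group(2)
    if not 1 <= pos <= len(reference):
        raise ValueError("position has no base to its left in this reference")
    anchor = reference[pos - 1]
    return f"{anchor}{pos}{anchor}{inserted}"
```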

@AngieHinrichs

> For instance, an insertion of TAG after base C at position 3 would be C3CTAG (this is example 5.2.2 from the VCF reference v4.2).

My 2c: please don't follow VCF off that cliff. :) I really, really wish VCF didn't include the base to the left of indels. It's distracting to include a base that does not change, it necessitated an additional special rule for insertions at the beginning of the sequence (the unchanging base to the right must be appended on the right, further ugh), and it complicates code that has to translate between VCF and other formats (for example requiring reference sequence input to convert to VCF when it would otherwise be unnecessary). The empty string is a perfectly valid <FROM> for a point insertion IMO. :) Some formats use "-" to avoid using the empty string. There are multiple better alternatives to VCF's base-to-the-left. </rant>

The rest of it sounds great! :)

@emmahodcroft
Member

I only worked with VCFs for a short while a few years ago but I'd second Angie here, it drove me nuts!

@rneher
Member

rneher commented Jan 21, 2022

In Nextclade, we use <position-before-insertion><inserted-sequence>, like list insertion conventions. The position to the left can be 0 (one-based indexing) or -1 (zero-based indexing) when the insertion precedes the reference.
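A small sketch of this convention, applied to a toy reference, is shown below. The function name and the exact string format (position immediately followed by the inserted bases) are assumptions based on the description in this comment, not Nextclade's actual code or output format.

```python
import re

def apply_insertion(reference, insertion, one_based=True):
    """Insert a sequence after the given reference position.

    '3TAG' with one-based indexing inserts TAG after the third base;
    position 0 (one-based) or -1 (zero-based) inserts before the reference.
    """
    match = re.fullmatch(r"(-?\d+)([A-Za-z]+)", insertion)
    if match is None:
        raise ValueError(f"cannot parse {insertion!r}")
    pos, inserted = int(match.group(1)), match.group(2)
    # Number of reference bases that precede the inserted sequence.
    n_before = pos if one_based else pos + 1
    if not 0 <= n_before <= len(reference):
        raise ValueError("position is outside the reference")
    return reference[:n_before] + inserted + reference[n_before:]
```

For example, `apply_insertion("AACGT", "3TAG")` places `TAG` between the `C` and the `G`, and no special rule is needed for insertions at the very start of the sequence.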

@jameshadfield jameshadfield self-assigned this Jan 26, 2022
@jameshadfield
Member

Update:

I have this working for the on-click info panel and just need to extend it to the on-hover panel as well. I think subsequent PRs can then:

  • collect runs of Ns / gaps into one visual element
  • implement the nice badges from Nextclade (I've got a proof of principle working here)
  • consider insertions. This has to start in augur I think.

image
image

@emmahodcroft
Member

This looks super awesome and incredibly useful James!

jameshadfield added a commit that referenced this issue Jan 28, 2022
These changes were motivated by issue #1444 [1] where separating
mutations into categories can aid both QC and biological interpretation.
I chose to use "mutations" to refer to mutations observed on a branch
and "changes" to refer to the collection of mutations between a tip
and the root.

The categories are not necessarily disjoint, as a mutation back
to the root will also be a homoplasy or a unique mutation.

Note that changes between a tip sequence and the root aren't grouped
into homoplasies, as a single change (A→C) may be the result of multiple
mutations (e.g. A→B→C) and thus we would need to check the tip state of
each position which is difficult with the current code.

On-hover panels are left unchanged in this commit.

[1] #1444
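The distinction between per-branch "mutations" and tip-to-root "changes" can be illustrated with a short sketch. It is not the code from the commit: the function name and input shape (a list of per-branch mutation lists ordered root to tip) are assumptions for this example. Successive mutations at one site collapse into a single net change, and sites that end up back at the root state drop out.

```python
import re

MUT = re.compile(r"^([A-Z-])(\d+)([A-Z-])$")

def changes_to_root(path_mutations):
    """Collapse mutations along a root-to-tip path into net changes versus the root."""
    first_from, last_to = {}, {}
    for branch in path_mutations:  # branches ordered from root to tip
        for m in branch:
            match = MUT.match(m)
            if not match:
                continue  # skip anything that is not a simple substitution
            frm, pos, to = match.group(1), int(match.group(2)), match.group(3)
            first_from.setdefault(pos, frm)  # root state at this site
            last_to[pos] = to                # tip state at this site
    return [
        f"{first_from[pos]}{pos}{last_to[pos]}"
        for pos in sorted(first_from)
        if first_from[pos] != last_to[pos]   # reversions to root cancel out
    ]
```

Because only the first and last state at each site survive, the result says nothing about how many mutations produced a change, which is why these changes are not grouped into homoplasies.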
@victorlin victorlin moved this from New to In Progress in Nextstrain planning (archived) Feb 2, 2022
jameshadfield added a commit that referenced this issue Feb 14, 2022
@corneliusroemer
Member Author

I think this has been part of Auspice for a while now @jameshadfield

It's a cool feature that's been super useful. I just wanted to check with you that this is actually done before closing.

@jameshadfield
Member

Closed by #1449


5 participants