Add reversions/back-mutations as within-Auspice-computed branch label #1444

corneliusroemer · 2022-01-17T13:53:32Z

Context
Reference backfilling is a big problem in SARS-CoV-2 sequences. All the information one needs to identify reversions back to reference is included in the auspice.json. This would for example allow me to quickly check that a Nextclade reference tree doesn't contain any reversions.

Description
As a user, I would like nucleotide reversions (either only to reference, or to any previous state) to be highlightable on the tree, for example as a branch label, like we do with clades or sometimes Spike mutations.

Examples
UShER already implements this feature (they must do it in the backend), so there's clearly some interest in it beyond me.

Possible solution
I could write a custom Python script that post-processes an auspice.json to add this as a branch annotation. But it's silly to do this with a script when it could be implemented within Auspice for all trees, for all users.
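For illustration, the post-processing script mentioned above could look roughly like the sketch below. It assumes the dataset JSON keeps per-branch nucleotide mutations under branch_attrs.mutations.nuc, and it infers the root state of a site from the "from" base of the first mutation seen at that site on the path from the root. The function name label_reversions and the label key "reversion" are made up for this example and are not part of Auspice or augur.

```python
import re

# <from><1-based position><to>, e.g. "C241T"; "-" is a gap, "N" an ambiguous base
MUTATION = re.compile(r"^([A-Z-])(\d+)([A-Z-])$")


def label_reversions(tree, label_key="reversion"):
    """Add a branch label listing nucleotide mutations that restore the root state.

    `tree` is the nested node dict of a dataset JSON. The label key is a
    hypothetical name chosen for this sketch.
    """
    # Iterative traversal so that very deep trees do not hit the recursion limit.
    stack = [(tree, {})]
    while stack:
        node, inherited = stack.pop()
        root_state = dict(inherited)  # position -> base at the root (inferred)
        reversions = []
        muts = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
        for mut in muts:
            match = MUTATION.match(mut)
            if match is None:
                continue  # skip anything that is not a simple substitution string
            old, pos, new = match.group(1), int(match.group(2)), match.group(3)
            if pos not in root_state:
                # First change at this site on this path: "from" is the root base.
                root_state[pos] = old
            elif new == root_state[pos]:
                reversions.append(mut)
        if reversions:
            labels = node.setdefault("branch_attrs", {}).setdefault("labels", {})
            labels[label_key] = ", ".join(reversions)
        for child in node.get("children", []):
            stack.append((child, root_state))
    return tree
```

To try it, load the file with json.load, pass the object stored under "tree" (assuming a single tree), and write the result back with json.dump. Note that a run such as T1N followed by N1T would also be reported as a reversion here; whether that is desirable is a separate question.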

jameshadfield · 2022-01-19T04:34:18Z

Thanks @corneliusroemer — I completely agree and think this feature will immensely help with interpreting trees, especially Omicron. I’m going to expand this issue slightly to encompass changes we've discussed regarding display of mutations more generally.

Current situation for branch labels
Branch labels must be defined within the dataset JSON, and we typically do this for clade and AA changes. Auspice only contains one piece of special behavior here - if the branch label key is aa then we selectively display the labels to avoid showing thousands of labels!

"branch_attrs": {
    "labels": {
        "aa": "ORF8: L84S",
        "clade": "19B",

Proposal for branch labels

Simplest (and most realistic short-term) would be a small augur script within nCoV. The better long-term solution would be to compute this within augur ancestral and augur translate and allow them to define branch labels which are subsequently exported. See nextstrain/augur#720 for a proposal of how to define branch labels in node_data JSONs.

Current situation for mutation display
Currently dataset JSONs report mutations on a branch per-nucleotide and per-gene. This data typically comes from augur ancestral and augur translate, respectively, although for nCoV we are using nextclade for the AA changes. Whether Ns are included is influenced by parameters to those augur commands. The JSON structure looks like so:

"branch_attrs": {
    "mutations": {
        "nuc": [ "T1N", "T2N", ...],
        "S": ["T716I"]

The tooltips used in auspice behave as follows:

tooltip	mutations shown	Ns	gaps (deletions)	insertions
branch / tip hover	subset listed, Ns ignored	count shown	treated as mut¹	N/A²
branch shift+click	all shown, incl gaps	treated as muts	treated as mut¹	N/A²
tip click	all listed w.r.t. root³	click to copy list	treated as mut¹	N/A²

¹ No grouping is performed, e.g. if we have deletions of pos a, a+1, a+2 then we report three "mutations" (to -).
² we have no way for auspice to parse insertions!
³ reversions are removed from mutations listed here
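
As a rough illustration of the grouping that footnote ¹ says is not performed, per-position deletions on a branch could be collapsed into runs as sketched below. The function name group_deletions is hypothetical, and the sketch assumes deletions are encoded as <FROM><POS>- strings in the "nuc" list, as described above.

```python
import re

# A single-position deletion such as "T100-" (<from><1-based position>-)
DELETION = re.compile(r"^[A-Z](\d+)-$")


def group_deletions(mutations):
    """Collapse per-position deletions into inclusive (start, end) runs."""
    matches = (DELETION.match(m) for m in mutations)
    positions = sorted({int(m.group(1)) for m in matches if m})
    runs = []
    for pos in positions:
        if runs and pos == runs[-1][1] + 1:
            # Extends the current run by one position.
            runs[-1] = (runs[-1][0], pos)
        else:
            runs.append((pos, pos))
    return runs
```

The same idea would apply to runs of Ns by swapping the trailing character in the pattern.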

Proposed Display of Mutations
We'd like to be able to display mutations, deletions and insertions grouped by certain categories, however where to draw the boundaries isn't clear:

Homoplasies
Reversions to parent state. These may also be homoplasies!
Reversions to root state (which we assume is the reference used for basecalling).
Novel (mutation has only happened once and is not a reversion to the root).

(Detecting runs of deletions/insertions which are homoplasic isn't trivial, but it is if we consider them as a series of individual events, as we currently do.)
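
A minimal sketch of this categorisation, treating every mutation as an individual event: count how many branches carry each mutation string, then tag it. It assumes a root_state mapping (position to root base) is supplied by the caller, and it leaves out "reversion to parent state", which needs per-path bookkeeping. The function names and category strings are invented for this example and are not Auspice's.

```python
from collections import Counter


def iter_nodes(tree):
    """Yield every node of a nested {"children": [...]} tree."""
    stack = [tree]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.get("children", []))


def nuc_mutations(node):
    return node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])


def categorise(tree, root_state):
    """Return {mutation: set of categories}.

    `root_state` maps a 1-based position to the base at the root
    (an assumed input, e.g. derived from a root sequence).
    """
    counts = Counter(m for node in iter_nodes(tree) for m in nuc_mutations(node))
    result = {}
    for mut, n_branches in counts.items():
        # Assumes single-character <from> and <to>, e.g. "C241T".
        pos, new = int(mut[1:-1]), mut[-1]
        categories = set()
        if n_branches > 1:
            categories.add("homoplasy")
        if root_state.get(pos) == new:
            categories.add("reversion_to_root")
        elif n_branches == 1:
            categories.add("novel")
        result[mut] = categories
    return result
```

Returning a set per mutation keeps the categories non-exclusive, so a mutation can be both a homoplasy and a reversion to root.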

These could be computed within auspice itself, unless there is some reason to leverage nextclade for this?

Relatedly, we should definitely move towards the aesthetics employed by nextclade for displaying mutations with badges!

What about Insertions?

It'd be wise to consider how insertions could be provided here, but this may be worthy of a separate issue (and shouldn't hold up implementing the previous sections). VCF-like style would be <FROM><POS><TO> where <TO> is >1 characters and includes <FROM>. For instance, an insertion of TAG after base C at position 3 would be C3CTAG (this is example 5.2.2 from the VCF reference v4.2). However it's worth noting our deletion syntax doesn't follow VCF style. A style which follows our deletion syntax might be more along the lines of 3TAG. It's not clear to me how to reference subsequent changes in the insertion (e.g. if a later event modifies the inserted bases).
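
To make the two notations concrete, here is a small sketch that converts the VCF-like form (C3CTAG) into the shorter position-plus-inserted-bases form (3TAG). Neither form is an agreed format at this point, and the function name is hypothetical.

```python
import re

# VCF-like insertion: <FROM><POS><TO>, where <TO> starts with <FROM>
VCF_INSERTION = re.compile(r"^([ACGTN])(\d+)([ACGTN]+)$")


def vcf_insertion_to_compact(text):
    """Convert e.g. "C3CTAG" to "3TAG" (position before insertion + inserted bases)."""
    match = VCF_INSERTION.match(text)
    if match is None:
        raise ValueError(f"not a <FROM><POS><TO> string: {text!r}")
    anchor, pos, alt = match.groups()
    if len(alt) < 2 or not alt.startswith(anchor):
        raise ValueError(f"not an insertion: {text!r}")
    # Drop the unchanged anchor base; only the inserted bases remain.
    return f"{pos}{alt[len(anchor):]}"
```

Going the other way requires the reference base at that position, which the compact form does not carry.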

AngieHinrichs · 2022-01-21T00:14:43Z

For instance, an insertion of TAG after base C at position 3 would be C3CTAG (this is example 5.2.2 from the VCF reference v4.2).

My 2c: please don't follow VCF off that cliff. :) I really, really wish VCF didn't include the base to the left of indels. It's distracting to include a base that does not change, it necessitated an additional special rule for insertions at the beginning of the sequence (the unchanging base to the right must be appended on the right, further ugh), and it complicates code that has to translate between VCF and other formats (for example requiring reference sequence input to convert to VCF when it would otherwise be unnecessary). The empty string is a perfectly valid <FROM> for a point insertion IMO. :) Some formats use "-" to avoid using the empty string. There are multiple better alternatives to VCF's base-to-the-left. </rant>

The rest of it sounds great! :)

emmahodcroft · 2022-01-21T08:48:25Z

I only worked with VCFs for a short while a few years ago but I'd second Angie here, it drove me nuts!

rneher · 2022-01-21T09:13:37Z

In Nextclade, we use <position-before-insertion><inserted-sequence>, like list insertion conventions. That position to the left can be 0 (one-based indexing) or -1 (zero-based indexing) when the insertion precedes the reference.

jameshadfield · 2022-01-26T05:17:37Z

Update:

I have this working for the on-click info panel, just need to extend it to the on-hover panel as well. I think subsequent PRs can then

collect runs of Ns / gaps into one visual element
implement the nice badges from Nextclade (I've got a proof of principle working here)
consider insertions. This has to start in augur I think.

emmahodcroft · 2022-01-26T08:34:14Z

This looks super awesome and incredibly useful James!

These changes were motivated by issue #1444 [1], where separating mutations into categories can aid both QC and biological interpretation. I chose to use "mutations" to refer to mutations observed on a branch and "changes" to refer to the collection of mutations between a tip and the root. The categories are not necessarily disjoint, as a mutation back to the root will also be a homoplasy or a unique mutation.

Note that changes between a tip sequence and the root aren't grouped into homoplasies, as a single change (A→C) may be the result of multiple mutations (e.g. A→B→C) and thus we would need to check the tip state of each position, which is difficult with the current code. On-hover panels are left unchanged in this commit.

[1] #1444
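
The distinction between branch "mutations" and tip-to-root "changes" can be sketched as follows: walk the branches from root to tip, remember the first "from" base and the last "to" base per position, and drop positions that end up back at the root state. This is an illustrative sketch with a made-up function name, not the code from the commit.

```python
def changes_to_root(path_mutations):
    """Collapse per-branch mutations along a root-to-tip path into net changes.

    `path_mutations` is a list of mutation lists, ordered from root to tip.
    Positions that return to their root base are omitted.
    """
    first_from = {}  # position -> base before the first mutation on the path
    last_to = {}     # position -> base after the last mutation on the path
    for branch in path_mutations:
        for mut in branch:
            # Assumes single-character <from> and <to>, e.g. "A10C".
            old, pos, new = mut[0], int(mut[1:-1]), mut[-1]
            first_from.setdefault(pos, old)
            last_to[pos] = new
    return [
        f"{first_from[pos]}{pos}{last_to[pos]}"
        for pos in sorted(first_from)
        if first_from[pos] != last_to[pos]
    ]
```

With this representation a single change such as A10G says nothing about how many mutations produced it, which is why homoplasy grouping is not applied to changes.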

corneliusroemer · 2024-01-16T16:07:54Z

I think this has been part of Auspice for a while now @jameshadfield

It's a cool feature that's been super useful. But just wanted to check with you this is actually done before closing.

jameshadfield · 2024-01-18T20:58:02Z

Closed by #1449

corneliusroemer added the enhancement New feature or request label Jan 17, 2022

jameshadfield self-assigned this Jan 26, 2022

jameshadfield mentioned this issue Jan 28, 2022

categorise mutations #1449

Merged

victorlin added this to Nextstrain planning (archived) Feb 2, 2022

victorlin moved this to New in Nextstrain planning (archived) Feb 2, 2022

victorlin moved this from New to In Progress in Nextstrain planning (archived) Feb 2, 2022

jameshadfield closed this as completed Jan 18, 2024
