Suggestion to mask S:31 (21653-21655) for KP.3.1.1 (and other S:31del lineages) #1808
Comments
I'm not sure it is a good idea. In case of recombination we won't see that. We know that it is an artifact, but let's see how @corneliusroemer and @AngieHinrichs want to handle this. |
The problem is that I see a lot of further lineages (KP.3.1.1+Spike) being placed under this, messing up the tree. |
This problem also appears in KP.2.15 and LB.1.2 sub-trees now. cov-lineages/pango-designation#2711 |
C21654T is the (or a) defining mutation of a bunch of JN.1 sublineages:
If all of those are artefacts (?), then the lineages JN.1.20, JN.1.58.3, JN.2.1, KP.3.1.3 MK.1, and JN.1.1.7 (on big C11747T polytomy) need to be retracted because they have no other defining mutation.
|
S:31del (or S:S31del) is listed for several JN.1 descendant lineages, and I see that includes KP.3.1.1, but as far as I can see they don't seem to be ancestors or siblings of the lineages with S:S31F above (except for KP.2.14 above & KP.2.15 below, and KP.3.1.3 above & KP.3.1.1 below).
@corneliusroemer how have you been distinguishing between S:31del and S:S31F? |
I think OP suggests masking 21653-21655 only where we know that these positions are deleted. All S:31- branches that have been designated have at least one nuc substitution that defines them - because Usher is blind to deletions. The nuc substitutions are how you're annotating these lineages I think @AngieHinrichs, is that correct?

S:S31F does show up independently and it's not usually a sequencing artefact - but I think it can happen that S:S31F branches get wrongly placed into S:S31- branches - something that shouldn't happen except for recombination (or, very unlikely, an exact insertion). @aviczhl2 is right that Denmark struggles with indels. So in this case it is very likely an artefact that should be removed.

Whether to mask generally or not - I'm not sure. What we should really do is mask the position of this deletion in Danish sequences, as the Danish pipeline seems to frequently call the deletion as S:S31F instead.

@AngieHinrichs re your question how I find the S:31- lineages: I usually query for that deletion in GISAID/covSpectrum and place those sequences with S:31del in Usher. I essentially do a manual ancestral inference of the state at position S:31 like that to find the node where the deletion likely started to appear. Of course it's not perfect since Usher is blind to deletions, but it seems to work pretty well.

If you want to confirm that the designations are correct, you could create a simple TSV and drop it onto an Auspice view of an Usher subtree (Auspice can add additional metadata colorings from drag&dropped tsv/csv). CSV would look like this for example:
If you drop this, it would give you a new "coloring" called "S31_genotype" which one could use to find the branch on which the deletion appears to have started to happen. Does that make sense? |
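A minimal sketch of how such a metadata file could be generated, since the example file itself is not shown in this thread. It assumes Auspice's drag-and-drop format where the first column is headed `strain` and every other column becomes a coloring; the strain names and genotype labels below are hypothetical placeholders, not real sequences.

```python
import csv
import io


def s31_genotype_tsv(calls):
    """Build a drag-and-drop metadata TSV from a {strain: genotype} mapping.

    The "S31_genotype" column name follows the coloring name mentioned in
    the comment above; the "strain" header is an assumption about what
    Auspice expects as the first column.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["strain", "S31_genotype"])
    for strain, genotype in sorted(calls.items()):
        writer.writerow([strain, genotype])
    return buffer.getvalue()


# Hypothetical strain names and calls, for illustration only.
example_calls = {
    "hypothetical/strain-A/2024": "del",
    "hypothetical/strain-B/2024": "F",
    "hypothetical/strain-C/2024": "S",
}

if __name__ == "__main__":
    print(s31_genotype_tsv(example_calls), end="")
```

The genotype calls would come from a GISAID/covSpectrum query as described above; the strain names have to match the tip names in the Usher subtree for the coloring to apply.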
I don't think they are all artefacts - it's possible but unlikely, because it seems to be only the Danish sequences that have the miscalling of deletion -> F. Whenever there is a natural country distribution and multiple labs, it's unlikely that an artefact is responsible. If it were, we should of course retract the lineage, but I haven't seen any convincing evidence to that end (though I also haven't looked at those again since designating). |
I'm not suggesting to mask S31F for everything on JN.1. In fact it is one of the beneficial convergent mutations that appear many times in the real world. These designated S31- lineages are (for now): C28714T branch of KP.2.3 Please mask 21653-21655 for seqs on these lineages (or alter the Danish 31F seqs belonging to these lineages to 31-). |
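A rough sketch of the lineage-restricted masking being requested, assuming each sequence is already in reference coordinates (same length as the reference) and carries a lineage label. The function names and the lineage set are hypothetical; only KP.3.1.1 is taken from the issue title, and this is not how the UShER pipeline itself applies masks.

```python
def mask_region(genome, start=21653, end=21655, mask_char="N"):
    """Replace 1-based inclusive positions start..end with mask_char."""
    if not 1 <= start <= end <= len(genome):
        raise ValueError("mask region lies outside the sequence")
    return genome[: start - 1] + mask_char * (end - start + 1) + genome[end:]


# Assumption: the full list of S:31del lineages is maintained elsewhere;
# KP.3.1.1 is the one named in the issue title.
S31DEL_LINEAGES = {"KP.3.1.1"}


def mask_if_s31del(genome, lineage, lineages=S31DEL_LINEAGES):
    """Mask 21653-21655 only for sequences assigned to an S:31del lineage."""
    return mask_region(genome) if lineage in lineages else genome
```

Sequences on other lineages pass through unchanged, so genuine S:S31F calls elsewhere on JN.1 stay visible.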
@AngieHinrichs It is causing more and more trouble now. usher |
I think the queries don't miss any of them, the only real issue will be if a fast lineage emerges in Denmark that extensively misses 31del. |
Query won't miss but the usher tree will be very messy given each lineage being separated at 2 different places. |
@AngieHinrichs This bug is more harmful than normal artefacts. Normal artefacts can only attract seqs without coverage at that position; seqs with correct coverage won't be affected unless a stable flip-flop reversion branch is formed. However, this bug can attract ALL SEQS, as ALL SEQS have no coverage at S:31 (because it is deleted), making the bug more serious in theory. |
To me the bug is not that serious: the tree can attract only S:S31F sequences, not all. And masking it could instead hide a real recombination event, given that a lot of lineages are expanding with 31P and 31F. |
Nay. The tree can attract all 31del seqs as they have no coverage at S:31. No coverage = can be placed anywhere. I believe usher works this way @AngieHinrichs. It does not only attract 31F seqs. For example, #1881 is attracted despite not having S:S31F. |
Agreed with @aviczhl2. I wonder though: why hasn't Usher simply inferred 31 to be F already for all of KP.3.1.1? That would be the parsimonious solution. Reason is that some KP.3.1.1 are wrongly called reference (instead of N or deletion) - and that artefact is more common than the Danish one. Right? I agree masking would make total sense due to the fact there'll be massive messiness that will only increase. |
Let me explain. 1: There are not many real S31F artefacts for S31del branches. |
@aviczhl2 that's not enough to explain the screw-up. Because if it was always either deletion or F, Usher would infer everything to be F and there would be no messy tree. The requirement for messiness is that both types of artefacts exist here: wild type and F, instead of the correct deletion. As long as it's only deletion plus one other base, it's ok, it will infer the base. The messiness here comes from 2 base artefacts occurring. Does that make sense? |
Yeah. I think you're right. There are also the traditional base-filling artefacts that fill in S for 31del. |
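The parsimony argument above can be illustrated with a toy Fitch pass over one site. This is a sketch under simplifying assumptions (a binary tree written as nested tuples, `N` compatible with any base), not UShER's actual placement algorithm. `T` stands for the C21654T call behind S:S31F, `C` for a wild-type call and `N` for a deletion seen as missing data.

```python
BASES = frozenset("ACGT")


def fitch(tree):
    """Return (candidate_states, substitutions) for one site on a binary tree.

    A tree is either a leaf base ("A", "C", "G", "T" or "N") or a
    2-tuple of subtrees.
    """
    if isinstance(tree, str):
        # A missing call places no constraint on the ancestral state.
        return (BASES if tree == "N" else frozenset(tree)), 0
    left_states, left_cost = fitch(tree[0])
    right_states, right_cost = fitch(tree[1])
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # Children disagree: one substitution is needed on some branch.
    return left_states | right_states, left_cost + right_cost + 1


# Only deletions (N) plus the F artefact (T): everything resolves to T.
only_f = (("N", "T"), ("N", "N"))
# Deletions plus both artefacts (T and wild-type C): states conflict.
both = (("N", "T"), ("C", "N"))

if __name__ == "__main__":
    print(fitch(only_f))
    print(fitch(both))
```

In the first tree the clade resolves to a single base with no substitution inside it; in the second a substitution is forced and the uncalled tips can fall on either side, which is the messiness described.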
In the current situation, with possible recombinations with fast 22N+ 31F 31P, I would close this. ping @aviczhl2 |
I don't think it is a good reason. We can identify those potential recombs using other mutations, and having every KP.3.1.1 descendant branch divided into two certainly won't help us identify those. |
@AngieHinrichs |
Now the largest KP.3.1.1 branch is pruned but others still exist |
Last week I pruned all sequences in BA.2.86 with C21654T, and added them back using the new alignment method, nextclade instead of mafft. That should make the treatment of these at least more consistent. Unfortunately the deletions are still treated as 'N' -- I think it would be better to treat deletions as reference to prevent substitutions from being inferred. I will look into masking deletions to reference instead of N when passing input to usher-sampled. |
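As a sketch of what "deletions to reference instead of N" could mean at the sequence level, assuming a pairwise alignment in which deleted bases appear as `-` in the query. The function name and the gap convention are illustrative assumptions, not the actual preprocessing in front of usher-sampled.

```python
def fill_deletions_with_reference(ref_aln, query_aln, gap="-"):
    """Replace gap characters in the aligned query with the reference base.

    Both strings must be the same length (one pairwise alignment).
    Columns where the reference itself has a gap (insertions in the
    query) are left untouched.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        r if q == gap and r != gap else q
        for r, q in zip(ref_aln, query_aln)
    )
```

With this treatment a deleted codon carries no information that could be imputed as a substitution, because it looks identical to the reference at those positions.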
Only as long as what was deleted doesn't contain a mutation? I think as long as there are 2 different bases circulating somewhere where there's also a deletion, there's just no clean solution as long as Usher doesn't annotate Ns and treat them specially (not imputing)? It might be that I haven't thought it through though, this is my gut feeling only. |
Good point, if there's a substitution and then some descendants get a deletion at the same position, then treating deletion as reference instead of N would create the appearance of a reversion at that position. When there's an assembly pipeline that has trouble identifying deletions, I'm not sure what it would do in that situation. I still think the "reversion" would make for a bit more stability than Ns, which can be imputed to anything. |
I now remove deletions from the input to UShER so that they're no longer treated as Ns. I pruned all sequences that had (or were imputed to have) C21654T and re-added them, and now there are very few sequences on the KP.3.1.1 > C21654T branch, and mostly from Denmark. |
Thank you Angie, great work! Yeah, Denmark not only calls deletions badly but calls them with a different AA, so "it is right" to see them there. |
Dead please close it |
There seems to be a large KP.3.1.1+S:S31F branch, while KP.3.1.1 should have S:31del.
That branch is driven by Denmark seqs, whose pipeline does not handle S:S31del well. When querying C12616T, A13121T, C21654T, all seqs are from Denmark, so it is clearly an artifact.
However, that branch now seems to attract seqs with no coverage at the S:31 positions, forming a very large artifact S:S31F branch under KP.3.1.1.
@AngieHinrichs I suggest masking the deleted part of KP.3.1.1 (and other S:31del lineages) on usher.
https://nextstrain.org/fetch/genome-test.gi.ucsc.edu/trash/ct/subtreeAuspice20_genome_test_32f78_b095f0.json?label=id:node_5486714