Suggestion to mask S:31 (21653-21655) for KP.3.1.1 (and other S:31del lineages) #1808

Closed
xz-keg opened this issue Aug 1, 2024 · 29 comments
Labels: Discussion, Usher (Issues with usher related problems)

Comments

@xz-keg
Contributor

xz-keg commented Aug 1, 2024

There seems to be a large KP.3.1.1+S:S31F branch, although KP.3.1.1 should have S:31del.

That branch is driven by seqs from Denmark, whose pipeline does not handle S:S31del well. When querying C12616T, A13121T, C21654T, all seqs are from Denmark, so this is clearly an artifact.

However, that branch now seems to attract seqs with no coverage at the S:31 positions, forming a very large artifact S:S31F branch under KP.3.1.1.

@AngieHinrichs, I suggest masking the deleted part of KP.3.1.1 (and other S:31del lineages) on usher.

https://nextstrain.org/fetch/genome-test.gi.ucsc.edu/trash/ct/subtreeAuspice20_genome_test_32f78_b095f0.json?label=id:node_5486714
[image]

@xz-keg xz-keg changed the title from "Suggestion to mask S:31 (21653-21655) for KP.3.1.1" to "Suggestion to mask S:31 (21653-21655) for KP.3.1.1 (and other S:31del lineages)" Aug 1, 2024
@xz-keg xz-keg added the Usher Issues with usher related problems label Aug 1, 2024
@FedeGueli

I'm not sure it is a good idea. In case of recombination we won't see it. We know that it is an artifact, but let's see how @corneliusroemer and @AngieHinrichs want to handle this.

@xz-keg
Contributor Author

xz-keg commented Aug 2, 2024

> I'm not sure it is a good idea. In case of recombination we won't see it. We know that it is an artifact, but let's see how @corneliusroemer and @AngieHinrichs want to handle this.

The problem is that I see a lot of further lineages (KP.3.1.1+Spike) being placed under this, messing up the tree.

@xz-keg
Contributor Author

xz-keg commented Aug 6, 2024

This problem also appears in KP.2.15 and LB.1.2 sub-trees now.

cov-lineages/pango-designation#2711
cov-lineages/pango-designation#2712

@AngieHinrichs

C21654T is the (or a) defining mutation of a bunch of JN.1 sublineages:

  • JN.1.1.7 along with C11747T
  • JN.1.20
  • MB.1 (JN.1.49.1) along with T22928G
  • JN.1.58.3
  • JN.2.1
  • KP.2.7 along with T22795G
  • KP.2.14 along with G22113T
  • KP.3.1.3
  • MK.1 (KP.3.1.6.1)
  • KP.3.4.1 along with G3839A and C28674T
  • KP.4.2.1 along with C13862T

If all of those are artefacts (?), then the lineages JN.1.20, JN.1.58.3, JN.2.1, KP.3.1.3, MK.1, and JN.1.1.7 (on big C11747T polytomy) need to be retracted because they have no other defining mutation.

grep S:.31F ~/github/pango-designation/lineage_notes.txt

JN.1.1.7        Alias of B.1.1.529.2.86.1.1.1.7, S:S31F, from sars-cov-2-variants/lineage-proposals#1157
KP.2.7  Alias of B.1.1.529.2.86.1.1.11.1.2.7, S:S31F, after T22795G
KP.2.14 Alias of B.1.1.529.2.86.1.1.11.1.2.14, S:G184V, S:S31F, from sars-cov-2-variants/lineage-proposals#1578
KP.3.1.3        Alias of B.1.1.529.2.86.1.1.11.1.3.1.3, S:S31F,from sars-cov-2-variants/lineage-proposals#1576
MK.1    Alias of B.1.1.529.2.86.1.1.11.1.3.1.6.1, S:S31F, UK
KP.3.4.1        Alias of B.1.1.529.2.86.1.1.11.1.3.4.1, S:S31F, N:A134V, ORF1a:E1192K
KP.4.2.1        Alias of B.1.1.529.2.86.1.1.11.1.4.2.1, S:S31F
JN.1.20 Alias of B.1.1.529.2.86.1.1.20, S:S31F, directly on JN.1 polytomy
MB.1    Alias of B.1.1.529.2.86.1.1.49.1.1, S:F456V (T22928G), S:S31F
JN.1.58.3       Alias of B.1.1.529.2.86.1.1.58.3, S:S31F
JN.2.1  Alias of B.1.1.529.2.86.1.2.1, S:S31F, Sweden/Australia

@AngieHinrichs

S:31del (or S:S31del) is listed for several JN.1 descendant lineages, and I see that includes KP.3.1.1, but as far as I can see they don't seem to be ancestors or siblings of the lineages with S:S31F above (except for KP.2.14 above & KP.2.15 below, and KP.3.1.3 above & KP.3.1.1 below).

grep 31del ~/github/pango-designation/lineage_notes.txt

KP.1.1.3        Alias of B.1.1.529.2.86.1.1.11.1.1.1.3, C4999T, many with S:S31del, from sars-cov-2-variants/lineage-proposals#1502
KP.2.3  Alias of B.1.1.529.2.86.1.1.11.1.2.3, S:H146Q, ORF3a:K67N, S:31del, from sars-cov-2-variants/lineage-proposals#1459
KP.2.15 Alias of B.1.1.529.2.86.1.1.11.1.2.15, A10861G, S:31del, USA/Canada
KP.3.1.1        Alias of B.1.1.529.2.86.1.1.11.1.3.1.1, ORF1a:S4286C, C12616T, S:S31del, Spain, from sars-cov-2-variants/lineage-proposals#1563
KP.4.1.3        Alias of B.1.1.529.2.86.1.1.11.1.4.1.3, ORF1a:M598V, A5245G, S:31del 
LF.2    Alias of B.1.1.529.2.86.1.1.16.1.2, ORF1a:K247R, ORF3a:Y184H, many with S:S31del, from sars-cov-2-variants/lineage-proposals#1502
LF.4.1  Alias of B.1.1.529.2.86.1.1.16.1.4.1, S:31del, ORF1a:G519S, ORF8:Q29*, from sars-cov-2-variants/lineage-proposals#1590
MA.1    Alias of B.1.1.529.2.86.1.1.18.3.1, S:R190S, S:31del, from sars-cov-2-variants/lineage-proposals#1635
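
The two grep calls above can be folded into one pass that labels each lineage by how its notes line describes S:31. A minimal Python sketch, assuming each notes line is a lineage name followed by whitespace and a free-text description; the helper name `classify_s31` and the three sample lines (copied from the output above) are for illustration only:

```python
import re

# Match "S:S31F" and both spellings of the deletion ("S:31del", "S:S31del").
S31F = re.compile(r"S:S?31F")
S31DEL = re.compile(r"S:S?31del")

def classify_s31(lines):
    """Return {lineage: 'F' or 'del'} for notes lines that mention S:31."""
    result = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # lineage name, then description
        if len(parts) != 2:
            continue
        lineage, description = parts
        if S31DEL.search(description):
            result[lineage] = "del"
        elif S31F.search(description):
            result[lineage] = "F"
    return result

sample = [
    "KP.3.1.3\tAlias of B.1.1.529.2.86.1.1.11.1.3.1.3, S:S31F,from sars-cov-2-variants/lineage-proposals#1576",
    "KP.3.1.1\tAlias of B.1.1.529.2.86.1.1.11.1.3.1.1, ORF1a:S4286C, C12616T, S:S31del, Spain, from sars-cov-2-variants/lineage-proposals#1563",
    "KP.2.15\tAlias of B.1.1.529.2.86.1.1.11.1.2.15, A10861G, S:31del, USA/Canada",
]
s31_calls = classify_s31(sample)
```

On the real file this could be called as `classify_s31(open(path))`; comparing the unaliased names of the "F" and "del" entries would then show which ones are siblings.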

@corneliusroemer how have you been distinguishing between S:31del and S:S31F?

@corneliusroemer
Contributor

corneliusroemer commented Aug 7, 2024

I think OP suggests masking 21653-21655 only where we know that these are deleted. All S:31- branches that have been designated have at least one nuc substitution that defines them, because Usher is blind to deletions. The nuc substitutions are how you're annotating these lineages I think @AngieHinrichs, is that correct?

S:S31F does show up independently and it's not usually a sequencing artefact, but I think it can happen that S:S31F branches get wrongly placed into S:S31- branches, something that shouldn't happen except for recombination (or, very unlikely, an exact insertion).

@aviczhl2 is right that Denmark struggles with indels. So in this case it is very likely an artefact that should be removed.

Whether to mask generally or not - I'm not sure. What we should really do is mask the position of this deletion in Danish sequences as the Danish pipeline seems to frequently call the deletion as S:S31F instead.

@AngieHinrichs re your question how I find the S:31- lineages: I usually query for that deletion in GISAID/covSpectrum and place those sequences with S:31del in Usher. I essentially do a manual ancestral inference of the state at position S:31 like that to find the node where the deletion likely started to appear. Of course it's not perfect since Usher is blind to deletions but it seems to work pretty well.

If you want to confirm that the designations are correct, you could create a simple TSV and drop it onto an Auspice view of an Usher subtree (Auspice can add additional metadata colorings from drag&dropped tsv/csv)

CSV would look like this for example:

strain_name, S31_genotype
Denmark/DCGC-686892/2024|OZ120425.1|2024-06-24, F
...

If you drop this, it would give you a new "coloring" called "S31_genotype" which one could use to find the branch on which the deletion appears to have started to happen.

Does that make sense?
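
As an illustration, a file like the one described above can be generated from a mapping of strain names to S:31 calls. This is a minimal Python sketch; the function name `s31_metadata_tsv` is made up, and the first column header `strain` is an assumption (to my recollection Auspice expects `strain` or `name` there, so check the current Auspice docs before relying on it):

```python
import csv
import io

def s31_metadata_tsv(genotypes):
    """Render {strain name: S:31 genotype} as TSV text for drag-and-drop onto Auspice."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumed header: first column names the tip, second becomes the coloring.
    writer.writerow(["strain", "S31_genotype"])
    for strain, genotype in sorted(genotypes.items()):
        writer.writerow([strain, genotype])
    return buffer.getvalue()

tsv_text = s31_metadata_tsv({
    "Denmark/DCGC-686892/2024|OZ120425.1|2024-06-24": "F",
})
```

The strain names have to match the tip names in the Usher subtree exactly, including the `|accession|date` suffix, for the coloring to attach.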

@corneliusroemer
Contributor

By the way, I think that the reason that S:S31F is sometimes called instead of deletion is that the difference between S:S31F and deletion is the length of a stretch of T homopolymers:

[image: Brave Browser screenshot, 2024-08-07]

Essentially, the difference between S:S31- and S:S31F is just whether the stretch of Ts is of length 3 or 6. So I can see how some pipelines might get that wrong, especially if the pipeline is not very pegged to a reference but more de-novo like, which is good to avoid bias to reference, but in this case causes a different type of artefact.
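
The 3-versus-6 point can be checked on a small window. A minimal Python sketch, assuming the reference nucleotides for spike codons 30-32 (positions 21650-21658) are `AATTCTTTC`; that string is transcribed by hand and should be verified against NC_045512.2, and the helper names are made up:

```python
import re

# Assumed reference window: codons 30-32 of spike (AAT TCT TTC = N, S, F).
WINDOW_START = 21650
REFERENCE = "AATTCTTTC"

def longest_t_run(seq):
    """Length of the longest run of consecutive T in seq."""
    return max((len(run) for run in re.findall(r"T+", seq)), default=0)

def substitute(seq, pos, base):
    """Apply a single-base substitution at 1-based genome position pos."""
    i = pos - WINDOW_START
    return seq[:i] + base + seq[i + 1:]

def delete(seq, start, end):
    """Delete genome positions start..end (inclusive)."""
    return seq[:start - WINDOW_START] + seq[end - WINDOW_START + 1:]

s31f = substitute(REFERENCE, 21654, "T")   # C21654T, i.e. S:S31F
s31del = delete(REFERENCE, 21653, 21655)   # S:31del

t_run_s31f = longest_t_run(s31f)
t_run_s31del = longest_t_run(s31del)
```

Under that assumed window the substitution yields a run of six T and the deletion a run of three, so the two calls differ only in homopolymer length.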

@corneliusroemer
Contributor

> If all of those are artefacts (?), then the lineages JN.1.20, JN.1.58.3, JN.2.1, KP.3.1.3, MK.1, and JN.1.1.7 (on big C11747T polytomy) need to be retracted because they have no other defining mutation.

I don't think they are all artefacts. It's possible but unlikely, because it seems to be only the Danish sequences that miscall the deletion as F. Whenever there is a natural country distribution and multiple labs, it's unlikely that an artefact is happening. If one were, we should of course retract the lineage, but I haven't seen any convincing evidence to that end (I also haven't looked at those again since designating).

@xz-keg
Contributor Author

xz-keg commented Aug 8, 2024

> S:31del (or S:S31del) is listed for several JN.1 descendant lineages, and I see that includes KP.3.1.1, but as far as I can see they don't seem to be ancestors or siblings of the lineages with S:S31F above (except for KP.2.14 above & KP.2.15 below, and KP.3.1.3 above & KP.3.1.1 below).
>
> grep 31del ~/github/pango-designation/lineage_notes.txt
>
> KP.1.1.3        Alias of B.1.1.529.2.86.1.1.11.1.1.1.3, C4999T, many with S:S31del, from sars-cov-2-variants/lineage-proposals#1502
> KP.2.3  Alias of B.1.1.529.2.86.1.1.11.1.2.3, S:H146Q, ORF3a:K67N, S:31del, from sars-cov-2-variants/lineage-proposals#1459
> KP.2.15 Alias of B.1.1.529.2.86.1.1.11.1.2.15, A10861G, S:31del, USA/Canada
> KP.3.1.1        Alias of B.1.1.529.2.86.1.1.11.1.3.1.1, ORF1a:S4286C, C12616T, S:S31del, Spain, from sars-cov-2-variants/lineage-proposals#1563
> KP.4.1.3        Alias of B.1.1.529.2.86.1.1.11.1.4.1.3, ORF1a:M598V, A5245G, S:31del
> LF.2    Alias of B.1.1.529.2.86.1.1.16.1.2, ORF1a:K247R, ORF3a:Y184H, many with S:S31del, from sars-cov-2-variants/lineage-proposals#1502
> LF.4.1  Alias of B.1.1.529.2.86.1.1.16.1.4.1, S:31del, ORF1a:G519S, ORF8:Q29*, from sars-cov-2-variants/lineage-proposals#1590
> MA.1    Alias of B.1.1.529.2.86.1.1.18.3.1, S:R190S, S:31del, from sars-cov-2-variants/lineage-proposals#1635

I'm not suggesting masking S31F for everything on JN.1. In fact it is one of the beneficial convergent mutations that appear many times in the real world.
I'm only suggesting masking S:S31F for lineages that are defined by S:S31-, as S31F on these lineages is clearly an artefact.

These designated S31- lineages are (for now):

C28714T branch of KP.2.3
KP.2.15
LB.1 except for LB.1.8
KP.3.1.1
KP.1.1.3
KP.4.1.3
MA.1
LF.2, LF.4.1 and LF.1.1.1
XDY

Please mask 21653-21655 for seqs on these lineages (or alter the Danish 31F seqs belonging to these lineages to 31-).
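
To make the request concrete, here is a minimal Python sketch of lineage-conditional masking on a reference-aligned sequence. It is not how the UShER build does masking; the function names, the `mask_char` parameter, and the lineage set (a hypothetical subset of the list above, without descendant handling or the LB.1.8 exception) are all assumptions for illustration:

```python
def mask_region(aligned_seq, start, end, mask_char="N"):
    """Replace 1-based positions start..end (inclusive) with mask_char."""
    return aligned_seq[:start - 1] + mask_char * (end - start + 1) + aligned_seq[end:]

# Hypothetical subset of the S:31del lineages listed above (exact names only).
S31DEL_LINEAGES = {"KP.3.1.1", "KP.2.15", "KP.1.1.3", "KP.4.1.3", "MA.1"}

def mask_s31_if_deleted_lineage(lineage, aligned_seq):
    """Mask 21653-21655 only when the sequence is assigned to an S:31del lineage."""
    if lineage in S31DEL_LINEAGES:
        return mask_region(aligned_seq, 21653, 21655)
    return aligned_seq
```

Sequences on other lineages pass through unchanged, so genuine S:S31F calls elsewhere on JN.1 would be kept.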

@xz-keg
Contributor Author

xz-keg commented Aug 16, 2024

@AngieHinrichs It is causing more and more trouble now.

[image: usher]
For example, almost every lineage on KP.3.1.1 now has a "back-up branch" on the S31F artefact branch.
[image]

@FedeGueli

FedeGueli commented Aug 16, 2024

I think the queries don't miss any of them; the only real issue will be if a fast lineage emerges in Denmark that extensively misses 31del.

@xz-keg
Contributor Author

xz-keg commented Aug 16, 2024

> I think the queries don't miss any of them; the only real issue will be if a fast lineage emerges in Denmark that extensively misses 31del.

Queries won't miss them, but the usher tree will be very messy, with each lineage split across 2 different places.

@xz-keg
Contributor Author

xz-keg commented Aug 16, 2024

@AngieHinrichs This bug is more harmful than normal artefacts.

Normal artefacts can only attract seqs without coverage at that position; seqs with correct coverage won't be affected unless a stable flip-flop reversion branch is formed.

However, this bug can attract ALL SEQS, as ALL SEQS have no coverage at S:31 (because it is deleted), making the bug more serious in theory.

@FedeGueli

FedeGueli commented Aug 16, 2024

To me the bug is not that serious: the tree can attract only S:S31F sequences, not all. Masking it could instead hide a real recombination event, since a lot of lineages are expanding with 31P and 31F.

@xz-keg
Contributor Author

xz-keg commented Aug 16, 2024

> To me the bug is not that serious: the tree can attract only S:S31F sequences, not all. Masking it could instead hide a real recombination event, since a lot of lineages are expanding with 31P and 31F.

Nay. The tree can attract all 31del seqs as they have no coverage at S:31. No coverage = can be placed anywhere. I believe usher works this way, @AngieHinrichs. It does not attract only 31F seqs.

For example, #1881 is attracted despite not having S:S31F.

@corneliusroemer
Contributor

Agreed with @aviczhl2. I wonder though: why hasn't Usher simply inferred 31 to be F already for all of KP.3.1.1? That would be the parsimonious solution. The reason is that some KP.3.1.1 are wrongly called reference (instead of N or deletion), and that artefact is more common than the Danish one. Right?

I agree masking would make total sense due to the fact there'll be massive messiness that will only increase.

@xz-keg
Contributor Author

xz-keg commented Aug 16, 2024

> Agreed with @aviczhl2. I wonder though: why hasn't Usher simply inferred 31 to be F already for all of KP.3.1.1? That would be the parsimonious solution. Reason is that some KP.3.1.1 are wrongly called reference (instead of N or deletion) - and that artefact is more common than the Danish one. Right?
>
> I agree masking would make total sense due to the fact there'll be massive messiness that will only increase.

Let me explain.

1: There are not many real S31F artefacts for S31del branches.
2: Usher will try to fill in mutations for seqs on codons with no coverage.
3: All 31del seqs have no coverage on S:31, as it is deleted.
4: 2+3 => Usher cannot handle deletions. It simply thinks the seqs from 3 have missing coverage on S:31 and tries to fill in mutations for them.
5: 3+4 => All seqs on KP.3.1.1 (and other 31del branches) can be placed either on 31F artefact branches or on normal branches that do not include any mutation on S:31.
6: 5 causes seqs to split at two positions, resulting in a messy tree.
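
The tie described in steps 2 to 5 can be shown with a toy parsimony score. This is a minimal Python sketch, not UShER's placement code: `placement_cost`, the two-site branch states, and both samples are made up for illustration, with N treated as compatible with any base.

```python
def placement_cost(branch_states, sample):
    """Count sites where the sample disagrees with a candidate branch; N matches anything."""
    return sum(
        1
        for site, base in sample.items()
        if base != "N" and branch_states.get(site) != base
    )

# Toy states at two sites: 12616 (C12616T, listed for KP.3.1.1) and 21654.
normal_branch = {12616: "T", 21654: "C"}
artefact_branch = {12616: "T", 21654: "T"}   # KP.3.1.1 + C21654T

deleted_sample = {12616: "T", 21654: "N"}    # S:31del reported as N
covered_sample = {12616: "T", 21654: "C"}    # hypothetical sample with a real C call

costs_deleted = (placement_cost(normal_branch, deleted_sample),
                 placement_cost(artefact_branch, deleted_sample))
costs_covered = (placement_cost(normal_branch, covered_sample),
                 placement_cost(artefact_branch, covered_sample))
```

The N sample scores the same on both branches, so other tie-breaking decides where it lands, while a sample with a real base call at 21654 is cheaper on the normal branch.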

@AngieHinrichs

@corneliusroemer
Contributor

@aviczhl2 that's not enough to explain the screw-up. Because if it was always either deletion or F, Usher would infer everything to be F and there would be no messy tree.

The requirement for messiness is that both types of artefacts exist here: wild type and F, instead of the correct deletion.

As long as it's only deletion plus one other base, it's ok, it will infer the base. The messiness here comes from 2 base artefacts occurring.

Does that make sense?
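
This argument can be illustrated with Fitch parsimony on a toy tree. A minimal Python sketch, assuming N is treated as "any base"; the tree shapes and leaf calls are invented, and this is textbook Fitch, not UShER's own code:

```python
ALL = frozenset("ACGT")

def fitch(tree):
    """Return (possible ancestral states, minimum changes) for a nested-tuple tree of leaf bases."""
    if isinstance(tree, str):
        return (ALL if tree == "N" else frozenset(tree)), 0
    left_states, left_cost = fitch(tree[0])
    right_states, right_cost = fitch(tree[1])
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed somewhere below this node.
    return left_states | right_states, left_cost + right_cost + 1

# Leaves at position 21654: N = deletion reported as missing, T = miscalled F, C = miscalled wild type.
only_f = fitch((("N", "T"), ("N", "N")))
mixed = fitch((("N", "T"), ("C", "N")))
```

With only N and T at the tips the root resolves to T at zero cost, so everything is imputed as F and the tree stays clean. Once a C call is also present, at least one change is forced and the N tips can go either way.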

@xz-keg
Contributor Author

xz-keg commented Aug 16, 2024

> @aviczhl2 that's not enough to explain the screw-up. Because if it was always either deletion or F, Usher would infer everything to be F and there would be no messy tree.
>
> The requirement for messiness is that both types of artefacts exist here: wild type and F, instead of the correct deletion.
>
> As long as it's only deletion plus one other base, it's ok, it will infer the base. The messiness here comes from 2 base artefacts occurring.
>
> Does that make sense?

Yeah. I think you're right. There are also the traditional base-filling artefacts that fill in S for 31del.

@FedeGueli

FedeGueli commented Oct 3, 2024

In the current situation, with possible recombinations with fast 22N+ 31F 31P, I would close this. ping @aviczhl2

@xz-keg
Contributor Author

xz-keg commented Oct 3, 2024

> In the current situation, with possible recombinations with fast 22N+ 31F 31P, I would close this. ping @aviczhl2

I don't think it is a good reason. We can identify those potential recombs using other mutations, and having every KP.3.1.1 descendant branch divided into two certainly won't help us identify those.

@xz-keg
Contributor Author

xz-keg commented Oct 11, 2024

@AngieHinrichs
The KP.3.1.1+S31F artifact branch is >2000 seqs now, causing much trouble identifying lineages (nearly all significant KP.3.1.1 sublineages have a back-up branch inside it). At least this should be solved.

@xz-keg
Contributor Author

xz-keg commented Oct 31, 2024

Now the largest KP.3.1.1 branch is pruned, but others still exist.

@AngieHinrichs

Last week I pruned all sequences in BA.2.86 with C21654T, and added them back using the new alignment method, nextclade instead of mafft. That should make the treatment of these at least more consistent. Unfortunately the deletions are still treated as 'N' -- I think it would be better to treat deletions as reference to prevent substitutions from being inferred. I will look into masking deletions to reference instead of N when passing input to usher-sampled.
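
The two treatments of a deletion mentioned here differ only in what fills the gap. A minimal Python sketch, assuming gaps appear as `-` in a reference-aligned sequence; `fill_gaps` and the nine-base toy window are hypothetical and not part of the actual usher-sampled input pipeline:

```python
def fill_gaps(reference, aligned, mode="reference"):
    """Replace '-' in an aligned sequence with the reference base or with 'N'."""
    if len(reference) != len(aligned):
        raise ValueError("sequences must be aligned to the same length")
    if mode not in ("reference", "N"):
        raise ValueError("mode must be 'reference' or 'N'")
    return "".join(
        (ref_base if mode == "reference" else "N") if base == "-" else base
        for ref_base, base in zip(reference, aligned)
    )

reference = "AATTCTTTC"   # toy window, not a full genome
aligned = "AAT---TTC"     # same window with three bases deleted

as_reference = fill_gaps(reference, aligned)        # no substitution can be inferred here
as_missing = fill_gaps(reference, aligned, "N")     # left open for imputation
```

With `mode="reference"` the deleted bases look identical to the reference, so nothing is left to impute at those positions.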

@corneliusroemer
Contributor

> I think it would be better to treat deletions as reference to prevent substitutions from being inferred.

Only as long as what was deleted doesn't contain a mutation? I think as long as there are 2 different bases circulating somewhere where there's also a deletion, there's just no clean solution as long as Usher doesn't annotate Ns and treat them specially (not imputing)? It might be that I haven't thought it through though, this is my gut feeling only.

@AngieHinrichs

Good point, if there's a substitution and then some descendants get a deletion at the same position, then treating deletion as reference instead of N would create the appearance of a reversion at that position. When there's an assembly pipeline that has trouble identifying deletions, I'm not sure what it would do in that situation. I still think the "reversion" would make for a bit more stability than Ns, which can be imputed to anything.
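
The apparent reversion can be reproduced on a four-base toy example. A minimal Python sketch with invented sequences; `call_changes` is a made-up helper that lists differences from a parent and ignores N:

```python
def call_changes(parent, child):
    """List substitutions as (index, parent base, child base), skipping N in the child."""
    return [
        (i, p, c)
        for i, (p, c) in enumerate(zip(parent, child))
        if p != c and c != "N"
    ]

reference = "ACGT"
parent = "ATGT"      # carries a substitution at index 1 (C -> T)
child_raw = "A-GT"   # descendant that deletes that position

# Deletion filled with the reference base versus with N.
child_as_ref = "".join(r if b == "-" else b for r, b in zip(reference, child_raw))
child_as_n = child_raw.replace("-", "N")

changes_as_ref = call_changes(parent, child_as_ref)
changes_as_n = call_changes(parent, child_as_n)
```

Filling with reference produces one T to C change relative to the parent, which reads as a reversion; filling with N produces no change but leaves the site open to imputation.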

@AngieHinrichs

I now remove deletions from the input to UShER so that they're no longer treated as Ns. I pruned all sequences that had (or were imputed to have) C21654T and re-added them, and now there are very few sequences on the KP.3.1.1 > C21654T branch, and mostly from Denmark.

@FedeGueli

FedeGueli commented Nov 5, 2024

> I now remove deletions from the input to UShER so that they're no longer treated as Ns. I pruned all sequences that had (or were imputed to have) C21654T and re-added them, and now there are very few sequences on the KP.3.1.1 > C21654T branch, and mostly from Denmark.

Thank you Angie, great work! Yeah, Denmark not only calls deletions badly but calls them as a different AA, so "it is right" to see them there.

@DailyCovidCases

Dead, please close it.
