XBB.1 Sublineage with S:E180V, S:K478R, S:S486P, ORF9b:I5T, ORF9b:N55S, ORF1a:L3829F, ORF1b:D1746Y (42 seq) #1723

Closed
ryhisner opened this issue Mar 4, 2023 · 24 comments
Labels
designated recommended Recommended for designation by pango team member XBB proposed sublineage of XBB

@ryhisner

ryhisner commented Mar 4, 2023

Description
Sub-lineage of: XBB.1
Earliest sequence: 2023-1-23, USA, New York — EPI_ISL_16835403
Most recent sequence: 2023-2-24, India, Maharashtra— EPI_ISL_17073064; Singapore (with travel from India) — EPI_ISL_17030043; Denmark — EPI_ISL_17048705
Countries circulating: Primarily in India. Has been sequenced in India (23), USA (7, at least five with international travel history), Singapore (6, all with travel from India), England (2), Denmark (1), Germany (1), Ireland (1), Italy (1).
Number of Sequences: 42
GISAID Query: T12730A, T28297C, A28447G
CovSpectrum Query: T12730A, T28297C, A28447G
Substitutions on top of XBB.1:
Spike: E180V, K478R, S486P
ORF9b: I5T, N55S
ORF1a: L3829F (NSP6_L260F)
ORF1b: D1746Y (NSP14_D222Y)
Nucleotide: C11750T, C11956T, T12730A, A14856G, G18703T, A22101T, A22995G, T23018C, T28297C, A28447G, C29386T

USHER Tree
The Usher tree looks as if it has two very separate branches, but this is an artifact from the very low spike coverage in most of the Indian sequences here. The branches in the lower section of the tree consist almost entirely of artifactual reversions. Similarly, all the sequences that appear to lack S:E180V merely lack coverage there and therefore almost certainly possess it.
https://nextstrain.org/fetch/raw.githubusercontent.com/ryhisner/jsons/main/XBB.1_Lineage_20_seq_Tree_subtreeAuspice1_genome_1cd4f_28ede0.json
image

Evidence
This saltation lineage has already spread quite widely across the globe, but of the non-Indian sequences with adequate metadata about travel history, almost all indicate international travel, mostly from India. One USA sequence lists travel history from Ethiopia, two list India, and the rest do not specify a country (but were sequenced by Ginkgo Bioworks, which only sequences incoming international travelers). All six sequences from Singapore have travel history in India. Sequencing in India has been rather sparse of late, so this lineage may comprise a substantial fraction of infections there, particularly given that it was first sequenced on January 23.

S:K478R has been present in a few smaller lineages (CM.4.1, BA.2.38.3) and regularly appears in scattered sequences. ORF1a:L3829F is of course found in all BQ* sequences, but it is also one of the most convergent ORF1a mutations found in chronic-infection sequences. ORF9b:I5T (T28297C) is in XBB.1.9 and has been posited to be the reason XBB.1.9 lineages seem to grow somewhat faster than XBB.1.5. ORF9b has been implicated in immune evasion, primarily interferon suppression I think, so it's possible ORF9b:N55S could confer some further resistance to immunity. Both of these ORF9b mutations are synonymous in N.

Genomes

@AnonymousUserUse

CovSpectrum Query: Nextcladepangolineage:

CoV-Spectrum query missing

Both of these ORF9b mutations are synonymous in N.

I cannot understand this sentence. What do you mean with N here?

@FedeGueli
Contributor

FedeGueli commented Mar 4, 2023

N

CovSpectrum Query: Nextcladepangolineage:

CoV-Spectrum query missing

Both of these ORF9b mutations are synonymous in N.

I cannot understand this sentence. What do you mean with N here?

CovSpectrum: T12730A, T28297C, A28447G

ORF9b is just an alternate reading frame of ORF9a, i.e. the gene for the N (nucleocapsid) protein.

@FedeGueli
Contributor

@ryhisner I'm seeing a lot of S:478R, mainly from SA and Russia, and in XBB.1.5.
It was a defining mutation of BH.1, which, together with BJ.1 and BA.2.10.4, was a main actor in the first era of heavily mutated BA.2 lineages from the Indian area, an era later won by BA.2.75 and its recombinant XBB.

@FedeGueli
Contributor

@corneliusroemer @thomaspeacock @InfrPopGen @AngieHinrichs I suggest a very fast designation of this one so it can be monitored as soon as possible. (I have already added it to internal charts, and its growth is in the top range, comparable to both XBB.1.9.1 and XBB.1.9.2 at the same number of sequences.) From its profile, I bet it will compete with the other leading XBB.1+486P spikes.

@FedeGueli
Contributor

We didn't pay much attention to XBB.1.9's early advantage, but it later proved real, so I want to highlight that the same signal is present here too, and clearly also against XBB.1.9:
Screenshot 2023-03-04 at 17:49:40
https://cov-spectrum.org/explore/World/AllSamples/Past2M/variants?nextcladePangoLineage=XBB.1.9.1*&nucMutations1=T12730A%2CT28297C%2CA28447G&analysisMode=CompareToBaseline&

@ryhisner
Author

ryhisner commented Mar 4, 2023

@AnonymousUserUse, ORF9b overlaps with N (nucleocapsid) in the SARS-CoV-2 genome, but they are out of frame with respect to each other, meaning that a nucleotide mutation that results in an amino acid (AA) substitution in ORF9b does not always cause an AA substitution in N. Nucleotide mutations that cause an AA substitution are called non-synonymous. Those that do not cause an AA change are called synonymous. Everything below is a layman's simplification, some of which may not be precisely correct but which I think gets the basic picture right.

For example, nucleotide T28297 is the third nucleotide of codon N:N8, which has the nucleotide sequence AAT. T28297C changes the sequence for this AA to AAC. However, both AAT and AAC code for the same amino acid: asparagine (symbolized by N). So T28297C is synonymous in N. In ORF9b, T28297 is the second nucleotide of the fifth codon, ORF9b:I5, whose nucleotides are ATC. T28297C changes this from ATC to ACC, which results in a change in amino acid from isoleucine (I) to threonine (T).
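The codon arithmetic above can be checked with a short script. This is a minimal sketch, not part of any tool mentioned in this thread: the helper names `codon_of` and `mutate` are hypothetical, the gene start coordinates (N at 28274, ORF9b at 28284) are the ones quoted in this thread, and the codon table holds only the four codons discussed here.

```python
# Gene start coordinates (1-based) as quoted in this thread.
GENE_START = {"N": 28274, "ORF9b": 28284}

# Only the codons discussed above; a full table would have 64 entries.
CODON_TO_AA = {"AAT": "N", "AAC": "N", "ATC": "I", "ACC": "T"}


def codon_of(gene: str, nt_pos: int) -> tuple[int, int]:
    """Return (codon number, position within codon), both 1-based."""
    offset = nt_pos - GENE_START[gene]
    return offset // 3 + 1, offset % 3 + 1


def mutate(codon: str, pos_in_codon: int, new_base: str) -> str:
    """Replace one base (1-based position) in a three-letter codon."""
    i = pos_in_codon - 1
    return codon[:i] + new_base + codon[i + 1:]


# The same nucleotide change, T28297C, read in both overlapping frames.
for gene, ref_codon in (("N", "AAT"), ("ORF9b", "ATC")):
    number, pos = codon_of(gene, 28297)
    new_codon = mutate(ref_codon, pos, "C")
    old_aa, new_aa = CODON_TO_AA[ref_codon], CODON_TO_AA[new_codon]
    kind = "synonymous" if old_aa == new_aa else "non-synonymous"
    print(f"{gene}:{old_aa}{number}{new_aa} ({ref_codon}->{new_codon}) is {kind}")
```

Because the two genes start 10 nucleotides apart, and 10 is not a multiple of 3, the same nucleotide falls at a different position within the codon in each frame, which is why one change can be silent in N and still alter ORF9b.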

image

You can see how N and ORF9b overlap in the diagram below, which I pasted together using screenshots from NextClade. The N gene spans nucleotides 28274-29533 while ORF9b stretches from 28284-28577. The RNA-dependent RNA polymerase (RDRP), which basically makes copies of each viral gene by creating a complementary RNA strand, runs along the genome, beginning at the 3' end (the far right side in the diagram below). Each of the genes pictured (except ORF1b) has its own code (called a transcription regulatory sequence, or TRS) near its 5' end (left side in diagram) that the RDRP can recognize as a signal to stop, latch onto the RNA, and begin scanning the other direction. When it reaches a start codon (the nucleotide sequence ATG), it starts creating the complementary RNA strand. When it reaches a stop codon (TAA, TAG, or TGA), it stops copying.

image

@HynnSpylor
Contributor

HynnSpylor commented Mar 4, 2023

Great proposal with an amazing growth rate. I agree that XBB.1+S:F486P+X (where X is any other important mutation) should also be monitored carefully.
Several days ago I noticed two other possible sublineages (#1704, #1712) but missed this one.

@xz-keg
Contributor

xz-keg commented Mar 4, 2023

ORF1a:L3829F again; it seems that this mutation occurs independently in many chronic-infection sequences.
#405
#764
#770
#871
#1052, BS.1
#1266, BA.5.2.42
#1724

It seems that this mutation is convergent among chronic long branches.

@thomasppeacock thomasppeacock added recommended Recommended for designation by pango team member XBB proposed sublineage of XBB labels Mar 4, 2023
@AnonymousUserUse

@ryhisner
Thanks a lot for the detailed explanation!
What is the range of ORF1a and ORF10? I have often heard of that, but cannot find an answer for the exact range of these two genes.

@AngieHinrichs
Member

What is the range of ORF1a and ORF10? I have often heard of that, but cannot find an answer for the exact range of these two genes.

The NCBI RefSeq https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2 includes gene annotations at the nucleotide coding level and the protein level (ORF1a and ORF1ab are each split into several small proteins), so if you search for ORF1a and ORF10 on that page you can find their ranges in the reference genome and some other info about them.

The RefSeq annotations for NC_045512.2 include only N (a.k.a. ORF9 or nucleocapsid), they don't divide it into ORF9a and ORF9b.

Nextstrain's annotations include ORF9b: https://github.com/nextstrain/ncov/blob/master/defaults/annotation.gff Beware, those annotations also artificially split ORF1ab into separate ORF1a (which is real) and ORF1b (which is not real) in order to avoid having to account for ribosomal slippage in ORF1ab when translating nucleotide changes to protein changes.

@FedeGueli
Contributor

orf1a 1-4401
orf10 29558-end of genome (3' end)

Here is what you need: https://codon2nucleotide.theo.io/

@AnonymousUserUse

AnonymousUserUse commented Mar 4, 2023

Summary:

| Gene | Range of codon | Range of nucleotide | Used in Nextstrain | Used in GISAID | Real or not |
|---|---|---|---|---|---|
| ORF1ab | 1-7098 | 266-21555 | No | Yes | Real |
| ORF1a | 1-4401 | 266-13468 | Yes | No | Real |
| ORF1b | 1-2696 | 13468-21555 | Yes | No | Not real |
| S | 1-1274 | 21563-25384 | Yes | Yes | Real |
| ORF3a | 1-276 | 25393-26220 | Yes | Yes | Real |
| E | 1-76 | 26245-26472 | Yes | Yes | Real |
| M | 1-223 | 26523-27191 | Yes | Yes | Real |
| ORF6 | 1-62 | 27202-27387 | Yes | Yes | Real |
| ORF7a | 1-122 | 27394-27759 | Yes | Yes | Real |
| ORF7b | 1-44 | 27756-27887 | Yes | Yes | Real |
| ORF8 | 1-122 | 27894-28259 | Yes | Yes | Real |
| N | 1-420 | 28274-29533 | Yes | Yes | Real |
| ORF9b | 1-98 | 28284-28577 | Yes | No | Real |
| ORF10 | 1-39 | 29558-29674 | No | Yes | Real |

CoV-Spectrum also uses the annotation from Nextstrain.
https://codon2nucleotide.theo.io/ shows the annotation from GISAID.

Is this correct?
Thanks, all. And I apologize for going off-topic.

------edit 2023/3/5------
Range of M and ORF9b has been corrected
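The relationship between the codon and nucleotide columns in the summary can be sketched in a few lines. The function name `codon_to_nucleotides` is hypothetical, and the start coordinates are copied from the Nextstrain-style rows of the summary. The sketch treats ORF1a and ORF1b as separate genes, so it does not model ribosomal slippage in ORF1ab.

```python
# Start coordinates (1-based) copied from the summary in this comment.
GENE_START = {"ORF1a": 266, "ORF1b": 13468, "S": 21563, "N": 28274, "ORF9b": 28284}


def codon_to_nucleotides(gene: str, codon: int) -> tuple[int, int]:
    """Return the first and last nucleotide positions (1-based, inclusive) of a codon."""
    first = GENE_START[gene] + 3 * (codon - 1)
    return first, first + 2


# Amino acid substitutions and the nucleotide mutations listed for them in this issue.
checks = [("S", 180, 22101), ("S", 478, 22995), ("S", 486, 23018),
          ("ORF1a", 3829, 11750), ("ORF1b", 1746, 18703), ("ORF9b", 55, 28447)]

for gene, codon, nt_pos in checks:
    first, last = codon_to_nucleotides(gene, codon)
    inside = first <= nt_pos <= last
    print(f"{gene}:{codon} spans {first}-{last}; contains {nt_pos}: {inside}")
```

Each nucleotide mutation listed in the proposal falls inside the codon computed for its amino acid substitution, which is a quick consistency check on the ranges for these genes.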

@FedeGueli
Contributor

@InfrPopGen @corneliusroemer @thomasppeacock @AngieHinrichs
To better contextualize recommended lineages, @alurqu and I tried adding them to collection 24 to preview how they will rank in the global competition.

In the case of this proposed issue, it is worth flagging, even though the low numbers should make us take this growth advantage with a very big grain of salt.
https://cov-spectrum.org/collections/24
Screenshot 2023-03-05 at 01:13:36

@Mike-Honey

Is this S:S486P or S:F486P? I'm guessing F?

@c19850727

@Mike-Honey it depends on how you choose your reference. if it's relative to ancestral then it's F486P, but I think @ryhisner is using XBB.1 as the reference here.

@NkRMnZr

NkRMnZr commented Mar 5, 2023

Is this S:S486P or S:F486P? I'm guessing F?

Those use different references. In XBB.1 (or tracing back to BM.1.1.1), codon 486 of the S protein encodes S; relative to the root, that codon encodes F.
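The point about reference choice can be made concrete with a tiny sketch. The residues at spike position 486 are the ones stated in the replies above (F in the ancestral sequence, S in XBB.1, P in the proposed sublineage); the dictionary and function names are hypothetical.

```python
# Residue at spike position 486 in each background, as stated in the replies above.
SPIKE_486 = {"ancestral": "F", "XBB.1": "S", "proposed": "P"}


def substitution_label(reference: str, sample: str, position: int = 486) -> str:
    """Name the substitution seen in `sample` relative to the chosen `reference`."""
    return f"S:{SPIKE_486[reference]}{position}{SPIKE_486[sample]}"


print(substitution_label("ancestral", "proposed"))  # relative to the root
print(substitution_label("XBB.1", "proposed"))      # relative to the parent lineage
```

Both labels describe the same residue in the same genomes; only the comparison sequence differs.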

@InfrPopGen InfrPopGen self-assigned this Mar 5, 2023
InfrPopGen added a commit that referenced this issue Mar 5, 2023
Added new lineage XBB.1.16 from #1723 with 3 new sequence designations, and 0 updated
@InfrPopGen InfrPopGen added this to the XBB.1.16 milestone Mar 5, 2023
@InfrPopGen
Contributor

Thanks for submitting. We've added lineage XBB.1.16 with 3 newly designated sequences, and 0 updated. Defining mutations A22101T (S:E180V), T28297C, C29386T (following C11750T (ORF1a:L3829F), G18703T (ORF1b:D1746Y), A22995G (S:K478R), T23018C (S:S486P), A14856G, A28447G).

@FedeGueli
Contributor

Thank you @InfrPopGen for your Sunday work!

@corneliusroemer
Contributor

I added extra sequences to make inference more robust - it was only 3 designations thus far.

The lineage seems to be on a shared branch with XBB.1.12, defined by 11956T. Do you agree? The presence of many basal sequences and a clean tree on that branch makes me think that this looks like a plausible sequence of events.

I was struck by some XBB.1.9 having the same mutation (11956T) - is that a defining mutation or was it pulled in to the tree via preferential sampling by @ryhisner? Maybe there's some dropout samples in the designated sequences causing >10% of sequences to miss that mutation. In that case, XBB.1.9 may also be on that branch.

Edit: I looked into 11956T in XBB.1.9 (sampling homogeneously from across XBB.1.9). It looks like 11956T pops up in XBB.1.9.1 only. Strange. So real homoplasy? Or could it be that this is an artefact in some way? Investigation welcome :)

That part of the tree is unfortunately messed up on Usher. I have a gut feeling that low coverage/bad qc sequences screw up the tree there. Maybe it would be possible to make 2 Usher trees: 1. One with a high bar for quality, maybe including only known labs of good quality, use that for the macro-structure. 2. Place lower quality sequences but make sure this doesn't cause flip-flopping. Maybe lower quality sequences need to be scored differently in the parsimony cost function - or flip-flopping needs to be penalized so that it gets optimized away. @AngieHinrichs I try very hard to make the Nextclade reference tree as close to what we consider to be the real tree in macro structure as possible. Maybe some sort of hybrid could be possible - macro constraint tree using human curation (pango lineages, manual constraint tree, overwriting artefacts etc) then letting usher fill in the gaps below there - and potentially suggest where the macro tree may be wrong.

image

image

@AngieHinrichs
Member

That part of the tree is unfortunately messed up on Usher.

Yes, XBB.1.9 needs a little fixup. If you label w/back-mutations you can see a couple there. XBB.1.9.1 (XBB.1.9 > C11956T > S:S486P (T23018C)) has ~200 sequences with N:T362I (C29358T); then there are 5 sequences with N:T362I (C29358T) and S:S486P (T23018C) but without 11956T (so reversion T11956C) -- and those have pulled in sequences that have S:S486P (T23018C) but neither 11956T nor N:T362I (C29358T), including the XBB.1.9.2 branch, doh.

I think I can fix this by temporarily removing those 5 sequences and reoptimizing. But it would be even better to prevent these situations where a few wayward sequences can pull in a large branch by adding reversions. I'm wondering if there is a way to prevent that from happening in matOptimize, or maybe to make a utility that recognizes the pattern and puts those post-reversion branches where they would go without reversions (unless that would be a clear loss parsimony-wise).

I'm not sure that excluding all sequences from certain labs is the right way to go about it (although I agree some labs seem to cause more trouble than others). Even labs that produce some frustrating sequences also produce some OK sequences, and sometimes they cover an undersampled part of the world. On the other hand, any lab that produces tons of sequences will produce some bad ones that cause trouble like this even if overall quality is good.

@FedeGueli
Contributor

@corneliusroemer C11956T is highly homoplasic; I highlighted this to you back in the XBB.1.9.1/2 issue. I don't think it has missed a single lineage since the start of the pandemic (I'm exaggerating to make the point clear: it has popped up everywhere, every time).
Screenshot 2023-03-08 at 12:55:36
https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?nucMutations=C11956T&aaMutations1=S%3A153I%2CS%3A1258Q%2CN%3A151L&

@xz-keg
Contributor

xz-keg commented Mar 8, 2023

I'm wondering if there is a way to prevent that from happening in matOptimize, or maybe to make a utility that recognizes the pattern and puts those post-reversion branches where they would go without reversions (unless that would be a clear loss parsimony-wise).

Is there a way to do that? For example, adding punishment for reversions? (For example, giving 2 instead of 1 for reversion mutations in parsimony score)

@AngieHinrichs
Member

Is there a way to do that? For example, adding punishment for reversions? (For example, giving 2 instead of 1 for reversion mutations in parsimony score)

Yes, I think giving 2 instead of 1 would get rid of most reversions -- but occasionally there is a real one, e.g. in BA.2 we saw several genuine reversions of A23040G (S:Q493R), and any time there is a recombinant, reversions may be helpful to place it on or near one of its parent lineages.

Adding a small fractional penalty for reversions would probably be nice, but it would mess up the simple integer parsimony scoring that is very fast. I have seen an alignment scoring scheme that uses larger integers like 100 instead of 1 for a match, so certain nucleotide matches/changes could be given slightly higher or lower scores/penalties calibrated to the species being aligned (Chiaromonte 2002), and gap extension could be penalized significantly less than gap initiation.
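As a rough illustration of the scaled-integer idea, the sketch below scores two hypothetical placements of one sample, charging more for a mutation that restores the reference base. It is not how matOptimize or UShER score trees. The function names, the cost values, and the two candidate placements are assumptions chosen only to show the arithmetic, loosely modeled on the T11956C example earlier in this thread.

```python
# Reference bases at the positions used below, read off the mutation names in this
# thread (e.g. C11956T means the reference base at position 11956 is C).
REFERENCE = {11956: "C", 23018: "T", 29358: "C"}


def is_reversion(mutation: str) -> bool:
    """True if the mutation (e.g. 'T11956C') restores the reference base."""
    position, new_base = int(mutation[1:-1]), mutation[-1]
    return REFERENCE.get(position) == new_base


def placement_cost(mutations: list[str], forward: int, reversion: int) -> int:
    """Sum integer costs over the extra mutations a placement requires."""
    return sum(reversion if is_reversion(m) else forward for m in mutations)


# Hypothetical candidate placements for one sample:
with_reversion = ["T11956C"]                # attach below C11956T, then revert it
without_reversion = ["T23018C", "C29358T"]  # attach higher up, re-acquire two mutations

for forward, reversion in [(1, 1), (1, 2), (1, 3), (100, 150)]:
    a = placement_cost(with_reversion, forward, reversion)
    b = placement_cost(without_reversion, forward, reversion)
    print(f"forward={forward} reversion={reversion}: with={a} without={b}")
```

In this toy case, unit costs favor the placement with the reversion, a reversion cost of 2 produces a tie, and a cost of 3 flips the choice. The scaled 100/150 scheme keeps the arithmetic in integers while acting as a fractional penalty, and here it still leaves the reversion placement cheaper.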

I think there are tie-breaking situations in which reversions are supposed to be favored less. Really I don't understand matOptimize well enough to know exactly what to ask for. I need to compile some good examples and test cases for @yceh and hope he has some time. :)

@xz-keg
Contributor

xz-keg commented Mar 10, 2023

Is there a way to do that? For example, adding punishment for reversions? (For example, giving 2 instead of 1 for reversion mutations in parsimony score)

Yes, I think giving 2 instead of 1 would get rid of most reversions -- but occasionally there is a real one, e.g. in BA.2 we saw several genuine reversions of A23040G (S:Q493R), and any time there is a recombinant, reversions may be helpful to place it on or near one of its parent lineages.

I guess that with a reversion cost of 2-3, real reversions like those in BA.2 will still be detected, as they usually occur together with other groups of mutations that keep the placement with the reversion optimal even at that cost.

False reversions, meanwhile, will be largely reduced at a cost of 2-3, which makes branch-specific labels a more practical way to counter the remaining ones.
