XBB.1 Sublineage with S:E180V, S:K478R, S:S486P, ORF9b:I5T, ORF9b:N55S, ORF1a:L3829F, ORF1b:D1746Y (42 seq) #1723
Comments
CoV-Spectrum query missing
I cannot understand this sentence. What do you mean by N here?
N
CovSpectrum: T12730A, T28297C, A28447G. ORF9b is just an alternate reading frame within ORF9a, which encodes the N protein.
@ryhisner I'm seeing a lot of S:478R, mainly from SA, Russia, and in XBB.1.5.
@corneliusroemer @thomaspeacock @InfrPopGen @AngieHinrichs I suggest a very fast designation of this one so that it can be monitored as soon as possible. I have already added it to internal charts, and its growth is in the top range, comparable to both XBB.1.9.1 and XBB.1.9.2 at the same number of sequences. From its profile, I bet it will compete with the other leading XBB.1+486P spikes.
We did not pay much attention to XBB.1.9's early advantage, but it was later shown to be real, so I want to highlight that the signal is present here too, and clearly also against XBB.1.9:
@AnonymousUserUse, ORF9b overlaps with N (nucleocapsid) in the SARS-CoV-2 genome, but the two are out of frame with respect to each other, meaning that a nucleotide mutation that results in an amino acid (AA) substitution in ORF9b does not always cause an AA substitution in N. Nucleotide mutations that cause an AA substitution are called non-synonymous. Those that do not cause an AA change are called synonymous. Everything below is a layman's simplification, some of which may not be precisely correct but which I think gets the basic picture right.
For example, nucleotide T28297 is the third nucleotide of N:N8, which has the nucleotide sequence AAT. T28297C changes the sequence for this AA to AAC. However, both AAT and AAC code for the same amino acid: asparagine (symbolized by N). So T28297C is synonymous in N. In ORF9b, T28297 is the second nucleotide of the fifth amino acid, ORF9b:I5, whose nucleotides are ATC. T28297C changes this from ATC to ACC, which changes the amino acid from isoleucine (I) to threonine (T).
You can see how N and ORF9b overlap in the diagram below, which I pasted together using screenshots from NextClade. The N gene spans nucleotides 28274-29533, while ORF9b stretches from 28284 to 28577.
The RNA-dependent RNA polymerase (RDRP), which basically makes copies of each viral gene by creating a complementary RNA strand, runs along the genome, beginning at the 3' end (the far right side in the diagram below). Each of the genes pictured (except ORF1b) has its own code (called a transcription regulatory sequence, or TRS) near its 5' end (left side in diagram) that the RDRP can recognize as a signal to stop, latch onto the RNA, and begin scanning in the other direction. When it reaches a start codon (the nucleotide sequence ATG), it starts creating the complementary RNA strand. When it reaches a stop codon (TAA, TAG, or TGA), it stops copying.
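The T28297C example can be checked with a few lines of Python. This is only a sketch: the gene start coordinates and the reference codons (AAT, ATC) are the ones quoted in the comment above, the codon table holds just the four codons the example needs, and the function names are made up for illustration.

```python
# Minimal codon table: only the codons used in this example.
CODON_TO_AA = {"AAT": "N", "AAC": "N", "ATC": "I", "ACC": "T"}

# 1-based start coordinates of each gene, as quoted in the comment above.
GENE_START = {"N": 28274, "ORF9b": 28284}


def locate(gene, nt_pos):
    """Return (codon number, position within the codon), both 1-based."""
    offset = nt_pos - GENE_START[gene]
    return offset // 3 + 1, offset % 3 + 1


def effect(gene, ref_codon, nt_pos, alt):
    """Describe what a single-nucleotide substitution does in one gene."""
    codon_no, pos = locate(gene, nt_pos)
    i = pos - 1
    new_codon = ref_codon[:i] + alt + ref_codon[i + 1:]
    old_aa, new_aa = CODON_TO_AA[ref_codon], CODON_TO_AA[new_codon]
    kind = "synonymous" if old_aa == new_aa else "non-synonymous"
    return f"{gene}:{old_aa}{codon_no}{new_aa} ({ref_codon}->{new_codon}, {kind})"


print(effect("N", "AAT", 28297, "C"))      # N:N8N (AAT->AAC, synonymous)
print(effect("ORF9b", "ATC", 28297, "C"))  # ORF9b:I5T (ATC->ACC, non-synonymous)
```

The same nucleotide lands on the third base of codon 8 in N but on the second base of codon 5 in ORF9b, because the two start coordinates differ by 10, which is not a multiple of 3.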
@ryhisner
The NCBI RefSeq https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2 includes gene annotations at the nucleotide coding level and the protein level (ORF1a and ORF1ab are each split into several small proteins), so if you search for ORF1a and ORF10 on that page you can find their ranges in the reference genome and some other info about them. The RefSeq annotations for NC_045512.2 include only N (a.k.a. ORF9 or nucleocapsid); they don't divide it into ORF9a and ORF9b. Nextstrain's annotations include ORF9b: https://github.com/nextstrain/ncov/blob/master/defaults/annotation.gff
Beware, those annotations also artificially split ORF1ab into separate ORF1a (which is real) and ORF1b (which is not real) in order to avoid having to account for ribosomal slippage in ORF1ab when translating nucleotide changes to protein changes.
ORF1a covers amino acids 1-4401. Here is what you need: https://codon2nucleotide.theo.io/
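As a rough illustration of how the ORF1a/ORF1b numbering relates to the mature-protein (NSP) numbering used in the issue description, here is a small sketch. Two things are assumptions not stated in the thread: that an ORF1b position maps to ORF1ab by adding the 4401 residues of ORF1a, and the NSP start positions, which are taken from the NC_045512.2 mature-peptide annotation. The function names are invented for this example.

```python
ORF1A_LENGTH = 4401  # ORF1a covers amino acids 1-4401

# Assumed first residue of each mature protein, in ORF1ab numbering
# (from the NC_045512.2 annotation; only the two needed here).
NSP_START = {"NSP6": 3570, "NSP14": 5926}


def to_orf1ab(gene, aa_pos):
    """Convert an ORF1a or ORF1b amino-acid position to ORF1ab numbering."""
    if gene == "ORF1a":
        return aa_pos
    if gene == "ORF1b":
        # Assumption: ORF1b numbering restarts at 1 right after ORF1a.
        return aa_pos + ORF1A_LENGTH
    raise ValueError(f"unsupported gene: {gene}")


def to_nsp(gene, aa_pos, nsp):
    """Convert a position to numbering within the given mature protein."""
    return to_orf1ab(gene, aa_pos) - NSP_START[nsp] + 1


print(to_nsp("ORF1a", 3829, "NSP6"))   # 260
print(to_nsp("ORF1b", 1746, "NSP14"))  # 222
```

Both results agree with the labels given in the issue description (NSP6_L260F and NSP14_D222Y).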
Summary:
CoV-Spectrum also uses the annotation from Nextstrain. Is that correct? ------edit 2023/3/5------
@InfrPopGen @corneliusroemer @thomasppeacock @AngieHinrichs In the case of this proposed issue, it is worth flagging even though the low numbers mean we should take this growth advantage with a big grain of salt.
Is this S:S486P or S:F486P? I'm guessing F?
@Mike-Honey It depends on how you choose your reference. If it's relative to ancestral then it's F486P, but I think @ryhisner is using XBB.1 as the reference here.
Those use different references. In XBB.1, tracing back to BM.1.1.1, codon 486 of the S protein is S; relative to the root, that codon is F.
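To make the naming point concrete, here is a tiny sketch. The reference residues are the ones given in the replies above; the dictionary and function names are hypothetical.

```python
# Residue at spike codon 486 in each reference, per the replies above.
REF_AA_486 = {"root": "F", "XBB.1": "S"}


def label(gene, pos, new_aa, ref_aa):
    """Format an amino-acid substitution label relative to a chosen reference."""
    return f"{gene}:{ref_aa}{pos}{new_aa}"


for ref_name, ref_aa in REF_AA_486.items():
    print(ref_name, label("S", 486, "P", ref_aa))
# root S:F486P
# XBB.1 S:S486P
```

The sequence itself is identical in both cases; only the residue written before the position changes with the reference.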
Added new lineage XBB.1.16 from #1723 with 3 new sequence designations, and 0 updated
Thanks for submitting. We've added lineage XBB.1.16 with 3 newly designated sequences, and 0 updated. Defining mutations A22101T (S:E180V), T28297C, C29386T (following C11750T (ORF1a:L3829F), G18703T (ORF1b:D1746Y), A22995G (S:K478R), T23018C (S:S486P), A14856G, A28447G).
Thank you @InfrPopGen for your Sunday work!
I added extra sequences to make inference more robust; there were only 3 designations thus far. The lineage seems to be on a shared branch with XBB.1.12, defined by 11956T. Do you agree? The presence of many basal sequences and a clean tree on that branch makes me think that this is a plausible sequence of events. I was struck by some XBB.1.9 having the same mutation (11956T): is that a defining mutation, or was it pulled into the tree via preferential sampling by @ryhisner? Maybe there are some dropout samples in the designated sequences, causing >10% of sequences to miss that mutation. In that case, XBB.1.9 may also be on that branch.
Edit: I looked into 11956T in XBB.1.9 (sampling homogeneously from across XBB.1.9). It looks like 11956T pops up in XBB.1.9.1 only. Strange. So real homoplasy? Or could this be an artefact in some way? Investigation welcome :)
That part of the tree is unfortunately messed up on Usher. I have a gut feeling that low-coverage/bad-QC sequences screw up the tree there. Maybe it would be possible to make 2 Usher trees:
1. One with a high bar for quality, maybe including only labs known for good quality, and use that for the macro-structure.
2. Place lower-quality sequences, but make sure this doesn't cause flip-flopping. Maybe lower-quality sequences need to be scored differently in the parsimony cost function, or flip-flopping needs to be penalized so that it gets optimized away.
@AngieHinrichs I try very hard to make the Nextclade reference tree as close as possible, in macro structure, to what we consider to be the real tree. Maybe some sort of hybrid could be possible: a macro constraint tree using human curation (pango lineages, manual constraint tree, overwriting artefacts etc.), then letting Usher fill in the gaps below there, and potentially suggest where the macro tree may be wrong.
Yes, XBB.1.9 needs a little fixup. If you label with back-mutations you can see a couple there. XBB.1.9.1 (XBB.1.9 > C11956T > S:S486P (T23018C)) has ~200 sequences with N:T362I (C29358T); then there are 5 sequences with N:T362I (C29358T) and S:S486P (T23018C) but without 11956T (so reversion T11956C), and those have pulled in sequences that have S:S486P (T23018C) but neither 11956T nor N:T362I (C29358T), including the XBB.1.9.2 branch, doh.
I think I can fix this by temporarily removing those 5 sequences and reoptimizing. But it would be even better to prevent these situations, where a few wayward sequences can pull in a large branch by adding reversions. I'm wondering if there is a way to prevent that from happening in matOptimize, or maybe to make a utility that recognizes the pattern and puts those post-reversion branches where they would go without reversions (unless that would be a clear loss parsimony-wise).
I'm not sure that excluding all sequences from certain labs is the right way to go about it (although I agree some labs seem to cause more trouble than others). Even labs that produce some frustrating sequences also produce some OK sequences, and sometimes they cover an undersampled part of the world. On the other hand, any lab that produces tons of sequences will produce some bad ones that cause trouble like this even if overall quality is good.
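A utility that "recognizes the pattern" could start from something as simple as the sketch below, which flags substitutions on a root-to-tip path that exactly undo an earlier substitution at the same site. This is not how UShER or matOptimize work internally; the function name and the example path are hypothetical, with the mutations borrowed from the comment above.

```python
import re

# Nucleotide substitution such as "C11956T": ref base, position, alt base.
MUT = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def find_reversions(path):
    """Return substitutions that undo an earlier change on the same path.

    `path` lists nucleotide substitutions ordered from root to tip.
    """
    last_change = {}  # position -> (ref, alt) of the most recent substitution
    reversions = []
    for mut in path:
        m = MUT.match(mut)
        if m is None:
            raise ValueError(f"not a nucleotide substitution: {mut}")
        ref, pos, alt = m.group(1), int(m.group(2)), m.group(3)
        # A reversion swaps the bases of the previous change at this site.
        if last_change.get(pos) == (alt, ref):
            reversions.append(mut)
        last_change[pos] = (ref, alt)
    return reversions


# Hypothetical path resembling the five problematic sequences.
path = ["C11956T", "T23018C", "C29358T", "T11956C"]
print(find_reversions(path))  # ['T11956C']
```

Flagged branches would still need a parsimony check before being moved, since, as noted above, relocating them is only sensible when it is not a clear loss parsimony-wise.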
@corneliusroemer C11956T is highly homoplasic, as I pointed out back in the XBB.1.9.1/2 issue. I don't think it has missed a single lineage since the start of the pandemic (I'm exaggerating to make the point clear, but it has popped up everywhere, every time).
Is there a way to do that? For example, adding a penalty for reversions? (For example, scoring reversion mutations as 2 instead of 1 in the parsimony score.)
Yes, I think giving 2 instead of 1 would get rid of most reversions, but occasionally there is a real one, e.g. in BA.2 we saw several genuine reversions of A23040G (S:Q493R), and any time there is a recombinant, reversions may be helpful to place it on or near one of its parent lineages. Adding a small fractional penalty for reversions would probably be nice, but it would mess up the simple integer parsimony scoring that is very fast. I have seen an alignment scoring scheme that uses larger integers like 100 instead of 1 for a match, so certain nucleotide matches/changes could be given slightly higher or lower scores/penalties calibrated to the species being aligned (Chiaromonte 2002), and gap extension could be penalized significantly less than gap initiation. I think there are tie-breaking situations in which reversions are supposed to be favored less. Really I don't understand matOptimize well enough to know exactly what to ask for. I need to compile some good examples and test cases for @yceh and hope he has some time. :)
I guess that with a reversion cost of 2-3 in the parsimony score, real reversions like those in BA.2 will still be detected, as they usually come together with other groups of mutations that keep the placement with the reversion optimal even at that cost. False reversions, meanwhile, will be greatly reduced at a cost of 2-3, making branch-specific labels a more practical way to counter the remaining false reversions.
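The argument can be illustrated with a toy comparison of two candidate placements under different reversion costs. This is a sketch under stated assumptions, not matOptimize's scoring: costs are scaled by 100 so that intermediate penalties stay integers, and the candidate mutation counts are invented for the example.

```python
FORWARD_COST = 100  # scaled stand-in for the usual parsimony cost of 1


def placement_cost(n_forward, n_reversions, reversion_cost):
    """Integer cost of a placement needing the given numbers of mutations."""
    return n_forward * FORWARD_COST + n_reversions * reversion_cost


def best(candidates, reversion_cost):
    """Return the names of the cheapest placements (more than one on a tie)."""
    costs = {
        name: placement_cost(fwd, rev, reversion_cost)
        for name, (fwd, rev) in candidates.items()
    }
    low = min(costs.values())
    return sorted(name for name, cost in costs.items() if cost == low)


# Hypothetical false reversion: one forward mutation would avoid it.
spurious = {"with_reversion": (0, 1), "without_reversion": (1, 0)}
# Hypothetical genuine reversion: avoiding it costs three forward mutations.
genuine = {"with_reversion": (0, 1), "without_reversion": (3, 0)}

for cost in (100, 200):
    print(cost, best(spurious, cost), best(genuine, cost))
# 100 ['with_reversion', 'without_reversion'] ['with_reversion']
# 200 ['without_reversion'] ['with_reversion']
```

With equal costs the false reversion is a tie, so it can be chosen arbitrarily; at double cost it loses, while the well-supported reversion is still preferred. An intermediate value such as 150 gives the same outcome here and remains an integer thanks to the scaling.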
Description
Sub-lineage of: XBB.1
Earliest sequence: 2023-1-23, USA, New York — EPI_ISL_16835403
Most recent sequence: 2023-2-24, India, Maharashtra— EPI_ISL_17073064; Singapore (with travel from India) — EPI_ISL_17030043; Denmark — EPI_ISL_17048705
Countries circulating: Primarily in India. Has been sequenced in India (23), USA (7, at least five with international travel history), Singapore (6, all with travel from India), England (2), Denmark (1), Germany (1), Ireland (1), Italy (1).
Number of Sequences: 42
GISAID Query: T12730A, T28297C, A28447G
CovSpectrum Query: T12730A, T28297C, A28447G
Substitutions on top of XBB.1:
Spike: E180V, K478R, S486P
ORF9b: I5T, N55S
ORF1a: L3829F (NSP6_L260F)
ORF1b: D1746Y (NSP14_D222Y)
Nucleotide: C11750T, C11956T, T12730A, A14856G, G18703T, A22101T, A22995G, T23018C, T28297C, A28447G, C29386T
USHER Tree
The Usher tree looks as if it has two very separate branches, but this is an artifact from the very low spike coverage in most of the Indian sequences here. The branches in the lower section of the tree consist almost entirely of artifactual reversions. Similarly, all the sequences that appear to lack S:E180V merely lack coverage there and therefore almost certainly possess it.
https://nextstrain.org/fetch/raw.githubusercontent.com/ryhisner/jsons/main/XBB.1_Lineage_20_seq_Tree_subtreeAuspice1_genome_1cd4f_28ede0.json
Evidence
This saltation lineage has already spread quite widely across the globe, but of the non-Indian sequences with adequate metadata about travel history, almost all indicate international travel, mostly from India. One USA sequence lists travel history from Ethiopia, two list India, and the rest do not specify a country (but were sequenced by Ginkgo Bioworks, which only sequences incoming international travelers). All six sequences from Singapore have travel history in India. Sequencing in India has been rather sparse of late, so this may comprise a substantial fraction of infections there, particularly given it was first sequenced on January 23.
S:K478R has been present in a few smaller lineages (CM.4.1, BA.2.38.3) and regularly appears in scattered sequences. ORF1a:L3829F is of course found in all BQ* sequences, but it is also one of the most convergent ORF1a mutations found in chronic-infection sequences. ORF9b:I5T (T28297C) is in XBB.1.9 and has been posited to be the reason XBB.1.9 lineages seem to grow somewhat faster than XBB.1.5. ORF9b has been implicated in immune evasion, primarily interferon suppression I think, so it's possible ORF9b:N55S could confer some further resistance to immunity. Both of these ORF9b mutations are synonymous in N.
Genomes
EPI_ISL_16835403, EPI_ISL_16940118, EPI_ISL_17012463, EPI_ISL_17012465, EPI_ISL_17012469, EPI_ISL_17016337, EPI_ISL_17016347, EPI_ISL_17020434, EPI_ISL_17024073, EPI_ISL_17029900, EPI_ISL_17029986, EPI_ISL_17030006, EPI_ISL_17030031, EPI_ISL_17030043, EPI_ISL_17032330, EPI_ISL_17048705, EPI_ISL_17066648, EPI_ISL_17066668, EPI_ISL_17073035-17073036, EPI_ISL_17073038-17073041, EPI_ISL_17073047, EPI_ISL_17073050, EPI_ISL_17073054, EPI_ISL_17073059, EPI_ISL_17073061-17073064, EPI_ISL_17076689, EPI_ISL_17078570-17078572, EPI_ISL_17078574-17078577, EPI_ISL_17078591, EPI_ISL_17084712
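The accession list uses shorthand ranges such as EPI_ISL_17073035-17073036. A small helper like the following (the function name is hypothetical, and it assumes every range is inclusive and shares the EPI_ISL_ prefix) expands them into individual accessions, which can be handy for counting or for pasting into a search.

```python
def expand_accessions(text, prefix="EPI_ISL_"):
    """Expand a comma-separated accession list, including inclusive ranges."""
    ids = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        if not item.startswith(prefix):
            raise ValueError(f"unexpected accession: {item}")
        body = item[len(prefix):]
        if "-" in body:
            # "17073035-17073036" means every number from first to last.
            first, last = (int(part) for part in body.split("-"))
            ids.extend(f"{prefix}{n}" for n in range(first, last + 1))
        else:
            ids.append(f"{prefix}{int(body)}")
    return ids


sample = "EPI_ISL_17066668, EPI_ISL_17073035-17073036, EPI_ISL_17073038-17073041"
print(len(expand_accessions(sample)))  # 7
```

Applied to the full list above, this yields 42 accessions, matching the sequence count given in the description.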