Add sequences for 46 underrepresented Omicron lineages #1774

Merged

AngieHinrichs
Member

I noticed that some important Omicron lineages have very few designated sequences. For example, BQ.1.1 has only two designated sequences, while the UShER tree has over 160,000 BQ.1.1 sequences (excluding sublineages). BF.7.7 has only one designated sequence, despite the UShER tree having over 1,000 sequences on the BF.7.7 branch.

I compared designated sequence counts (from lineages.csv) against UShER tree assignment counts for Omicron lineages and found 46 lineages with >100x fewer designated sequences than tree-assigned sequences. This PR adds randomly selected sequences for those 46 lineages. The sequences come from a quality-filtered tree and are also found in a list of quality-filtered sequences from @InfrPopGen. The tree filter excludes sequences with two or more reversions relative to the lineage root, as well as branches with many mutations for few sequences (matUtils extract --max-mutation-density 2). Along the way I made several corrections to sequences already in lineages.csv:

  • Shortly after BF.7.7 was designated in b706220, commit 4cc824a added BE.1.4 but also inadvertently replaced all but one of the BF.7.7 designations with BE.1.4 designations, which is how BF.7.7 came to have only one designated sequence. I changed those back from BE.1.4 to BF.7.7 in d49f27e before adding more sequences for BF.7.7.
  • 30 sequences were designated BQ.1.1.69 but are really BQ.1.18 (unless I have outdated sequences?), so I changed those in 31af5ce.
  • 48 sequences were still designated BA.5.2.1 but should have been updated to BF.7, so I did that in 4805c1b.
  • 5 sequences should have been updated from BA.5.2 to BA.5.2.6, so I did that in bccb740.
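
For illustration only, a redesignation like the ones listed above could be sketched as follows. This is not the code used for the PR; the `relabel` helper and the sequence names (`seqA`, `seqB`, `seqC`) are hypothetical, and rows are assumed to be (taxon, lineage) pairs as read from lineages.csv.

```python
def relabel(rows, taxa, old, new):
    """Return rows with lineage `old` changed to `new`, for the listed taxa only.

    rows: list of (taxon, lineage) pairs (assumed layout of lineages.csv).
    taxa: set of taxon names whose designation should change.
    """
    out = []
    for taxon, lineage in rows:
        # Only touch rows that match both the taxon list and the old lineage,
        # so unrelated designations of the old lineage are left alone.
        if taxon in taxa and lineage == old:
            lineage = new
        out.append((taxon, lineage))
    return out


# Hypothetical sequence names, for illustration only.
rows = [("seqA", "BE.1.4"), ("seqB", "BE.1.4"), ("seqC", "BA.5.2.1")]
fixed = relabel(rows, {"seqA"}, "BE.1.4", "BF.7.7")
```

Restricting the change to an explicit set of taxa matters here because a lineage such as BE.1.4 also has correctly designated sequences that must keep their label.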

The scripts deduplicate_keeping_first.py and deduplicate_keeping_last.py were very handy for this, but both add an extra blank line at the end of the rewritten lineages.csv, so I fixed that in cb0eeb4.
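
A minimal sketch of the idea behind those scripts, not their actual implementation: drop repeated taxa while keeping either the first or the last occurrence, then write the file with exactly one trailing newline. The `dedupe` and `render` helpers, the `taxon,lineage` header, and the sample rows are assumptions made for illustration.

```python
def dedupe(lines, keep="first"):
    """Drop repeated taxa from 'taxon,lineage' lines, keeping first or last."""
    header = lines[0]
    # Ignore empty lines so a stray blank line is never carried over.
    body = [line for line in lines[1:] if line.strip()]
    if keep == "last":
        body = body[::-1]  # scan from the end so the last occurrence wins
    seen, kept = set(), []
    for line in body:
        taxon = line.split(",", 1)[0]
        if taxon not in seen:
            seen.add(taxon)
            kept.append(line)
    if keep == "last":
        kept = kept[::-1]  # restore original file order
    return [header] + kept


def render(lines):
    """Join lines with a single trailing newline and no extra blank line."""
    return "\n".join(lines) + "\n"


# Hypothetical input, including the kind of trailing blank line described above.
lines = ["taxon,lineage", "seqA,BE.1.4", "seqB,BQ.1.1", "seqA,BF.7.7", ""]
first = dedupe(lines, keep="first")
last = dedupe(lines, keep="last")
text = render(first)
```

Keeping the last occurrence is the useful mode when corrected designations are appended after the rows they supersede.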

There are still some lineages with >90x fewer designated sequences than big-tree sequences, but this is already a lot of changes. Also, GitHub is complaining about the size of lineages.csv, which is 74.9 MB after this PR.
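
The ratio comparison and random selection described in this PR can be sketched roughly as below. This is an illustration under stated assumptions, not the procedure actually run: `underrepresented` and `sample_additions` are hypothetical helpers, the tree counts are assumed to be available as a plain dict, and the names and counts are made up (the BQ.1.1 figures only echo the rough numbers quoted above).

```python
import random
from collections import Counter


def underrepresented(designated, tree_counts, min_ratio=100):
    """Map lineage -> tree/designated ratio where the ratio exceeds min_ratio.

    designated: list of (taxon, lineage) pairs (assumed lineages.csv layout).
    tree_counts: dict of lineage -> number of tree-assigned sequences.
    """
    counts = Counter(lineage for _, lineage in designated)
    flagged = {}
    for lineage, n_tree in tree_counts.items():
        n_des = counts.get(lineage, 0)
        # Skip lineages with no designated sequences: the ratio is undefined.
        if n_des > 0 and n_tree / n_des > min_ratio:
            flagged[lineage] = n_tree / n_des
    return flagged


def sample_additions(candidates, trusted, already, k, seed=0):
    """Randomly pick up to k candidates that are trusted and not yet designated."""
    pool = sorted((set(candidates) & set(trusted)) - set(already))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical data for illustration.
designated = [("s1", "BQ.1.1"), ("s2", "BQ.1.1"), ("s3", "BA.2")]
tree_counts = {"BQ.1.1": 160000, "BA.2": 50}
flagged = underrepresented(designated, tree_counts)

picks = sample_additions(
    candidates=["t1", "t2", "t3", "t4"],  # sequences on the lineage's branch
    trusted={"t2", "t3", "t4"},           # second quality-filtered list
    already={"t4"},                       # already in lineages.csv
    k=5,
)
```

Lowering `min_ratio` to 90 would surface the remaining lineages mentioned above.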

This exercise was also useful for finding several lineage annotations in the UShER tree that need to be shifted up or down by a node or three.

…ommit b706220 added 111 BF.7.7 designations, but then 4cc824a wiped out all but one of them, redesignating them as BE.1.4 and adding some more sequences as BE.1.4 that should be BF.7.7.
@AngieHinrichs force-pushed the addSequences_2023-03-16 branch from 1f9ee66 to b487aa6 on March 20, 2023 20:19
@AngieHinrichs merged commit 15464b9 into cov-lineages:master on Mar 20, 2023