Gene Novelty Unit-based Virus IDentification for SARS-CoV-2
GNUVID (GNU-based Virus IDentification) is a Python3 program. It ranks CDS nucleotide sequences in a genome fna file based on the number of observed exact CDS nucleotide matches in a public or private database. It was created to type SARS-CoV-2 genomes using a whole genome multilocus sequence typing (wgMLST) approach. The 10 ORFs (ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, ORF10) in SARS-CoV-2 are used for typing. It automatically assigns allele numbers to each of the 10 ORFs and a Sequence Type (ST) to each genome, based on its profile of unique gene allele sequences. It is based on our recent panallelome approach implemented in WhatsGNU. The STs are then clustered into bigger groups which are designated clonal complexes (CCs) based on their grouping on a minimum spanning tree (MST). The CCs are more granular than a Pango Lineage. It can type your query genome in seconds. As of GNUVID v2.0, GNUVID_Predict.py is a speedy algorithm for assigning Clonal Complexes to new genomes, which uses a Machine Learning Random Forest Classifier.
GNUVID is now published Moustafa AM and Planet PJ 2021. Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing. Genome Biology and Evolution;13(9):evab197
We acknowledge the open-science of the individual research labs and public agencies that have made their SARS-CoV-2 genomes available on GISAID.
Make a new environment and install GNUVID in it
conda create -n GNUVID -c bioconda gnuvid
conda activate GNUVID
-
1,392,002 High Quality GISAID sequences have been included in this analysis.
-
GNUVID compressed the 13920020 ORFs in the 1392002 genomes to 755489 unique alleles.
-
731164 Sequence Types (STs) have been assigned in this dataset and were clustered in 4084 clonal complexes (CCs).
-
1196 new CCs have been assigned (2888 CCs in Jun 2021 to 4084 in Aug 2021).
-
3123 CCs have been Inactive (i.e. Last time seen more than 1 month before 2021-08-31).
-
397 CCs have gone Quiet (i.e. Last seen 2-4 weeks before 2021-08-31).
-
564 CCs have been Active (i.e. Last seen within the 2 weeks before 2021-08-31).
GNUVID now reports the WHO Naming system for VOCs/VOIs/VUMs (e.g. Alpha, Beta..etc) as per the WHO updated on 10/22/2021:
-
1597 CCs representing the Alpha VOC (a.k.a. B.1.1.7 and descendant Q.* lineages).
-
27 CCs representing the Beta VOC (a.k.a. B.1.351 and descendant lineages).
-
117 CCs representing the Gamma VOC (a.k.a. P.1 and descendant lineages).
-
777 CCs representing the Delta VOC (a.k.a. B.1.617.2 and descendant AY.* lineages).
-
6 CC representing the Lambda VOI (a.k.a. C.37).
-
6 CCs representing the Mu VOI (a.k.a. B.1.621).
-
225 CCs representing the 16 lineages (B.1.427/429, R.1, C.1.2, B.1.466.2, B.1.1.318, B.1.1.519, B.1.1.523, C.36.3, B.1.525, B.1.526, B.1.619, B.1.620, B.1.630, B.1.617.1 and B.1.214.2) that are currently designated Variants Under Monitoring (VUM) by WHO for Further Monitoring.
-
The remaining 1329/4084 CCs are not designated VOC/VOI/VUM by WHO (10/22/2021).
A table showing summary information of the 564 Active Clonal Complexes (CCs) can be found here. A full report for the 4084 CCs can be found here
If you use Conda you can use the Bioconda channel to install it in the conda base: Make a new environment and install GNUVID in it
conda create -n GNUVID -c bioconda gnuvid
conda activate GNUVID
The 'conda activate' command is needed to activate the GNUVID environment each time you want to use the tool.
If you do not have Miniconda or Anaconda installed already, you can install one of them from:
OR
GNUVID is a command-line application written in Python3. Simply download and use! You will have to install dependencies!
$git clone https://github.com/ahmedmagds/GNUVID
$cd GNUVID/bin
$chmod +x *.py
$pwd
#pwd will give you a path/to/folder/having/GNUVID which you will use in next command
$export PATH=$PATH:/path/to/folder/having/GNUVID/bin
If you need it permanently, you can add this last line to your .bashrc or .bash_profile.
- Type GNUVID_Predict.py -h and it should output help screen.
- Type GNUVID_Predict.py -v and you should see an output like GNUVID.py v2.4.
- Query whole genome FASTA file (.fna) (it can have multiple genomes as separate FASTA records).
GNUVID_Predict.py will use exact matching to identify alleles of the 10 ORFs. If any novelty or ambiguity seen, Random Forest Classifier is used to classify your new genome to one of the Clonal complexes (CC))
$GNUVID_Predict.py new_genomes.fasta
$GNUVID_Predict.py -i -o new_genomes_GNUVID new_genomes.fasta
usage: GNUVID_Predict.py [-h] [-o OUTPUT_FOLDER] [-m MIN_LEN] [-n N_MAX] [-b BLOCK_PRED] [-e] [-i] [-f] [-q] [-v] query_fna
GNUVID v2.4 uses the natural variation in public genomes of SARS-CoV-2 to rank
gene sequences based on the number of observed exact matches (the GNU score)
in all known genomes of SARS-CoV-2. It assigns a sequence type to each genome
based on its profile of unique gene allele sequences. It can type (using whole
genome multilocus sequence typing; wgMLST) your query genome in seconds.
GNUVID_Predict is a speedy algorithm for assigning Clonal Complexes to new
genomes, which uses machine learning Random Forest Classifier, implemented as
of GNUVID v2.0.
positional arguments:
query_fna Query Whole Genome Nucleotide FASTA file to analyze
(.fna)
optional arguments:
-h, --help show this help message and exit
-o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
Output folder and prefix to be created for results (default: timestamped GNUVID_results in the current directory)
-m MIN_LEN, --min_len MIN_LEN
minimum sequence length [Default: 15000]
-n N_MAX, --n_max N_MAX
maximum proportion of ambiguity (Ns) allowed [Default: 0.5]
-b BLOCK_PRED, --block_pred BLOCK_PRED
prediction block size, good for limited memory [Default: 1000]
-e, --exact_matching turn off exact matching (no allele will be identified for each ORF) and only use machine learning prediction
[default: False]
-i, --individual Individual Output file for each genome showing the allele sequence and GNU score for each gene allele
-f, --force Force overwriting existing results folder assigned with -o (default: off)
-q, --quiet No screen output [default OFF]
-v, --version print version and exit
GNUVID_results_date_time.csv (csv file, specify different name using -o option)
Sequence ID | GNUVID DB Version | ORF1ab | Surface_glycoprotein | ORF3a | Envelope_protein | Membrane_glycoprotein | ORF6 | ORF7a | ORF8 | Nucleocapsid_phosphoprotein | ORF10 | Exact ST | First Country seen | First date seen | Last country seen | Last date seen | CC | probability | WHO Naming | Quality Check |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
isolate_x | 06/21/21 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | China | 2019-12-30 | India | 2020-08-12 | 4 | Exact | NA | passed |
isolate_y | 06/21/21 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | failed (seq_len:4) |
isolate_z | 06/21/21 | None | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | None | NA | NA | NA | NA | 292115 | 0.8 | Delta | passed |
- Column 1: Query Sequence name
- Column 2: GNUVID Database version (results will vary as more genomes are added to the DB)
- Columns 3-12: The allele numbers for the 10 ORFs (If None, it means the allele was not seen in the database but has degenerate bases (N) so cannot be called novel)
- Column 13: ST
- Column 14: First Country where the ST was seen (only if exact)
- Column 15: First Date when the ST was seen (only if exact)
- Column 16: Last Country where the ST was seen (only if exact)
- Column 17: Last Date when the ST was seen (only if exact)
- Column 18: Clonal Complex (CC) assigned
- Column 19: Probability of the assignment (if exact, it means this is an exact match to a previous genome in the database)
- Column 20: WHO Naming will be reported if isolate belongs to VOCs/VOIs/Alerts as designated by WHO
- Column 21: Quality check before prediction (passed or failed (reason))
GNUVID_date_time.log (Log file, e.g. GNUVID_20200607_170457.log)
Genome1.csv (csv output file) GNUVID DB Version
Query Gene | GNUVID DB Version | GNU score | length | sequence | Ns count | Allele number | First date seen | Last date seen |
---|---|---|---|---|---|---|---|---|
isolate_x_ORF1ab | 10/20/20 | 2000 | 21290 | ATGTAA | 0 | 1 | 2019-12-24 | 2020-05-04 |
isolate_x_ORF10 | 10/20/20 | 0 | 117 | ATGTAA | 0 | Novel | NA | NA |
- Column 1: Query Gene name
- Column 2: GNUVID Database version (results will vary as more genomes are added to the DB
- Column 3: GNU score (number of exact matches in the database, GNU=0 novel allele never seen before)
- Column 4: Query gene sequence length
- Column 5: Gene sequence
- Column 6: Number of Ns and degenerate bases in the query gene sequence
- Column 7: Alelle number from the database (If None, it means the allele was not seen in the database but has degenerate bases (N) so cannot be called novel)
- Column 8: First date this allele was seen (NA if novel)
- Column 9: Last date this allele was seen (NA if novel)
Note: This report should have 10 rows for the ORFs. It will be produced for each genome. It is valuable if you interested to know more about each ORF allele and how many times it was seen globally (GNU score) and when it was first- and last- time seen.
Instructions for how to use GNUVID.py for compression and classification here
Please submit via the GitHub issues page: https://github.com/ahmedmagds/GNUVID/issues
GPLv3: https://github.com/ahmedmagds/GNUVID/blob/master/LICENSE
The data used in GNUVID is from GISAID, but sequences were anonymized to fit with guidelines. Appropriate acknowledgements for the labs that provided the original SARS-CoV-2 genome sequences to GISAID are also provided here
Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing
Moustafa AM and Planet PJ 2020, bioRxiv;2020.12.28.424582
Rapid whole genome sequence typing reveals multiple waves of SARS-CoV-2 spread
Moustafa AM and Planet PJ 2020, bioRxiv;2020.06.08.139055
- WhatsGNU 'Moustafa AM and Planet PJ 2020, Genome Biology;21:58'.
- MAFFT version 7 'Katoh and Standley 2013, Molecular Biology and Evolution;30:772-780'.
- pandas 'Reback et al. 2020, DOI:10.5281/zenodo.3509134'.
- minimap2 'Li H 2018, Bioinformatics; 34:18'.
- gofasta 'https://github.com/cov-ert/gofasta'
- Scikit-learn 'Pedregosa et al. 2011, JMLR; 12:2825-2830'.
- BLAST+ 'Camacho et al. 2009, BMC Bioinformatics; 10:421'.
- GISAID 'Shu Y. and McCauley J. 2017, EuroSurveillance; 22:13'.
- The reference genome MN908947 'Wu et al. 2020, Nature; 579:265–269'.
- eBURST 'Feil et al. 2004, Journal of Bacteriology; 186:1518'.
- goeBURST 'Francisco et al. 2009, BMC Bioinformatics; 10:152'.
- PHYLOViZ 2.0 'Nascimento et al. 2017, Bioinformatics; 33:128-129'.
Ahmed M. Moustafa: ahmedmagds
Twitter: Ahmed_Microbes