GNUVID

Gene Novelty Unit-based Virus IDentification for SARS-CoV-2

Introduction

GNUVID (GNU-based Virus IDentification) is a Python3 program that ranks the CDS nucleotide sequences in a genome FASTA (.fna) file by the number of exact CDS nucleotide matches observed in a public or private database. It was created to type SARS-CoV-2 genomes using a whole genome multilocus sequence typing (wgMLST) approach based on the 10 ORFs of SARS-CoV-2 (ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, ORF10). GNUVID automatically assigns an allele number to each of the 10 ORFs and a Sequence Type (ST) to each genome, based on its profile of unique gene allele sequences. It builds on our recent panallelome approach implemented in WhatsGNU. The STs are then clustered into larger groups, designated clonal complexes (CCs), based on their grouping on a minimum spanning tree (MST). CCs are more granular than Pango lineages. GNUVID can type your query genome in seconds. As of GNUVID v2.0, GNUVID_Predict.py provides a fast way to assign clonal complexes to new genomes using a machine learning Random Forest classifier.
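To make the allele and ST bookkeeping described above concrete, here is a minimal sketch in Python. It is not GNUVID's actual code: the `AlleleDatabase` class, its method names, and the rule that new alleles and STs are numbered in order of first appearance are all assumptions made for illustration. It also ignores ambiguous bases, which GNUVID handles separately.

```python
from typing import Dict, Tuple

# The 10 ORFs used for typing, in the order listed above.
ORFS = ["ORF1ab", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF8", "N", "ORF10"]


class AlleleDatabase:
    """Toy wgMLST database (hypothetical): exact-match alleles and STs."""

    def __init__(self) -> None:
        # One {sequence: allele number} table per ORF.
        self.alleles: Dict[str, Dict[str, int]] = {orf: {} for orf in ORFS}
        # {allele profile: ST number}
        self.sequence_types: Dict[Tuple[int, ...], int] = {}

    def allele_number(self, orf: str, sequence: str) -> int:
        """Return the allele number for an exact sequence, adding it if unseen."""
        table = self.alleles[orf]
        if sequence not in table:
            # Assumption: unseen alleles get the next free number.
            table[sequence] = len(table) + 1
        return table[sequence]

    def type_genome(self, cds: Dict[str, str]) -> Tuple[Tuple[int, ...], int]:
        """Return (allele profile, ST) for a genome given its 10 ORF sequences."""
        profile = tuple(self.allele_number(orf, cds[orf]) for orf in ORFS)
        if profile not in self.sequence_types:
            self.sequence_types[profile] = len(self.sequence_types) + 1
        return profile, self.sequence_types[profile]


db = AlleleDatabase()
genome_a = {orf: "ATGAAATAA" for orf in ORFS}   # made-up sequences
genome_b = dict(genome_a, S="ATGGGGTAA")        # differs from genome_a only in S
print(db.type_genome(genome_a))
print(db.type_genome(genome_b))
```

Two genomes share an ST only if all 10 ORF sequences match exactly, so a single nucleotide change in one ORF yields a new allele and therefore a new ST.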

GNUVID is now published: Moustafa AM and Planet PJ 2021. Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing. Genome Biology and Evolution;13(9):evab197.

We acknowledge the open-science of the individual research labs and public agencies that have made their SARS-CoV-2 genomes available on GISAID.

Installing and using GNUVID is as simple as

Make a new environment and install GNUVID in it

conda create -n GNUVID -c bioconda gnuvid
conda activate GNUVID

Globally circulating clonal complexes as of 2021-08-31:

  • 1,392,002 High Quality GISAID sequences have been included in this analysis.

  • GNUVID compressed the 13,920,020 ORFs in the 1,392,002 genomes to 755,489 unique alleles.

  • 731,164 Sequence Types (STs) have been assigned in this dataset and were clustered into 4084 clonal complexes (CCs).

  • 1196 new CCs have been assigned (2888 CCs in Jun 2021 to 4084 in Aug 2021).

  • 3123 CCs are Inactive (i.e. last seen more than 1 month before 2021-08-31).

  • 397 CCs have gone Quiet (i.e. last seen 2-4 weeks before 2021-08-31).

  • 564 CCs are Active (i.e. last seen within the 2 weeks before 2021-08-31).
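The Active/Quiet/Inactive definitions above can be sketched as a small date comparison. This is an illustration, not GNUVID code: the function name `cc_status` is hypothetical, and because the definitions leave a gap between "4 weeks" and "1 month", the sketch assumes a 28-day cutoff between Quiet and Inactive.

```python
from datetime import date


def cc_status(last_seen: date, snapshot: date) -> str:
    """Classify a clonal complex by how long ago it was last seen."""
    days = (snapshot - last_seen).days
    if days < 0:
        raise ValueError("last_seen is after the snapshot date")
    if days <= 14:       # seen within the 2 weeks before the snapshot
        return "Active"
    if days <= 28:       # seen 2-4 weeks before the snapshot (assumed cutoff)
        return "Quiet"
    return "Inactive"    # not seen for longer than that


snapshot = date(2021, 8, 31)
print(cc_status(date(2021, 8, 25), snapshot))
print(cc_status(date(2021, 8, 10), snapshot))
print(cc_status(date(2021, 6, 1), snapshot))
```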

GNUVID now reports the WHO naming system for VOCs/VOIs/VUMs (e.g. Alpha, Beta, etc.), following the WHO update of 10/22/2021:

  • 1597 CCs representing the Alpha VOC (a.k.a. B.1.1.7 and descendant Q.* lineages).

  • 27 CCs representing the Beta VOC (a.k.a. B.1.351 and descendant lineages).

  • 117 CCs representing the Gamma VOC (a.k.a. P.1 and descendant lineages).

  • 777 CCs representing the Delta VOC (a.k.a. B.1.617.2 and descendant AY.* lineages).

  • 6 CCs representing the Lambda VOI (a.k.a. C.37).

  • 6 CCs representing the Mu VOI (a.k.a. B.1.621).

  • 225 CCs representing the 16 lineages (B.1.427/429, R.1, C.1.2, B.1.466.2, B.1.1.318, B.1.1.519, B.1.1.523, C.36.3, B.1.525, B.1.526, B.1.619, B.1.620, B.1.630, B.1.617.1 and B.1.214.2) that are currently designated Variants Under Monitoring (VUM) by WHO.

  • The remaining 1329/4084 CCs are not designated VOC/VOI/VUM by WHO (10/22/2021).

A table showing summary information of the 564 Active Clonal Complexes (CCs) can be found here. A full report for the 4084 CCs can be found here

Installation

Dependencies

Bioconda (recommended)

If you use Conda, you can install GNUVID from the Bioconda channel. Make a new environment and install GNUVID in it:

conda create -n GNUVID -c bioconda gnuvid
conda activate GNUVID

The 'conda activate' command is needed to activate the GNUVID environment each time you want to use the tool.
If you do not have Miniconda or Anaconda installed already, you can install one of them from:

  1. Miniconda
  2. Anaconda

OR

Clone the Github repository

GNUVID is a command-line application written in Python3. Simply download and use it, but note that you will have to install the dependencies yourself.

$ git clone https://github.com/ahmedmagds/GNUVID
$ cd GNUVID/bin
$ chmod +x *.py
$ pwd
# pwd prints the full path to the GNUVID/bin folder, which you will use in the next command
$ export PATH=$PATH:/path/to/folder/having/GNUVID/bin

If you need it permanently, you can add this last line to your .bashrc or .bash_profile.

Test

  • Type GNUVID_Predict.py -h and it should print the help screen.
  • Type GNUVID_Predict.py -v and you should see output like GNUVID.py v2.4.

Usage for GNUVID_Predict.py

Input

  1. Query whole genome FASTA file (.fna) (it can have multiple genomes as separate FASTA records).
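As a rough illustration of what a multi-record input looks like to a program, the sketch below reads FASTA records and applies a length and ambiguity check. The thresholds (15000 and 0.5) are the documented defaults of the `-m` and `-n` options; everything else (the function names `read_fasta` and `passes_qc`, and the exact way the proportion of Ns is computed) is an assumption, not GNUVID's implementation.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the record id.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def passes_qc(sequence: str, min_len: int = 15000, n_max: float = 0.5) -> bool:
    """Assumed check: long enough and not too many Ns."""
    if not sequence or len(sequence) < min_len:
        return False
    return sequence.count("N") / len(sequence) <= n_max


# Usage with an in-memory example; with a real file, pass open("new_genomes.fasta").
records = list(read_fasta([">isolate_y short example", "ACGT"]))
for seq_id, seq in records:
    print(seq_id, len(seq), passes_qc(seq))
```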

Simple

GNUVID_Predict.py uses exact matching to identify the alleles of the 10 ORFs. If any novelty or ambiguity is seen, a Random Forest classifier is used to assign your new genome to one of the clonal complexes (CCs).

$ GNUVID_Predict.py new_genomes.fasta

Use with more options

$ GNUVID_Predict.py -i -o new_genomes_GNUVID new_genomes.fasta

Command line options

usage: GNUVID_Predict.py [-h] [-o OUTPUT_FOLDER] [-m MIN_LEN] [-n N_MAX] [-b BLOCK_PRED] [-e] [-i] [-f] [-q] [-v] query_fna

GNUVID v2.4 uses the natural variation in public genomes of SARS-CoV-2 to rank
gene sequences based on the number of observed exact matches (the GNU score)
in all known genomes of SARS-CoV-2. It assigns a sequence type to each genome
based on its profile of unique gene allele sequences. It can type (using whole
genome multilocus sequence typing; wgMLST) your query genome in seconds.
GNUVID_Predict is a speedy algorithm for assigning Clonal Complexes to new
genomes, which uses machine learning Random Forest Classifier, implemented as
of GNUVID v2.0.

positional arguments:
  query_fna             Query Whole Genome Nucleotide FASTA file to analyze
                        (.fna)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Output folder and prefix to be created for results (default: timestamped GNUVID_results in the current directory)
  -m MIN_LEN, --min_len MIN_LEN
                        minimum sequence length [Default: 15000]
  -n N_MAX, --n_max N_MAX
                        maximum proportion of ambiguity (Ns) allowed [Default: 0.5]
  -b BLOCK_PRED, --block_pred BLOCK_PRED
                        prediction block size, good for limited memory [Default: 1000]
  -e, --exact_matching  turn off exact matching (no allele will be identified for each ORF) and only use machine learning prediction
                        [default: False]
  -i, --individual      Individual Output file for each genome showing the allele sequence and GNU score for each gene allele
  -f, --force           Force overwriting existing results folder assigned with -o (default: off)
  -q, --quiet           No screen output [default OFF]
  -v, --version         print version and exit
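If you call GNUVID_Predict.py from a pipeline, it can help to assemble the argument list programmatically. The sketch below only uses the flags listed in the help text above. The helper name `build_predict_command` and its keyword arguments are hypothetical, and the command is only built here, not executed.

```python
import shlex
from typing import List, Optional


def build_predict_command(
    query_fna: str,
    output_folder: Optional[str] = None,
    min_len: Optional[int] = None,
    n_max: Optional[float] = None,
    block_pred: Optional[int] = None,
    ml_only: bool = False,      # maps to -e (turn off exact matching)
    individual: bool = False,   # maps to -i
    force: bool = False,        # maps to -f
    quiet: bool = False,        # maps to -q
) -> List[str]:
    """Return an argument list for GNUVID_Predict.py (hypothetical helper)."""
    cmd = ["GNUVID_Predict.py"]
    for enabled, flag in ((ml_only, "-e"), (individual, "-i"),
                          (force, "-f"), (quiet, "-q")):
        if enabled:
            cmd.append(flag)
    # Options that take a value are added only when explicitly set,
    # so the tool's own defaults apply otherwise.
    for value, flag in ((output_folder, "-o"), (min_len, "-m"),
                        (n_max, "-n"), (block_pred, "-b")):
        if value is not None:
            cmd += [flag, str(value)]
    cmd.append(query_fna)
    return cmd


cmd = build_predict_command("new_genomes.fasta",
                            output_folder="new_genomes_GNUVID",
                            individual=True)
print(shlex.join(cmd))  # requires Python 3.8+
```

To run the command, pass the list to `subprocess.run(cmd, check=True)` in an environment where GNUVID_Predict.py is on the PATH.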

Output

Always

GNUVID_results_date_time.csv (csv file, specify different name using -o option)

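| Sequence ID | GNUVID DB Version | ORF1ab | Surface_glycoprotein | ORF3a | Envelope_protein | Membrane_glycoprotein | ORF6 | ORF7a | ORF8 | Nucleocapsid_phosphoprotein | ORF10 | Exact ST | First Country seen | First date seen | Last country seen | Last date seen | CC | probability | WHO Naming | Quality Check |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| isolate_x | 06/21/21 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | China | 2019-12-30 | India | 2020-08-12 | 4 | Exact | NA | passed |
| isolate_y | 06/21/21 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | failed (seq_len:4) |
| isolate_z | 06/21/21 | None | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | None | NA | NA | NA | NA | 292115 | 0.8 | Delta | passed |
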
  • Column 1: Query Sequence name
  • Column 2: GNUVID Database version (results will vary as more genomes are added to the DB)
  • Columns 3-12: The allele numbers for the 10 ORFs (If None, it means the allele was not seen in the database but has degenerate bases (N) so cannot be called novel)
  • Column 13: ST
  • Column 14: First Country where the ST was seen (only if exact)
  • Column 15: First Date when the ST was seen (only if exact)
  • Column 16: Last Country where the ST was seen (only if exact)
  • Column 17: Last Date when the ST was seen (only if exact)
  • Column 18: Clonal Complex (CC) assigned
  • Column 19: Probability of the assignment (if exact, it means this is an exact match to a previous genome in the database)
  • Column 20: WHO Naming will be reported if isolate belongs to VOCs/VOIs/Alerts as designated by WHO
  • Column 21: Quality check before prediction (passed or failed (reason))
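The column layout above lends itself to simple post-processing. The sketch below groups sequences into exact matches, machine learning predictions, and quality-check failures. It assumes the file is comma-delimited and that the header names match those shown in the example table ("Sequence ID", "probability", "Quality Check"); check these against your own output file. The function name `summarize_results` and the reduced sample data are illustrative only.

```python
import csv
import io
from typing import Dict, List, TextIO

# Reduced, made-up sample with only the columns used below.
SAMPLE = """Sequence ID,CC,probability,Quality Check
isolate_x,4,Exact,passed
isolate_y,None,None,failed (seq_len:4)
isolate_z,292115,0.8,passed
"""


def summarize_results(handle: TextIO) -> Dict[str, List[str]]:
    """Group sequence IDs by how their CC was assigned."""
    summary: Dict[str, List[str]] = {"exact": [], "predicted": [], "failed": []}
    for row in csv.DictReader(handle):
        seq_id = row["Sequence ID"]
        if not row["Quality Check"].startswith("passed"):
            summary["failed"].append(seq_id)
        elif row["probability"] == "Exact":
            summary["exact"].append(seq_id)
        else:
            # A numeric probability means the classifier made the call.
            summary["predicted"].append(seq_id)
    return summary


print(summarize_results(io.StringIO(SAMPLE)))
```

With a real results file, replace `io.StringIO(SAMPLE)` with `open(path, newline="")`.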

GNUVID_date_time.log (Log file, e.g. GNUVID_20200607_170457.log)

Optional with -i

Genome1.csv (csv output file)

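| Query Gene | GNUVID DB Version | GNU score | length | sequence | Ns count | Allele number | First date seen | Last date seen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| isolate_x_ORF1ab | 10/20/20 | 2000 | 21290 | ATGTAA | 0 | 1 | 2019-12-24 | 2020-05-04 |
| isolate_x_ORF10 | 10/20/20 | 0 | 117 | ATGTAA | 0 | Novel | NA | NA |
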
  • Column 1: Query Gene name
  • Column 2: GNUVID Database version (results will vary as more genomes are added to the DB)
  • Column 3: GNU score (number of exact matches in the database, GNU=0 novel allele never seen before)
  • Column 4: Query gene sequence length
  • Column 5: Gene sequence
  • Column 6: Number of Ns and degenerate bases in the query gene sequence
  • Column 7: Allele number from the database (If None, it means the allele was not seen in the database but has degenerate bases (N) so cannot be called novel)
  • Column 8: First date this allele was seen (NA if novel)
  • Column 9: Last date this allele was seen (NA if novel)

Note: This report should have 10 rows, one per ORF, and is produced for each genome. It is valuable if you are interested in knowing more about each ORF allele: how many times it was seen globally (GNU score) and when it was first and last seen.
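The GNU score and the Novel/None distinction described above can be illustrated with a simple counter. This is a sketch under stated assumptions, not GNUVID's implementation: the function `classify_allele`, the "Known" label (the real report shows the allele number instead), and the toy database are made up, and any character other than A, C, G, or T is treated as a degenerate base.

```python
from collections import Counter
from typing import Tuple


def classify_allele(counts: Counter, query: str) -> Tuple[int, str]:
    """Return (GNU score, label) for one query ORF sequence."""
    query = query.upper()
    score = counts[query]  # Counter returns 0 for unseen sequences
    if score > 0:
        return score, "Known"
    if set(query) - set("ACGT"):
        # Unseen but contains degenerate bases, so it cannot be called novel.
        return 0, "None"
    return 0, "Novel"


# Toy database: every occurrence of an ORF sequence across all genomes.
database = Counter(["ATGTAA", "ATGTAA", "ATGCAA"])

print(classify_allele(database, "ATGTAA"))
print(classify_allele(database, "ATGNAA"))
print(classify_allele(database, "ATGGAA"))
```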

Instructions for using GNUVID.py for compression and classification are available here.

Bugs

Please submit via the GitHub issues page: https://github.com/ahmedmagds/GNUVID/issues

Software Licence

GPLv3: https://github.com/ahmedmagds/GNUVID/blob/master/LICENSE

Source Data

The data used in GNUVID is from GISAID, but sequences were anonymized to fit with guidelines. Appropriate acknowledgements for the labs that provided the original SARS-CoV-2 genome sequences to GISAID are also provided here

Citations

GNUVID

Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing
Moustafa AM and Planet PJ 2020, bioRxiv;2020.12.28.424582
Rapid whole genome sequence typing reveals multiple waves of SARS-CoV-2 spread
Moustafa AM and Planet PJ 2020, bioRxiv;2020.06.08.139055

References

  • WhatsGNU 'Moustafa AM and Planet PJ 2020, Genome Biology;21:58'.
  • MAFFT version 7 'Katoh and Standley 2013, Molecular Biology and Evolution;30:772-780'.
  • pandas 'Reback et al. 2020, DOI:10.5281/zenodo.3509134'.
  • minimap2 'Li H 2018, Bioinformatics; 34:18'.
  • gofasta 'https://github.com/cov-ert/gofasta'
  • Scikit-learn 'Pedregosa et al. 2011, JMLR; 12:2825-2830'.
  • BLAST+ 'Camacho et al. 2009, BMC Bioinformatics; 10:421'.
  • GISAID 'Shu Y. and McCauley J. 2017, EuroSurveillance; 22:13'.
  • The reference genome MN908947 'Wu et al. 2020, Nature; 579:265–269'.
  • eBURST 'Feil et al. 2004, Journal of Bacteriology; 186:1518'.
  • goeBURST 'Francisco et al. 2009, BMC Bioinformatics; 10:152'.
  • PHYLOViZ 2.0 'Nascimento et al. 2017, Bioinformatics; 33:128-129'.

Author

Ahmed M. Moustafa: ahmedmagds
Twitter: Ahmed_Microbes
