A computational framework for identifying epistatically linked sets of SAV alleles and merging them into haplotypes using statistical inference, population genetics, and graph theory.
HELEN (Heralding Emerging Lineages in Epistatic Networks): a computational framework for inference of viral variants as dense communities in epistatic networks
Prerequisites:
- MATLAB
- Gurobi
The main script is
[vocInfer, vocSupport] = HELEN_run(fastafile,fastaref,k_min,k_max,nSol,timeLimit,outdirRes,outToken)
input:
- fastafile: FASTA file with aligned viral sequences
- fastaref: FASTA file with the reference sequence
- k_min: minimal size of a densest candidate subgraph to be generated by HELEN
- k_max: maximal size of a densest candidate subgraph to be generated by HELEN
- nSol: the number of densest subgraphs of each size k (k_min <= k <= k_max), not contained in densest subgraphs of larger sizes, to be generated by HELEN
- timeLimit: time limit (in seconds) for each ILP solver execution by HELEN
- outdirRes: output directory where intermediate HELEN results are saved. Set outdirRes = [] if intermediate results do not need to be saved
- outToken: an identifier attached to each saved output file (used only when outdirRes is not empty)
output:
- vocInfer: a cell array of inferred genomic variants. Each variant is represented by an array of the genomic positions that define it
- vocSupport: vector of support values for the inferred variants
Example: [vocInfer, vocSupport] = HELEN_run('myData.fas','ref.fas',7,27,100,20000,'HELEN_results','myData')
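As a small illustration of how the two outputs fit together, the sketch below ranks variants by their support. The contents of vocInfer and vocSupport here are made-up placeholders standing in for the result of a real HELEN_run call (which needs Gurobi and actual sequence data). Only their shapes follow the description above: a cell array of position arrays and a numeric vector. Treating larger support values as stronger evidence is an assumption.

```matlab
% Hypothetical outputs standing in for a real HELEN_run call.
vocInfer   = {[501 614 681], [222 614], [69 70 144 501]};
vocSupport = [0.8 0.3 0.95];

% Sort variants from highest to lowest support
% (assumes larger support means stronger evidence).
[sortedSupport, order] = sort(vocSupport, 'descend');
sortedVariants = vocInfer(order);

% Print each variant with its support and defining positions.
for i = 1:numel(sortedVariants)
    fprintf('variant %d: support %.2f, positions %s\n', ...
        i, sortedSupport(i), mat2str(sortedVariants{i}));
end
```

Keeping the permutation in order makes it easy to reorder any other per-variant data in the same way.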
For more information, see "Mohebbi, Zelikovsky, Mangul, Chowell, Skums, Community structure and temporal dynamics of SARS-CoV-2 epistatic network allows for early detection of emerging variants with altered phenotypes"
In addition, the repository contains the following data and scripts performing specific subroutines:
- E = constructEpisNetwork(M,rho): a script that constructs epistatic networks from genomic data
  input:
  - M: mutation matrix
  - rho: p-value for detection of epistatically linked pairs
  output:
  - E: list of edges of the constructed epistatic network
- [varInfer, vocSupport] = HELEN_infer(G,k_min,k_max,nSol,timeLimit,outdirRes,outToken): a script that infers viral variants from the epistatic network G
  input:
  - G: epistatic network (as a graph object)
  - k_min, k_max, nSol, timeLimit, outdirRes, outToken: see above
  output: see above
- run_analysis: a script that generates the data and analysis results used in the paper "Mohebbi, Zelikovsky, Mangul, Chowell, Skums, Community structure and temporal dynamics of SARS-CoV-2 epistatic network allows for early detection of emerging variants with altered phenotypes"
- collectResDrawPlotsSamp, collectResDrawPlotsDensest, collectResDrawPlotsInfer: scripts that generate plots with the analysis results for sampling-based p-values, densest subnetworks, and haplotype inference used in the paper
- HELEN data: secondary data generated for the paper
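The data structures passed between the subroutines above can be illustrated with toy data. This is only a sketch under stated assumptions, not the repository's implementation: it assumes M is a binary matrix with one row per sequence and one column per genomic position, and that E is an m-by-2 array of position indices. The edge list below is a hand-written placeholder for the output of constructEpisNetwork, not something computed from M. The last step counts the edges induced by a candidate set of positions, which is the quantity a densest-subgraph search of fixed size k maximizes.

```matlab
% Toy alignment: one reference and four aligned sequences (all made up).
ref  = 'ACGTAC';
seqs = ['ACGTAC'; 'TCGTAA'; 'TCGTAA'; 'ACCTAC'];

% Assumed layout of M: rows = sequences, columns = genomic positions,
% with 1 where a sequence differs from the reference.
M = double(seqs ~= repmat(ref, size(seqs, 1), 1));

% Placeholder edge list standing in for E = constructEpisNetwork(M, rho);
% assumed to be an m-by-2 array of position (column) indices.
E = [1 6; 1 3; 3 6; 2 4];

% Build a graph object with one node per column of M.
G = graph(E(:, 1), E(:, 2), [], size(M, 2));

% Count the edges induced by a candidate set of positions.
candidate = [1 3 6];
nInduced = numedges(subgraph(G, candidate));
fprintf('%d positions, %d induced edges\n', numel(candidate), nInduced);
```

Passing the node count to graph explicitly keeps positions without any edge (here, position 5) in the network, so node indices continue to match the columns of M.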
Genomic data and associated metadata analyzed in this study were obtained from GISAID [1]. SARS-CoV-2 epistatic networks and data derived from them can be downloaded from the following links, which contain data from the complete dataset, the first truncated dataset, and the second truncated dataset, respectively:
Mohebbi F, Zelikovsky A, Mangul S, Chowell G, Skums P. Early detection of emerging viral variants through analysis of community structure of coordinated substitution networks. Nat Commun. 2024 Apr 2;15(1):2838. doi: 10.1038/s41467-024-47304-6. PMID: 38565543; PMCID: PMC10987511.
https://pubmed.ncbi.nlm.nih.gov/38565543/
We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. The provided GISAID supplemental table includes a DOI where you can find all the associated authors and their originating laboratories.
Footnotes
1. Khare, S., et al. (2021) GISAID's Role in Pandemic Response. China CDC Weekly, 3(49): 1049-1051. doi: 10.46234/ccdcw2021.255. PMCID: PMC8668406