GitHub repository accompanying the paper Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes Rodriguez-Rivas, Giancarlo Croce, Maureen Muscat, Martin Weigt, PNAS January 25, 2022 119 (4) e2113118119; https://doi.org/10.1073/pnas.2113118119.
Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes
For each site of the SARS-CoV-2 Wuhan-Hu-1 proteome (Accession NC_045512) included in a protein domain, we introduce a Direct Coupling Analysis (DCA) mutability score to predict mutable and constrained sites.
Sites of the SARS-Cov-2 Wuhan-Hu-1 proteome with high DCA-Mutability Scores are more likely to mutate. We validate our mutability predictions with the mutations observed in SARS-CoV-2 proteomes deposited in the GISAID database.
Run the Jupyter-notebook dca_sarscov2.ipynb
on Google Colab at this link to reproduce key results from the paper and guide the data analysis. Colab allows you to execute the Python code through your browser (no need to clone the GitHub repository on your local machine).
The code for training sequence-based models to predict mutability scores is available at https://github.com/juan-rodriguez-rivas/covmut.
The results of our analysis are also available in the ./data
folder.
The data structure is:
protein domain position_protein position_domain aa_Wuhan-Hu-1 mutability_score(IND) mutability_score(DCA) observed_mut_May2021 observed_mut_Dec2020 observed_mut_Jul2020
Spike bCoV_S1_RBD 349 1.0 S -1.3818 -1.2046 0.0 0.0 0.0
Spike bCoV_S1_RBD 350 2.0 V -1.9788 -1.1667 3.0 0.0 0.0
Spike bCoV_S1_RBD 351 3.0 Y -1.8017 -1.2678 8.0 1.0 1.0
Spike bCoV_S1_RBD 352 4.0 A -1.0140 -1.1347 23.0 3.0 0.0
We combine our DCA-Mutability Score with the IEDB site response frequency (IEDB-Response Frequency). This allows us to identify mutable (high DCA-Mutability Score) and immunologically relevant sites (high IEDB - Response Frequency). Amino acid substitutions at high-RF sites have a higher risk of inducing immune escape as many positively responding B and T cell epitopes are modified.
A key advantage of our data-driven modeling approach is the possibility to obtain predictions for all the protein domains in the SARS-CoV-2 proteome. Run the Jupyter-notebook to extend the DCA predictions to all 39 protein domains covering 81% of the entire proteome (8037 out of 9748 positions), and combine them with immunological IEDB data.