Welcome to the GitHub repository for the following publication: Mapping the energetic and allosteric landscapes of protein binding domains (Faure AJ, Domingo J & Schmiedel JM et al., 2022)
Here you'll find an R package with all scripts to reproduce the figures and results from the computational analyses described in the paper.
- 1. Required Software
- 2. Installation Instructions
- 3. Required Data
- 4. Pipeline Modes
- 5. Pipeline Stages
To run the doubledeepms pipeline you will need the following software and associated packages:
- R >=v3.6.1 (bio3d, Biostrings, coin, Cairo, data.table, ggplot2, GGally, hexbin, plot3D, reshape2, RColorBrewer, ROCR, stringr, ggrepel)
The following software is optional:
- Python v3.8.6 (pandas, numpy, matplotlib, tensorflow, scikit-learn)
- DiMSum v1.2.8 (pipeline for pre-processing deep mutational scanning data i.e. FASTQ to fitness)
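Before installing, it can help to confirm that the R version and the required R packages listed above are available. The following is a minimal sketch using base R only; the variable names (`required_pkgs`, `missing_pkgs`, `r_version_ok`) are illustrative and not part of doubledeepms.

```r
# R packages listed under the required software above
required_pkgs <- c("bio3d", "Biostrings", "coin", "Cairo", "data.table",
                   "ggplot2", "GGally", "hexbin", "plot3D", "reshape2",
                   "RColorBrewer", "ROCR", "stringr", "ggrepel")

# TRUE for each package whose namespace can be loaded from the current library paths
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs to be installed
missing_pkgs <- required_pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}

# Check the minimum R version stated above (>= 3.6.1)
r_version_ok <- getRversion() >= "3.6.1"
if (!r_version_ok) message("R >= 3.6.1 is required; found ", getRversion())
```

Note that Biostrings is distributed through Bioconductor rather than CRAN, so it may need to be installed separately from the other packages.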
Open R and enter:
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("lehner-lab/doubledeepms")
# Load
library(doubledeepms)
# Help
?doubledeepms
Fitness scores, thermodynamic models, pre-processed data and required miscellaneous files should be downloaded from here and unzipped in your project directory (see the 'base_dir' option), i.e. the directory where output files will be written.
There are a number of options available for running the doubledeepms pipeline depending on user requirements.
Default pipeline functionality ('startStage' = 1) uses prefit thermodynamic models and fitness scores from DMS experiments (already processed with MoCHI and DiMSum respectively; see Required Data) to reproduce all figures in the publication.
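As a rough sketch of the default mode, the call below passes only the two options named in this README ('startStage' and 'base_dir'). The project directory path is hypothetical, and the guard around the call is an illustrative convenience rather than part of the package.

```r
# Hypothetical project directory containing the unzipped required data
base_dir <- file.path(path.expand("~"), "doubledeepms_project")

# Arguments for the default mode; the option names are those quoted in this README
pipeline_args <- list(startStage = 1, base_dir = base_dir)

# Run only if the package is installed and the project directory exists
if (requireNamespace("doubledeepms", quietly = TRUE) && dir.exists(base_dir)) {
  do.call(doubledeepms::doubledeepms, pipeline_args)
} else {
  message("Install doubledeepms and unzip the required data into: ", base_dir)
}
```

Consult `?doubledeepms` for the full list of options and their defaults.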
Pipeline stage 0 ('doubledeepms_fit_thermo_model') fits thermodynamic models to DMS data for the specified domains ('tmodel_protein'), using all available data or subsets of phenotypes/variants ('tmodel_subset'). Parallel computing using job arrays is recommended when running Monte Carlo simulations to determine confidence intervals of model-inferred free energies ('tmodel_job_number'). Note: this stage can be resource intensive (up to 48h with 30GB of RAM for GB1).
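One way to drive stage 0 from a job array is to read the array index from the scheduler and pass it on as 'tmodel_job_number'. The sketch below makes several assumptions that should be checked against `?doubledeepms`: that the scheduler is SLURM (which sets `SLURM_ARRAY_TASK_ID`), that 'tmodel_job_number' expects the index of the current job, and that "GB1" is an accepted value for 'tmodel_protein'. The project directory path is hypothetical.

```r
# Hypothetical project directory containing the unzipped required data
base_dir <- file.path(path.expand("~"), "doubledeepms_project")

# Array index set by the scheduler (assumption: SLURM); fall back to 1 if unset or invalid
job_number <- suppressWarnings(as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1")))
if (is.na(job_number)) job_number <- 1L

# Option names are those quoted in this README; the values are assumptions
stage0_args <- list(
  startStage = 0,
  base_dir = base_dir,
  tmodel_protein = "GB1",          # assumed value, see ?doubledeepms
  tmodel_job_number = job_number   # assumed to be the index of this job
)

# Run only if the package is installed and the project directory exists
if (requireNamespace("doubledeepms", quietly = TRUE) && dir.exists(base_dir)) {
  do.call(doubledeepms::doubledeepms, stage0_args)
} else {
  message("Stage 0 not run; arguments prepared for job ", job_number)
}
```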
Raw read processing is not handled by the doubledeepms pipeline. FastQ files (GSE184042) from paired-end sequencing of replicate deep mutational scanning (DMS) libraries before ('input') and after ('output') selection were processed using DiMSum (Faure and Schmiedel et al., 2020).
DiMSum command-line arguments and experimental design files required to obtain variant counts from FastQ files are available here.
The top-level function doubledeepms() is the recommended entry point to the pipeline and by default reproduces the figures and results from the computational analyses described in the following publication: Mapping the energetic and allosteric landscapes of protein binding domains (Faure AJ, Domingo J & Schmiedel JM et al., 2022). See Required Data for instructions on how to obtain all required data and miscellaneous files before running the pipeline.
This stage ('doubledeepms_fit_thermo_model') fits thermodynamic models to variant fitness data from ddPCA DMS experiments.
This stage ('doubledeepms_thermo_model_results') evaluates thermodynamic model results and performance, including comparison with in vitro measurements from the literature (related to Figure 2).
This stage ('doubledeepms_structure_metrics') annotates single mutant inferred free energies with PDB structure-derived metrics.
This stage ('doubledeepms_fitness_plots') plots fitness distributions and scatterplots (related to Figure 1).
This stage ('doubledeepms_fitness_heatmaps') plots single mutant fitness heatmaps (related to Figure 1).
This stage ('doubledeepms_free_energy_scatterplots') plots single mutant free energy scatterplots (related to Figure 3).
This stage ('doubledeepms_free_energy_heatmaps') plots single mutant free energy heatmaps (related to Figure 3).
This stage ('doubledeepms_protein_stability_plots') produces protein stability plots (related to Figure 4).
This stage ('doubledeepms_interface_mechanisms') produces binding free energy heatmaps for selected GRB2-SH3 residues (related to Figure 5).
This stage ('doubledeepms_allostery_plots') produces allostery plots (related to Figure 5 and Figure 6).
This stage ('doubledeepms_allostery_scatterplots') produces free energy scatterplots of major allosteric sites and mutations (related to Figure 5 and Figure 6).
This stage ('doubledeepms_downsampling_analysis') evaluates thermodynamic model results and performance after downsampling (related to Figure 2).
This stage ('doubledeepms_foldx_comparisons') compares inferred folding free energy changes to those predicted by FoldX.
This stage ('doubledeepms_polyphen2_comparisons') compares inferred folding free energy changes to PolyPhen2 predictions of functional effects.
This stage ('doubledeepms_3did_comparisons') tests the enrichment of allosteric mutations at interaction interfaces as annotated by the database of three-dimensional interacting domains (3did).
This stage ('doubledeepms_eve_comparisons') compares inferred folding free energy changes to EVE predictions of functional effects.
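The stage functions described above can be collected into a simple lookup to see which stages a given 'startStage' would cover. This sketch assumes that stages are numbered consecutively from 0 in the order listed here; the helper `stages_from()` is hypothetical and not part of the package.

```r
# Stage functions in the order they are described above
stage_functions <- c(
  "doubledeepms_fit_thermo_model",
  "doubledeepms_thermo_model_results",
  "doubledeepms_structure_metrics",
  "doubledeepms_fitness_plots",
  "doubledeepms_fitness_heatmaps",
  "doubledeepms_free_energy_scatterplots",
  "doubledeepms_free_energy_heatmaps",
  "doubledeepms_protein_stability_plots",
  "doubledeepms_interface_mechanisms",
  "doubledeepms_allostery_plots",
  "doubledeepms_allostery_scatterplots",
  "doubledeepms_downsampling_analysis",
  "doubledeepms_foldx_comparisons",
  "doubledeepms_polyphen2_comparisons",
  "doubledeepms_3did_comparisons",
  "doubledeepms_eve_comparisons"
)

# Assumption: stages are numbered consecutively from 0 in the listed order
names(stage_functions) <- seq_along(stage_functions) - 1

# Hypothetical helper: stage functions at or after a given start stage
stages_from <- function(start_stage) {
  stage_functions[as.integer(names(stage_functions)) >= start_stage]
}

# Stages covered by the default mode ('startStage' = 1) under this assumption
default_stages <- stages_from(1)
print(default_stages)
```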