Analysis scripts for processing Abeta deep mutational scanning (DMS) data.
To run the abetadms pipeline you will need the following software and associated packages:
- R >=v3.5.2 (Biostrings, caTools, corpcor, cowplot, data.table, gdata, ggplot2, GGally, hexbin, lemon, optparse, parallel, pdist, plyr, ppcor, raster, reshape2, Rpdb, RColorBrewer)
The following packages are optional:
- DiMSum (pipeline for pre-processing deep mutational scanning data i.e. FASTQ to counts)
- DMS2structure (scripts used for epistasis and structure analysis of deep mutational scanning data in Schmiedel & Lehner, bioRxiv 2018)
Open R and enter:
# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("lehner-lab/abetadms")
# Load
library(abetadms)
# Help
?abetadms
DiMSum fitness estimates and required miscellaneous files should be downloaded from here and unzipped in your project directory (see 'base_dir' argument), i.e. the directory where output files will be written.
There are a number of options available for running the abetadms pipeline depending on user requirements.
Default pipeline functionality uses DiMSum fitness estimates (see 'Required Data'). Neither DiMSum nor DMS2structure packages are required for this default functionality.
Raw read processing is not handled by the abetadms pipeline. FASTQ files from paired-end sequencing of replicate deep mutational scanning (DMS) libraries before ('input') and after selection ('output') were processed using DiMSum (manuscript in prep.), an R package that wraps common biological sequence processing tools.
Pipeline stage 1 ('abetadms_preprocess_fitness') reformats DiMSum files and, optionally, re-estimates the fitness of double mutants using a Bayesian framework ('bayesian_double_fitness = T'). The latter is computationally intensive (~30 minutes on 10 cores) and is therefore not run by default.
Pipeline stage 9 ('abetadms_epistasis_analysis') performs epistasis calculations. This stage is computationally intensive (~30 minutes on 10 cores) and is therefore not run by default. Note: 'Required Data' (see above) already includes precomputed results of the epistasis analysis. To force re-execution of this stage, set 'rerun_epistasis = T'. Additionally, the correct path to your local copy of the DMS2structure repository must be specified with 'DMS2structure_path = MY_LOCAL_PATH'.
Pipeline stage 10 ('abetadms_secondary_structure_predictions') performs secondary structure predictions. Secondary structure predictions are computationally intensive and are therefore not re-run by default. Note: 'Required Data' (see above) already includes precomputed results of the structure analyses. To force re-execution set 'rerun_structure = T'. Additionally, the correct path to your local copy of the DMS2structure repository must be specified with 'DMS2structure_path = MY_LOCAL_PATH'.
The top-level function abetadms() is the recommended entry point to the pipeline. See section on "Required Data" above for instructions on how to obtain all required data and miscellaneous files before running the pipeline.
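As a rough illustration of a default run, the sketch below collects the options named in this README and passes them to abetadms(). It assumes these options are accepted as arguments of abetadms() under exactly these names; "MY_PROJECT_DIR" is a placeholder. Check ?abetadms for the actual signature before relying on it.

```r
# Options named in this README, with the documented defaults restated explicitly.
# ASSUMPTION: abetadms() accepts them under these names (see ?abetadms).
pipeline_args <- list(
  base_dir = "MY_PROJECT_DIR",       # placeholder: directory holding the unzipped data
  bayesian_double_fitness = FALSE,   # stage 1 Bayesian re-estimation (slow, off by default)
  rerun_epistasis = FALSE,           # stage 9 uses precomputed results by default
  rerun_structure = FALSE            # stage 10 uses precomputed results by default
)

# Run only if the package is installed and the project directory exists.
if (requireNamespace("abetadms", quietly = TRUE) &&
    dir.exists(pipeline_args$base_dir)) {
  do.call(abetadms::abetadms, pipeline_args)
} else {
  message("abetadms not installed or base_dir not found; pipeline not run")
}
```

To re-run the epistasis or secondary structure stages, the same list would additionally need 'rerun_epistasis = TRUE' or 'rerun_structure = TRUE' together with 'DMS2structure_path' set to your local copy of the DMS2structure repository.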
This stage ('abetadms_preprocess_fitness') reformats DiMSum files and, optionally, re-estimates the fitness of double mutants using a Bayesian framework ('bayesian_double_fitness = T'). The latter is computationally intensive (~30 minutes on 10 cores) and is therefore not run by default.
This stage ('abetadms_quality_control') produces quality control plots of fitness estimates.
This stage ('abetadms_combine_fitness') performs normalisation of fitness estimates (based on silent mutants), fitness distribution plots and position-wise fitness plots.
This stage ('abetadms_aa_properties_mutant_effects') performs principal component analysis (PCA) of a curated collection of numerical indices representing various physicochemical and biochemical properties of amino acids (AAs). AA property feature values represent the difference between the WT and mutant PC scores.
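The idea behind these feature values can be sketched with base R alone. This is a toy illustration, not the code used by 'abetadms_aa_properties_mutant_effects': the property table is filled with random numbers in place of the curated indices, the helper name aa_feature() is hypothetical, and the sign convention (mutant minus WT) is an assumption.

```r
# Toy stand-in for the curated AA property table: 20 amino acids x 8 made-up indices.
set.seed(1)
aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
n_props <- 8  # hypothetical number of property indices
props <- matrix(
  rnorm(length(aa) * n_props),
  nrow = length(aa),
  dimnames = list(aa, paste0("prop", seq_len(n_props)))
)

# PCA on centred and scaled properties; pca$x holds one row of PC scores per amino acid.
pca <- prcomp(props, center = TRUE, scale. = TRUE)
pc_scores <- pca$x

# Feature values for a substitution: difference in PC scores between mutant and WT residue.
# ASSUMPTION: mutant minus WT; the pipeline may use the opposite sign.
aa_feature <- function(wt, mut, scores = pc_scores) {
  scores[mut, ] - scores[wt, ]
}

# Example: feature vector for a hypothetical A -> V substitution (one value per PC).
a_to_v <- aa_feature("A", "V")
```

Under this convention a substitution to the same residue gives all zeros, and reversing a substitution flips the sign of every feature.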
This stage ('abetadms_agg_tools_mutant_effects') calculates aggregation / disorder algorithm feature values for single and double mutant variants (similar to stage 4).
This stage ('abetadms_single_mutant_heatmaps') produces single mutant heatmaps of fitness effects.
This stage ('abetadms_human_disease_mutations') tests whether human disease mutations have biased fitness estimates.
This stage ('abetadms_fitness_model_summary') produces plots of results from simple linear regression models to predict variant fitness.
This stage ('abetadms_epistasis_analysis') performs epistasis calculations. This stage is computationally intensive (~30 minutes on 10 cores) and is therefore not run by default. To force re-execution of this stage, set 'rerun_epistasis = T'. Additionally, the correct path to your local copy of the DMS2structure repository must be specified with 'DMS2structure_path = MY_LOCAL_PATH'.
This stage ('abetadms_secondary_structure_predictions') performs secondary structure predictions and produces combined summary plots. Secondary structure predictions are computationally intensive and are therefore not re-run by default. To force re-execution of secondary structure predictions, set 'rerun_structure = T'. Additionally, the correct path to your local copy of the DMS2structure repository must be specified with 'DMS2structure_path = MY_LOCAL_PATH'.
This stage ('abetadms_PWI_heatmaps') plots pair-wise interaction (PWI) score heatmaps.