This is the repo for the paper "Improving generalisability of 3D binding affinity models in low data regimes". It includes the code to reproduce the reported experiments. The repo uses `molflux` and `physicsml` for the model code and `dvc` for structuring experiments, and it contains multiple pipelines, each corresponding to part of the benchmarks.
The code for running the experiments is under `src` and the stage definitions are under `pipelines`. Apart from the dataset processing, the model training pipelines share almost all of their code, which can be found under `src/low_sim_pdbbind/stages`. The code is split into 6 stages (see the sketch after this list):

- `fetch_data`: Fetches the data from a predefined location.
- `filtering`: Filters part of the data (for example, the data for a single protein).
- `higher_split`: The train+validation / test split (sometimes called the holdout split).
- `lower_split`: The train / validation split.
- `train`: Trains the models on all the split folds.
- `metrics`: Computes the metrics on all the split folds.
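Since the stages are wired together with `dvc`, you can inspect how a pipeline instantiates them. A minimal sketch, assuming a reasonably recent `dvc` and that each pipeline directory contains the `dvc.yaml` with these stage definitions:

```sh
# List the stages defined in a pipeline's dvc.yaml (run from the pipeline dir).
cd pipelines/pdb_processing/
dvc stage list

# Show the dependency graph connecting fetch_data -> ... -> metrics.
dvc dag
```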
The data is available on Zenodo (a download sketch follows this list). You can find:

- `pdbbind_dataset.csv`: The CSV file of PDB codes, affinity data, etc.
- `pdbbind_ccdc_structures.gz`: The prepared structures.
- `pdbbind_ccdc_ligands.gz`: The prepared ligands.
- `aln.txt`: The similarity data computed using Foldseek.
- `qm_egnn`: The pre-trained QM EGNN model.
- `diffusion_egnn`: The pre-trained diffusion EGNN model.
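A rough sketch of getting the data locally; the target directory is a hypothetical example, and we assume the `.gz` archives are gzipped tarballs that unpack into the directories referenced below (adjust to the actual Zenodo record):

```sh
# Hypothetical local layout; download the files from the Zenodo record first.
mkdir -p /data/zenodo && cd /data/zenodo

# Unpack the prepared structures and ligands (assuming gzipped tarballs).
tar -xzf pdbbind_ccdc_structures.gz
tar -xzf pdbbind_ccdc_ligands.gz
```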
There are 4 pipelines, which are defined in the `pipelines` directory:

- `pdb_processing`: Gets the prepared structures and constructs a dataset. It also filters for similarity and makes the splits.
- `ligand_only_2d`: Trains the models that use 2D ligand information only (both local and global).
- `ligand_pocket_3d`: Trains the models which use the 3D ligand and pocket information (the EGNN models).
- `durant_models`: Trains the models from Durant et al., 2023.
To set up the environment, you can run `./init_conda_venv.sh`. This will set up a conda env with the base dependencies. Next, you will need to install the optional dependencies for each pipeline by doing `pip install .[PIPELINE_NAME]`. To install the dependencies for all the pipelines, you can do `pip install .[all]` (see the sketch below).
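For example (the env name activated below is whatever `./init_conda_venv.sh` creates; check the script. Quoting the extras guards against shell globbing):

```sh
./init_conda_venv.sh             # sets up the conda env with the base dependencies
conda activate <env-name>        # use the env name defined by the script

# Install the optional dependencies, per pipeline or all at once.
pip install '.[ligand_only_2d]'
pip install '.[all]'
```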
To run the benchmarks, you need to download the data from Zenodo. To run the `pdb_processing` pipeline, you need to set the following paths in the `src/low_sim_pdbbind/pipelines/pdb_processing/config/main.yaml` file (a verification sketch follows this list):

- `pdbbind_dataset_path`: point to `pdbbind_dataset.csv`
- `structures_path`: point to the unzipped `pdbbind_ccdc_structures`
- `ligands_path`: point to the unzipped `pdbbind_ccdc_ligands`
- `path_to_foldseek_aln`: point to `aln.txt`
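A quick way to verify the result (a sketch; the printed values should be the absolute locations on your machine):

```sh
# All four keys should point at absolute paths (no ~).
grep -nE 'pdbbind_dataset_path|structures_path|ligands_path|path_to_foldseek_aln' \
  src/low_sim_pdbbind/pipelines/pdb_processing/config/main.yaml
```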
To run the `ligand_pocket_3d` pipeline with the pre-trained models, you need to specify the paths to the pre-trained models:

- For the QM pre-trained model, set `model_config.config.transfer_learning.pre_trained_model_path` to point to `qm_egnn` in the files `src/low_sim_pdbbind/pipelines/ligand_pocket_3d/config/train/*_pre_trained_qm.yaml`.
- For the diffusion pre-trained model, set `model_config.config.transfer_learning.pre_trained_model_path` to point to `diffusion_egnn` in the files `src/low_sim_pdbbind/pipelines/ligand_pocket_3d/config/train/*_pre_trained_diffusion.yaml`.

Make sure all the paths you specify are absolute paths (no `~`!). A bulk-edit sketch follows.
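One way to set these in bulk (a sketch, assuming the key appears as a literal `pre_trained_model_path:` line in each file; `/data/zenodo/...` are hypothetical locations, and `sed -i` here is GNU sed):

```sh
# Point the QM transfer-learning configs at the unpacked qm_egnn model.
for f in src/low_sim_pdbbind/pipelines/ligand_pocket_3d/config/train/*_pre_trained_qm.yaml; do
  sed -i 's|pre_trained_model_path:.*|pre_trained_model_path: /data/zenodo/qm_egnn|' "$f"
done

# Same for the diffusion configs.
for f in src/low_sim_pdbbind/pipelines/ligand_pocket_3d/config/train/*_pre_trained_diffusion.yaml; do
  sed -i 's|pre_trained_model_path:.*|pre_trained_model_path: /data/zenodo/diffusion_egnn|' "$f"
done
```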
Each pipeline has pre-defined configs which can be found in `src/low_sim_pdbbind/pipelines/*/configs`. The shared configs (for the dataset and the splits) can be found in `src/low_sim_pdbbind/configs`. The default settings for each pipeline can be run by executing `dvc exp run` in the directory of that pipeline, found under `pipelines/*/`.
To override the default params with any of the configs, you can follow the `dvc` convention and do `dvc exp run -S param_name=config_name`. Each pipeline also has an `hpo.sh` script which does a grid search over the available parameters.
cd pipelines/pdb_processing/
dvc exp run
The command above should create the following data in the directory:

- `data/data_processed.parquet`
- `data/dataset_high_split.parquet`

The data stored in `data/dataset_high_split.parquet` is the input data for the other pipelines (a quick check is sketched below).
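A quick sanity check before moving on to the training pipelines (run from the repo root):

```sh
# Both parquet files should exist after the run above.
ls -lh pipelines/pdb_processing/data/data_processed.parquet \
       pipelines/pdb_processing/data/dataset_high_split.parquet
```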
Make sure you have installed the pipeline-specific dependencies with `pip install .[ligand_only_2d]`. The path to the processed and split dataset (`pipelines/pdb_processing/data/dataset_high_split.parquet`) generated by the `pdb_processing` pipeline is already set in the dataset config `src/low_sim_pdbbind/config/dataset/pdb_bind_bespoke_ccdc.yaml`.
You can run a single ligand-only-2d experiment:
cd pipelines/ligand_only_2d/
dvc exp run -S dataset=pdb_bind_bespoke_ccdc -S filtering=uniprot_id -S filtering.0.value=O60885 -S higher_split.presets.columns=[by_bespoke_5_fold_0,by_bespoke_5_fold_1,by_bespoke_5_fold_2] -S featurisation=ECFPMD -S train=catboost_regressor
To run the full grid search:
cd pipelines/ligand_only_2d/
./hpo.sh
This will sequentially run all the `dvc` commands needed to iterate over the datasets, features, and models.
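Conceptually, the script is just nested loops over `dvc exp run -S` overrides, roughly like the sketch below (the real scripts live in each pipeline directory; the parameter values here are only illustrative, taken from the single-experiment example above):

```sh
# Illustrative grid over featurisations and models; hpo.sh covers the full grid.
for feat in ECFPMD; do
  for model in catboost_regressor; do
    dvc exp run -S featurisation="$feat" -S train="$model"
  done
done
```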
Once you have run the experiments, you can aggregate the results using the `results_summary.ipynb` notebook for each pipeline. Start by exporting the dvc experiments. In the pipeline directory (and using the hash of the commit from which the experiments were run), run:
dvc exp show --rev <commit hash> --csv > summary.csv
This will generate a CSV with all the experiments. You can then run the respective `results_summary.ipynb` notebook, which will produce an aggregated results summary (the full flow is sketched below).
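Putting the aggregation steps together for one pipeline (a sketch; `<commit hash>` is the commit from which the experiments were run, and the notebook is assumed to sit in the pipeline directory):

```sh
cd pipelines/ligand_only_2d/
dvc exp show --rev <commit hash> --csv > summary.csv

# Open and run the notebook to produce the aggregated summary.
jupyter notebook results_summary.ipynb
```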
| model | pipeline name | implementation source | reference |
|---|---|---|---|
| EGNN | ligand_pocket_3d | own | [Satorras et al., 2022] |
| EGNN-QM | ligand_pocket_3d | own | |
| EGNN-diffusion | ligand_pocket_3d | own | |
| RF-Score | durant_models | https://github.com/guydurant/toolboxsf | [Ballester and Mitchell, 2010] |
| OnionNet-2 | durant_models | reimplemented | [Wang et al., 2021] |
| single-protein | ligand_only_2d | own | |