We recommend creating a conda environment to manage the dependencies (an Anaconda installation is assumed). First, clone this repository:
```
git clone https://github.com/gitter-lab/active-learning-drug-discovery.git
cd active-learning-drug-discovery
```
Set up the `active_learning_dd` conda environment using the `conda_env.yml` file:

```
conda env create -f conda_env.yml
conda activate active_learning_dd
```
If you do not want GPU support, you can replace `conda_env.yml` with `conda_cpu_env.yml`.
Finally, install `active_learning_dd` with `pip`:

```
pip install -e .
```
Now check that the installation is working correctly by running the sample data test:

```
cd chtc_runners
python sample_data_runner.py \
    --pipeline_params_json_file=../param_configs/sample_data_config.json \
    --hyperparams_json_file=../param_configs/experiment_PstP_hyperparams/sampled_hyparams/ClusterBasedWCSelector_609.json \
    --iter_max=5 \
    --no-precompute_dissimilarity_matrix \
    --initial_dataset_file=../datasets/sample_data/training_data/iter_0.csv.gz
```
You should see the following final line of output:

```
Finished testing sample dataset. Verified that hashed selection matches stored hash.
```
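Before launching longer runs, it can help to peek at what the pipeline and hyperparameter configurations contain. The sketch below simply loads the two JSON files passed to the test command above and prints their top-level keys; it assumes each file holds a single JSON object and is run from `chtc_runners/`, as in the test.

```python
import json

# Paths as used in the sample data test above (relative to chtc_runners/).
pipeline_params_path = "../param_configs/sample_data_config.json"
hyperparams_path = (
    "../param_configs/experiment_PstP_hyperparams/"
    "sampled_hyparams/ClusterBasedWCSelector_609.json"
)

# Print the top-level keys of each config to see which pipeline settings
# and hyperparameters a run will use.
for path in (pipeline_params_path, hyperparams_path):
    with open(path) as f:
        config = json.load(f)
    print(path)
    for key in config:
        print("  ", key)
```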
The datasets used in this study are the PriA-SSB target, 107 PubChem BioAssay targets, and the PstP target.
The datasets will be uploaded to Zenodo in the near future.
The repository also contains a small dataset for testing: `datasets/sample_data/`.
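The sample data files are compressed CSVs that pandas can read directly. A minimal sketch for a quick look, assuming it is run from the repository root (the column names are whatever the dataset defines, not assumed here):

```python
import pandas as pd

# Initial training data used by the sample data test above.
df = pd.read_csv("datasets/sample_data/training_data/iter_0.csv.gz")

print(df.shape)             # number of compounds and columns
print(df.columns.tolist())  # column names as defined by the dataset
print(df.head())
```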
The `active_learning_dd` subdirectory contains the main codebase for the iterative batched screening components. Consult the README in that subdirectory for details.
The `param_configs/` subdirectory contains JSON config files for strategies and experiments used in the thesis document. Consult the README in that subdirectory for details.
This subdirectory contains Jupyter notebooks that preprocess the datasets, debug methods, analyze the results, and produce result images.
The `chtc_runners/` subdirectory contains runner scripts for the experiments in the thesis document. `chtc_runners/simulation_runner.py` can be used as a starting template for your own runner script, and `chtc_runners/simulation_utils.py` contains helper functions for pre- and post-processing iteration selections for retrospective experiments. Consult the README in that subdirectory for details.
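If you write your own runner, the command-line interface of the sample data test above is a reasonable template. The following is a minimal, hypothetical skeleton, not code from this repository: the argument names mirror the test invocation, and the loop body is a placeholder for whatever selection logic you plug in.

```python
import argparse
import json


def main():
    parser = argparse.ArgumentParser(
        description="Illustrative skeleton for a custom retrospective screening runner."
    )
    parser.add_argument("--pipeline_params_json_file", required=True)
    parser.add_argument("--hyperparams_json_file", required=True)
    parser.add_argument("--iter_max", type=int, default=5)
    parser.add_argument("--initial_dataset_file", required=True)
    parser.add_argument("--precompute_dissimilarity_matrix",
                        dest="precompute_dissimilarity_matrix", action="store_true")
    parser.add_argument("--no-precompute_dissimilarity_matrix",
                        dest="precompute_dissimilarity_matrix", action="store_false")
    parser.set_defaults(precompute_dissimilarity_matrix=False)
    args = parser.parse_args()

    with open(args.pipeline_params_json_file) as f:
        pipeline_params = json.load(f)
    with open(args.hyperparams_json_file) as f:
        hyperparams = json.load(f)

    for iter_num in range(args.iter_max):
        # Placeholder: call into active_learning_dd here to select the next
        # batch, score it, and append it to the training data.
        print(f"Iteration {iter_num}: would select the next batch using "
              f"{len(hyperparams)} hyperparameter entries.")


if __name__ == "__main__":
    main()
```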
The following strategies are currently implemented in `active_learning_dd/next_batch_selector/` (see the thesis document and the hyperparameter examples in `param_configs/`):
- `ClusterBasedWeightSelector` (CBWS): assigns exploitation-exploration weights to every cluster, splits the budget between exploitation and exploration, then selects compounds from the most exploitable clusters, followed by the most explorable clusters.
- `ClusterBasedRandom`: randomly samples a cluster, then randomly samples compounds from within it.
- `InstanceBasedRandom`: randomly samples compounds from the pool.
- `ClusterBasedDissimilar`: samples clusters dissimilarly according to a dissimilarity measure, which is fingerprint based by default.
- `InstanceBasedDissimilar`: samples compounds dissimilarly from the pool.
- `MABSelector`: an Upper Confidence Bound (UCB) style approach from Multi-Armed Bandits (MAB). Assigns every cluster an upper-bound reward estimate that combines an exploitation term and an exploration term, then samples the clusters with the highest estimates (a toy scoring sketch follows this list).
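To make the exploitation-exploration trade-off behind a UCB-style selector concrete, here is a toy scoring sketch. It is not the repository's implementation: the per-cluster statistics and the exploration constant `c` are made up for illustration.

```python
import math

# Toy per-cluster statistics: (number of compounds tested, number of hits).
cluster_stats = {
    "cluster_a": (20, 6),
    "cluster_b": (5, 1),
    "cluster_c": (1, 0),
}

total_tested = sum(n for n, _ in cluster_stats.values())
c = 1.0  # exploration constant; larger values favor rarely tested clusters


def ucb_score(n_tested, n_hits):
    exploitation = n_hits / n_tested  # observed hit rate of the cluster
    exploration = c * math.sqrt(math.log(total_tested) / n_tested)
    return exploitation + exploration


# Rank clusters by their upper-bound reward estimate, highest first.
# Rarely tested clusters get a large exploration bonus, so they can
# outrank clusters with a higher observed hit rate.
ranked = sorted(cluster_stats, key=lambda k: ucb_score(*cluster_stats[k]), reverse=True)
print(ranked)
```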