This repository contains training and sampling code for the paper: Grammar Variational Autoencoder.
Install (CPU version) using pip install -r requirements.txt
For GPU compatibility, replace the fourth line in requirements.txt with: https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-0.12.1-cp27-none-linux_x86_64.whl
To create the molecule datasets, call:
python make_zinc_dataset_grammar.py
python make_zinc_dataset_str.py
The equation dataset can be downloaded here: grammar, string
To train the molecule models, call:
python train_zinc.py
% the grammar modelpython train_zinc.py --latent_dim=2 --epochs=50
% train a model with a 2D latent space and 50 epochspython train_zinc_str.py
python train_eq.py
% the grammar modelpython train_eq.py --latent_dim=2 --epochs=50
% train a model with a 2D latent space and 50 epochspython train_eq_str.py
The file molecule_vae.py can be used to encode and decode SMILES strings. For a demo run:
python encode_decode_zinc.py
The analogous file equation_vae.py can encode and decode equation strings. Run:
python encode_decode_eq.py
The Bayesian optimization experiments use sparse Gaussian processes coded in theano.
We use a modified version of theano with a few add ons, e.g. to compute the log determinant of a positive definite matrix in a numerically stable manner. The modified version of theano can be insalled by going to the folder Theano-master and typing
python setup.py install
The experiments with molecules require the rdkit library, which can be installed as described in http://www.rdkit.org/docs/Install.html.
The Bayesian optimization experiments can be replicated as follows:
1 - Generate the latent representations of molecules and equations. For this, go to the folders
molecule_optimization/latent_features_and_targets_grammar/
molecule_optimization/latent_features_and_targets_character/
equation_optimization/latent_features_and_targets_grammar/
equation_optimization/latent_features_and_targets_character/
and type
python generate_latent_features_and_targets.py
2 - Go to the folders
molecule_optimization/simulation1/grammar/
molecule_optimization/simulation1/character/
equation_optimization/simulation1/grammar/
equation_optimization/simulation1/character/
and type
nohup python run_bo.py &
Repeat this step for all the simulation folders (simulation2,...,simulation10). For speed, it is recommended to do this in a computer cluster in parallel.
2 - Extract the results by going to the folders
molecule_optimization/
equation_optimization/
and typing
python get_final_results.py
./get_average_test_RMSE_LL.sh