Initial commit of the open-source repo
0 parents
commit 5a4dc28
Showing 48 changed files with 18,437 additions and 0 deletions.
@@ -0,0 +1,22 @@
MIT License

Copyright (c) 2018-present The OpenGNN Authors.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017-present The OpenNMT Authors.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,142 @@
# OpenGNN

OpenGNN is a machine learning library for learning over graph-structured data. It was built with generality in mind and supports tasks such as:

* graph regression
* graph-to-sequence mapping

It supports various graph encoders including GGNNs, GCNs, SequenceGNNs and other variations of [neural graph message passing](https://arxiv.org/pdf/1704.01212.pdf).

This library's design and usage patterns are inspired by [OpenNMT](https://github.com/OpenNMT/OpenNMT-tf), and it uses TensorFlow's recent [Dataset](https://www.tensorflow.org/programmers_guide/datasets) and [Estimator](https://www.tensorflow.org/programmers_guide/estimators) APIs.

## Installation

OpenGNN requires:

* Python (>= 3.5)
* TensorFlow (>= 1.10, < 2.0)

To install the library as well as the command-line entry points, run

``` pip install -e .```

## Getting Started

To experiment with the library, you can use one of the datasets provided in the [data](/data) folder.
For example, to experiment with the chemical dataset, first install the `rdkit` library, which
can be obtained by running `conda install -c rdkit rdkit`.
Then, in the [data/chem](/data/chem) folder, run `python get_data.py` to download the dataset.

After getting the data, generate node and edge vocabularies for it using
```bash
ognn-build-vocab --field_name node_labels --save_vocab node.vocab \
    molecules_graphs_train.jsonl
ognn-build-vocab --no_pad_token --field_name edges --string_index 0 --save_vocab edge.vocab \
    molecules_graphs_train.jsonl
```
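
For orientation, each line of `molecules_graphs_train.jsonl` written by `get_data.py` is a JSON object with a `node_labels` list and an `edges` list of `[bond_type, begin_atom, end_atom]` triples. The following is a minimal sketch of what collecting a vocabulary from such lines could look like. It is not the implementation behind `ognn-build-vocab`; the `build_vocab` helper, its ordering rule, and the two sample records are assumptions made for illustration.

```python
import json
from collections import Counter


def build_vocab(lines, field_name, string_index=None):
    """Collect the distinct tokens of one field, most frequent first.

    Hypothetical helper for illustration; not part of OpenGNN.
    """
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        for item in record[field_name]:
            # For edges, the token is one element of the triple (the bond type).
            token = item if string_index is None else item[string_index]
            counts[token] += 1
    # Sort by descending count, then alphabetically, for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ordered]


# Two made-up records in the format written by get_data.py.
sample = [
    json.dumps({"edges": [["SINGLE", 0, 1], ["SINGLE", 0, 2]],
                "node_labels": ["O", "H", "H"]}),
    json.dumps({"edges": [["DOUBLE", 0, 1]],
                "node_labels": ["O", "O"]}),
]

node_vocab = build_vocab(sample, "node_labels")
edge_vocab = build_vocab(sample, "edges", string_index=0)
print(node_vocab)
print(edge_vocab)
```

Here `string_index=0` plays the same role as the `--string_index 0` flag above: it selects the bond-type string out of each edge triple.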

### Command Line

The main entry point to the library is the `ognn-main` command:

```bash
ognn-main <run_type> --model_type <model> --config <config_file.yml>
```

Currently there are two run types: `train_and_eval` and `infer`.

For example, to train a model on the previously extracted chemical data
(again inside [data/chem](/data/chem)) using a predefined model in the
catalog, run

```bash
ognn-main train_and_eval --model_type chemModel --config config.yml
```

You can also define your own model in a custom Python script with a `model` function.
For example, we can train using a custom model in `model.py` with

```bash
ognn-main train_and_eval --model model.py --config config.yml
```

While the training script doesn't log the training to the standard output,
we can monitor training by using TensorBoard on the model directory defined in
[data/chem/config.yml](data/chem/config.yml).

After training, we can perform inference on the validation file by running

```bash
ognn-main infer --model_type chemModel --config config.yml \
    --features_file molecules_graphs_valid.jsonl \
    --prediction_file molecules_predicted_valid.jsonl
```

Examples of other config files can be found in the [data](/data) folder.

### Library

The library can also be easily integrated into your own code.
The following example shows how to create a GGNN Encoder to encode a batch of random graphs.

```python
import tensorflow as tf
import opengnn as ognn

tf.enable_eager_execution()

# build a batch of graphs with random initial features
edges = tf.SparseTensor(
    indices=[
        [0, 0, 0, 1], [0, 0, 1, 2],
        [1, 0, 0, 0],
        [2, 0, 1, 0], [2, 0, 2, 1], [2, 0, 3, 2], [2, 0, 4, 3]],
    values=[1, 1, 1, 1, 1, 1, 1],
    dense_shape=[3, 1, 5, 5])
node_features = tf.random_uniform((3, 5, 256))
graph_sizes = [3, 1, 5]

encoder = ognn.encoders.GGNNEncoder(1, 256)
outputs, state = encoder(
    edges,
    node_features,
    graph_sizes)

print(outputs)
```

Graphs are represented by a sparse adjacency matrix with dimensionality
`num_edge_types x num_nodes x num_nodes` and an initial distributed representation for each node.

Similarly to sequences, when batching we need to pad the graphs to the maximum number of nodes in a graph.
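
As a rough illustration of this padding, the sketch below turns a list of per-graph edge lists into the `indices`, `dense_shape` and `graph_sizes` values used in the example above. It is plain Python and only an assumption about how such a batch could be assembled; the `batch_graphs` helper is hypothetical and not part of the OpenGNN API.

```python
def batch_graphs(graphs, num_edge_types):
    """Describe a padded batch of graphs as sparse adjacency indices.

    Each graph is a pair (num_nodes, edges), where edges is a list of
    (edge_type, row, col) index triples. Hypothetical helper for illustration.
    """
    graph_sizes = [num_nodes for num_nodes, _ in graphs]
    # Every graph is padded to the size of the largest graph in the batch.
    max_nodes = max(graph_sizes)
    indices = []
    for batch_idx, (_, edges) in enumerate(graphs):
        for edge_type, row, col in edges:
            indices.append([batch_idx, edge_type, row, col])
    # Keep the indices in row-major order, as sparse tensors usually expect.
    indices.sort()
    dense_shape = [len(graphs), num_edge_types, max_nodes, max_nodes]
    return indices, dense_shape, graph_sizes


# The three graphs from the example above: 3, 1 and 5 nodes, one edge type.
graphs = [
    (3, [(0, 0, 1), (0, 1, 2)]),
    (1, [(0, 0, 0)]),
    (5, [(0, 1, 0), (0, 2, 1), (0, 3, 2), (0, 4, 3)]),
]
indices, dense_shape, graph_sizes = batch_graphs(graphs, num_edge_types=1)
print(indices)
print(dense_shape)
print(graph_sizes)
```

The node features would be padded the same way, to a `batch_size x max_nodes x feature_size` array, with `graph_sizes` recording how many rows of each graph are real nodes.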

## Acknowledgments
The design of the library and implementations are based on
* [OpenNMT-tf](https://github.com/OpenNMT/OpenNMT-tf)
* [Gated Graph Neural Networks](https://github.com/Microsoft/gated-graph-neural-network-samples)

Since most of the code adapted from OpenNMT-tf is spread across multiple files, the OpenNMT-tf license
is located in the [base folder](/OPENNMT.LICENSE) rather than in the headers of the files.

## Reference

If you use this library in your own research, please cite

```
@inproceedings{
    pfernandes2018structsumm,
    title="Structured Neural Summarization",
    author={Patrick Fernandes and Miltiadis Allamanis and Marc Brockschmidt},
    booktitle={Proceedings of the 7th International Conference on Learning Representations (ICLR)},
    year={2019},
    url={https://arxiv.org/abs/1811.01824},
}
```
@@ -0,0 +1,28 @@
# Example parameters (this does not cover every parameter)

model_dir: model_dir

data:
  train_graphs_file: molecules_graphs_train.jsonl
  train_labels_file: molecules_labels_train.jsonl

  eval_graphs_file: molecules_graphs_valid.jsonl
  eval_labels_file: molecules_labels_valid.jsonl

  node_vocabulary: node.vocab
  edge_vocabulary: edge.vocab


params:
  learning_rate: 0.001
  param_init: 0.1
  clip_gradients: 1.
  maximum_iterations: 250

train:
  batch_size: 64
  bucket_width: 1
  train_steps: 1000000
  maximum_features_size: 200
  maximum_labels_size: 50
  sample_buffer_size: 10000
@@ -0,0 +1,104 @@
import os
from rdkit import Chem
import glob
import json
import numpy as np

if not os.path.exists('data'):
    os.mkdir('data')
    print('made directory ./data/')

download_path = os.path.join('data', 'dsgdb9nsd.xyz.tar.bz2')
if not os.path.exists(download_path):
    print('downloading data to %s ...' % download_path)
    source = 'https://ndownloader.figshare.com/files/3195389'
    os.system('wget -O %s %s' % (download_path, source))
    print('finished downloading')

unzip_path = os.path.join('data', 'qm9_raw')
if not os.path.exists(unzip_path):
    print('extracting data to %s ...' % unzip_path)
    os.mkdir(unzip_path)
    os.system('tar xvjf %s -C %s' % (download_path, unzip_path))
    print('finished extracting')


def preprocess():
    index_of_mu = 4

    def read_xyz(file_path):
        with open(file_path, 'r') as f:
            lines = f.readlines()
            smiles = lines[-2].split('\t')[0]
            properties = lines[1].split('\t')
            mu = float(properties[index_of_mu])
        return {'smiles': smiles, 'mu': mu}

    print('loading train/validation split')
    with open('valid_idx.json', 'r') as f:
        valid_idx = json.load(f)['valid_idxs']
    valid_files = [os.path.join(unzip_path, 'dsgdb9nsd_%s.xyz' % i)
                   for i in valid_idx]

    print('reading data...')
    raw_data = {'train': [], 'valid': []}
    all_files = glob.glob(os.path.join(unzip_path, '*.xyz'))
    for file_idx, file_path in enumerate(all_files):
        if file_idx % 100 == 0:
            print('%.1f %% \r' %
                  (file_idx / float(len(all_files)) * 100), end=""),
        if file_path not in valid_files:
            raw_data['train'].append(read_xyz(file_path))
        else:
            raw_data['valid'].append(read_xyz(file_path))
    all_mu = [mol['mu'] for mol in raw_data['train']]
    mean_mu = np.mean(all_mu)
    std_mu = np.std(all_mu)

    def normalize_mu(mu):
        return (mu - mean_mu) / std_mu

    def onehot(idx, len):
        z = [0 for _ in range(len)]
        z[idx] = 1
        return z

    bond_dict = {'SINGLE': 0, 'DOUBLE': 1, 'TRIPLE': 2, "AROMATIC": 3}

    def to_graph(smiles):
        mol = Chem.MolFromSmiles(smiles)
        mol = Chem.AddHs(mol)
        edges = []
        nodes = []
        for bond in mol.GetBonds():
            edges.append((str(bond.GetBondType()),
                          bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
        for atom in mol.GetAtoms():
            nodes.append(atom.GetSymbol())
        return nodes, edges

    print('parsing smiles as graphs...')
    processed_graphs = {'train': [], 'valid': []}
    processed_labels = {'train': [], 'valid': []}
    for section in ['train', 'valid']:
        for i, (smiles, mu) in enumerate([(mol['smiles'], mol['mu']) for mol in raw_data[section]]):
            if i % 100 == 0:
                print('%s: %.1f %% \r' %
                      (section, 100*i/float(len(raw_data[section]))), end="")
            nodes, edges = to_graph(smiles)
            processed_graphs[section].append({
                'edges': edges,
                'node_labels': nodes
            })
            processed_labels[section].append([normalize_mu(mu)])

        print('%s: 100 %% ' % (section))
        with open('molecules_graphs_%s.jsonl' % section, 'w') as f:
            for graph in processed_graphs[section]:
                f.write(json.dumps(graph) + "\n")
        with open('molecules_labels_%s.jsonl' % section, 'w') as f:
            for label in processed_labels[section]:
                f.write(json.dumps(label) + "\n")


preprocess()
@@ -0,0 +1,15 @@
import opengnn as ognn


def model():
    return ognn.models.GraphRegressor(
        source_inputter=ognn.inputters.GraphEmbedder(
            edge_vocabulary_file_key="edge_vocabulary",
            node_embedder=ognn.inputters.TokenEmbedder(
                vocabulary_file_key="node_vocabulary",
                embedding_size=64)),
        target_inputter=ognn.inputters.FeaturesInputter(),
        encoder=ognn.encoders.GGNNEncoder(
            num_timesteps=[2, 2],
            node_feature_size=64),
        name="chemModelCustom")