Initial commit of the open-source repo
CoderPat committed Jan 27, 2019
0 parents commit 5a4dc28
Showing 48 changed files with 18,437 additions and 0 deletions.
22 changes: 22 additions & 0 deletions LICENSE
@@ -0,0 +1,22 @@
MIT License

Copyright (c) 2018-present The OpenGNN Authors.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

21 changes: 21 additions & 0 deletions OPENNMT.LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017-present The OpenNMT Authors.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
142 changes: 142 additions & 0 deletions README.md
@@ -0,0 +1,142 @@
# OpenGNN

OpenGNN is a machine learning library for learning over graph-structured data. It was built with generality in mind and supports tasks such as:

* graph regression
* graph-to-sequence mapping

It supports various graph encoders including GGNNs, GCNs, SequenceGNNs and other variations of [neural graph message passing](https://arxiv.org/pdf/1704.01212.pdf).

This library's design and usage patterns are inspired by [OpenNMT](https://github.com/OpenNMT/OpenNMT-tf), and it uses the recent [Dataset](https://www.tensorflow.org/programmers_guide/datasets) and [Estimator](https://www.tensorflow.org/programmers_guide/estimators) APIs.

## Installation

OpenGNN requires

* Python (>= 3.5)
* TensorFlow (>= 1.10, < 2.0)

To install the library as well as the command-line entry points, run

```bash
pip install -e .
```

## Getting Started

To experiment with the library, you can use one of the datasets provided in the [data](/data) folder.
For example, to experiment with the chemical dataset, first install the `rdkit` library that
can be obtained by running `conda install -c rdkit rdkit`.
Then, in the [data/chem](/data/chem) folder, run `python get_data.py` to download the dataset.
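
As a rough illustration of what `get_data.py` writes, each line of `molecules_graphs_<split>.jsonl` is one JSON object with an `edges` list of `(bond_type, source, target)` entries and a `node_labels` list of atom symbols, and each line of `molecules_labels_<split>.jsonl` is a one-element list holding the normalized `mu` value. The molecule and the label value below are made up for the sketch.

```python
import json

# One graph record in the shape written by data/chem/get_data.py.
# The molecule itself (a methane-like graph) is illustrative only.
graph = {
    "edges": [["SINGLE", 0, 1], ["SINGLE", 0, 2],
              ["SINGLE", 0, 3], ["SINGLE", 0, 4]],
    "node_labels": ["C", "H", "H", "H", "H"],
}
# Placeholder label; the script stores the normalized `mu` value here.
label = [0.0]

# Each record becomes exactly one line of the corresponding .jsonl file.
graph_line = json.dumps(graph)
label_line = json.dumps(label)
print(graph_line)
print(label_line)
```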

After getting the data, generate the node and edge vocabularies for it using
```bash
ognn-build-vocab --field_name node_labels --save_vocab node.vocab \
molecules_graphs_train.jsonl
ognn-build-vocab --no_pad_token --field_name edges --string_index 0 --save_vocab edge.vocab \
molecules_graphs_train.jsonl
```
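
To give an idea of what these two commands extract, here is a minimal sketch of collecting tokens from one field of the graph records. It is not the actual `ognn-build-vocab` implementation: the `build_vocab` helper, its ordering by frequency, and the absence of special tokens are all assumptions made for illustration. The `string_index` argument mirrors the `--string_index 0` flag, which selects the bond-type string from each edge entry.

```python
from collections import Counter


def build_vocab(records, field_name, string_index=None):
    """Collect the distinct tokens of one field, most frequent first.

    Hypothetical helper; the real command may order tokens differently
    and may add special tokens such as padding.
    """
    counts = Counter()
    for record in records:
        for item in record[field_name]:
            # For structured entries such as edges, pick the string component.
            token = item if string_index is None else item[string_index]
            counts[token] += 1
    # Sort by descending count, then alphabetically, for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ordered]


# Two toy records in the JSONL shape produced by get_data.py.
records = [
    {"node_labels": ["C", "H", "H", "O"],
     "edges": [["SINGLE", 0, 1], ["SINGLE", 0, 2], ["DOUBLE", 0, 3]]},
    {"node_labels": ["N", "H"],
     "edges": [["SINGLE", 0, 1]]},
]

node_vocab = build_vocab(records, "node_labels")
edge_vocab = build_vocab(records, "edges", string_index=0)
print(node_vocab)
print(edge_vocab)
```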

### Command Line

The main entry point to the library is the `ognn-main` command

```bash
ognn-main <run_type> --model_type <model> --config <config_file.yml>
```

Currently there are two run types: `train_and_eval` and `infer`.

For example, to train a model on the previously extracted chemical data
(again inside [data/chem](/data/chem)) using a predefined model in the
catalog, run

```bash
ognn-main train_and_eval --model_type chemModel --config config.yml
```

You can also define your own model in a custom python script with a `model` function.
For example, we can train with the custom model in `model.py` using

```bash
ognn-main train_and_eval --model model.py --config config.yml
```

While the training script doesn't log the training to the standard output,
we can monitor training by using tensorboard on the model directory defined in
[data/chem/config.yml](data/chem/config.yml).

After training, we can perform inference on the validation file by running

```bash
ognn-main infer --model_type chemModel --config config.yml \
--features_file molecules_graphs_valid.jsonl \
--prediction_file molecules_predicted_valid.jsonl
```
```
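
Once predictions are written, they can be compared against the validation labels. The sketch below computes a mean absolute error and assumes that each line of the prediction file is a JSON list with a single number, mirroring the label files written by `get_data.py`; the actual prediction format may differ, so check the file before relying on this. The `mean_absolute_error` helper and the sample values are hypothetical.

```python
import json


def mean_absolute_error(label_lines, prediction_lines):
    """Average |label - prediction| over paired JSONL lines.

    Assumes every line is a JSON list holding one number.
    """
    total, count = 0.0, 0
    for label_line, prediction_line in zip(label_lines, prediction_lines):
        label = json.loads(label_line)[0]
        prediction = json.loads(prediction_line)[0]
        total += abs(label - prediction)
        count += 1
    return total / count


# In practice these lines would be read from the labels and predictions files.
labels = ["[0.5]", "[-1.0]", "[2.0]"]
predictions = ["[0.25]", "[-0.5]", "[2.0]"]
mae = mean_absolute_error(labels, predictions)
print(mae)
```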


Examples of other config files can be found in the [data](/data) folder.

### Library

The library can also be easily integrated in your own code.
The following example shows how to create a GGNN Encoder to encode a batch of random graphs.

```python
import tensorflow as tf
import opengnn as ognn

tf.enable_eager_execution()

# build a batch of graphs with random initial features
edges = tf.SparseTensor(
    indices=[
        [0, 0, 0, 1], [0, 0, 1, 2],
        [1, 0, 0, 0],
        [2, 0, 1, 0], [2, 0, 2, 1], [2, 0, 3, 2], [2, 0, 4, 3]],
    values=[1, 1, 1, 1, 1, 1, 1],
    dense_shape=[3, 1, 5, 5])
node_features = tf.random_uniform((3, 5, 256))
graph_sizes = [3, 1, 5]

encoder = ognn.encoders.GGNNEncoder(1, 256)
outputs, state = encoder(
    edges,
    node_features,
    graph_sizes)

print(outputs)
```

Graphs are represented by a sparse adjacency matrix with dimensionality
`num_edge_types x num_nodes x num_nodes` and an initial distributed representation for each node.

Similarly to sequences, when batching we need to pad the graphs to the maximum number of nodes of any graph in the batch.
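
The padding scheme can be sketched in plain Python, without TensorFlow. The `pad_batch` helper below is hypothetical and uses dense nested lists rather than the sparse tensor the library expects; it only illustrates how three graphs of sizes 3, 1 and 5 end up in a `batch x num_edge_types x max_nodes x max_nodes` layout, using the same edges as the example above.

```python
def pad_batch(graphs, num_edge_types):
    """Pad graphs to a common node count and build dense adjacency lists.

    Each graph is (num_nodes, edges), with edges given as
    (edge_type, source, target) triples. Returns nested lists of shape
    [batch, num_edge_types, max_nodes, max_nodes] and the true graph sizes.
    """
    max_nodes = max(num_nodes for num_nodes, _ in graphs)
    batch = []
    for _, edges in graphs:
        # All-zero adjacency; rows and columns past the graph size stay padding.
        adjacency = [[[0] * max_nodes for _ in range(max_nodes)]
                     for _ in range(num_edge_types)]
        for edge_type, source, target in edges:
            adjacency[edge_type][source][target] = 1
        batch.append(adjacency)
    sizes = [num_nodes for num_nodes, _ in graphs]
    return batch, sizes


graphs = [
    (3, [(0, 0, 1), (0, 1, 2)]),
    (1, [(0, 0, 0)]),
    (5, [(0, 1, 0), (0, 2, 1), (0, 3, 2), (0, 4, 3)]),
]
batch, sizes = pad_batch(graphs, num_edge_types=1)
print(sizes)
print(len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
```

The true sizes are kept alongside the padded batch so that padded nodes can be masked out later, which is the role `graph_sizes` plays in the encoder call above.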


## Acknowledgments
The design of the library and implementations are based on
* [OpenNMT-tf](https://github.com/OpenNMT/OpenNMT-tf)
* [Gated Graph Neural Networks](https://github.com/Microsoft/gated-graph-neural-network-samples)

Since most of the code adapted from OpenNMT-tf is spread across multiple files, the license for the
library is located in the [base folder](/OPENNMT.LICENSE) rather than in the headers of the files.

## Reference

If you use this library in your own research, please cite

```
@inproceedings{
pfernandes2018structsumm,
title="Structured Neural Summarization",
author={Patrick Fernandes and Miltiadis Allamanis and Marc Brockschmidt },
booktitle={Proceedings of the 7th International Conference on Learning Representations (ICLR)},
year={2019},
url={https://arxiv.org/abs/1811.01824},
}
```






28 changes: 28 additions & 0 deletions data/chem/config.yml
@@ -0,0 +1,28 @@
# Example parameters (this does not cover every parameter)

model_dir: model_dir

data:
  train_graphs_file: molecules_graphs_train.jsonl
  train_labels_file: molecules_labels_train.jsonl

  eval_graphs_file: molecules_graphs_valid.jsonl
  eval_labels_file: molecules_labels_valid.jsonl

  node_vocabulary: node.vocab
  edge_vocabulary: edge.vocab


params:
  learning_rate: 0.001
  param_init: 0.1
  clip_gradients: 1.
  maximum_iterations: 250

train:
  batch_size: 64
  bucket_width: 1
  train_steps: 1000000
  maximum_features_size: 200
  maximum_labels_size: 50
  sample_buffer_size: 10000
104 changes: 104 additions & 0 deletions data/chem/get_data.py
@@ -0,0 +1,104 @@
import os
from rdkit import Chem
import glob
import json
import numpy as np

if not os.path.exists('data'):
    os.mkdir('data')
    print('made directory ./data/')

download_path = os.path.join('data', 'dsgdb9nsd.xyz.tar.bz2')
if not os.path.exists(download_path):
    print('downloading data to %s ...' % download_path)
    source = 'https://ndownloader.figshare.com/files/3195389'
    os.system('wget -O %s %s' % (download_path, source))
    print('finished downloading')

unzip_path = os.path.join('data', 'qm9_raw')
if not os.path.exists(unzip_path):
    print('extracting data to %s ...' % unzip_path)
    os.mkdir(unzip_path)
    os.system('tar xvjf %s -C %s' % (download_path, unzip_path))
    print('finished extracting')


def preprocess():
    index_of_mu = 4

    def read_xyz(file_path):
        with open(file_path, 'r') as f:
            lines = f.readlines()
            smiles = lines[-2].split('\t')[0]
            properties = lines[1].split('\t')
            mu = float(properties[index_of_mu])
        return {'smiles': smiles, 'mu': mu}

    print('loading train/validation split')
    with open('valid_idx.json', 'r') as f:
        valid_idx = json.load(f)['valid_idxs']
    valid_files = [os.path.join(unzip_path, 'dsgdb9nsd_%s.xyz' % i)
                   for i in valid_idx]

    print('reading data...')
    raw_data = {'train': [], 'valid': []}
    all_files = glob.glob(os.path.join(unzip_path, '*.xyz'))
    for file_idx, file_path in enumerate(all_files):
        if file_idx % 100 == 0:
            print('%.1f %% \r' %
                  (file_idx / float(len(all_files)) * 100), end=""),
        if file_path not in valid_files:
            raw_data['train'].append(read_xyz(file_path))
        else:
            raw_data['valid'].append(read_xyz(file_path))
    all_mu = [mol['mu'] for mol in raw_data['train']]
    mean_mu = np.mean(all_mu)
    std_mu = np.std(all_mu)

    def normalize_mu(mu):
        return (mu - mean_mu) / std_mu

    def onehot(idx, len):
        z = [0 for _ in range(len)]
        z[idx] = 1
        return z

    bond_dict = {'SINGLE': 0, 'DOUBLE': 1, 'TRIPLE': 2, "AROMATIC": 3}

    def to_graph(smiles):
        mol = Chem.MolFromSmiles(smiles)
        mol = Chem.AddHs(mol)
        edges = []
        nodes = []
        for bond in mol.GetBonds():
            edges.append((str(bond.GetBondType()),
                          bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
        for atom in mol.GetAtoms():
            nodes.append(atom.GetSymbol())
        return nodes, edges

    print('parsing smiles as graphs...')
    processed_graphs = {'train': [], 'valid': []}
    processed_labels = {'train': [], 'valid': []}
    for section in ['train', 'valid']:
        for i, (smiles, mu) in enumerate([(mol['smiles'], mol['mu']) for mol in raw_data[section]]):
            if i % 100 == 0:
                print('%s: %.1f %% \r' %
                      (section, 100*i/float(len(raw_data[section]))), end="")
            nodes, edges = to_graph(smiles)
            processed_graphs[section].append({
                'edges': edges,
                'node_labels': nodes
            })
            processed_labels[section].append([normalize_mu(mu)])

        print('%s: 100 %% ' % (section))
        with open('molecules_graphs_%s.jsonl' % section, 'w') as f:
            for graph in processed_graphs[section]:
                f.write(json.dumps(graph) + "\n")
        with open('molecules_labels_%s.jsonl' % section, 'w') as f:
            for label in processed_labels[section]:
                f.write(json.dumps(label) + "\n")


preprocess()
15 changes: 15 additions & 0 deletions data/chem/model.py
@@ -0,0 +1,15 @@
import opengnn as ognn


def model():
    return ognn.models.GraphRegressor(
        source_inputter=ognn.inputters.GraphEmbedder(
            edge_vocabulary_file_key="edge_vocabulary",
            node_embedder=ognn.inputters.TokenEmbedder(
                vocabulary_file_key="node_vocabulary",
                embedding_size=64)),
        target_inputter=ognn.inputters.FeaturesInputter(),
        encoder=ognn.encoders.GGNNEncoder(
            num_timesteps=[2, 2],
            node_feature_size=64),
        name="chemModelCustom")