Initial commit of the open-source repo

CoderPat · Jan 27, 2019 · 5a4dc28 · 5a4dc28
commit 5a4dc28

Showing 48 changed files with 18,437 additions and 0 deletions.
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,22 @@
+MIT License
+
+Copyright (c) 2018-present The OpenGNN Authors.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
diff --git a/OPENNMT.LICENSE b/OPENNMT.LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2017-present The OpenNMT Authors.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,142 @@
+# OpenGNN
+
+OpenGNN is a machine learning library for learning over graph-structured data. It was built with generality in mind and supports tasks such as:
+
+* graph regression
+* graph-to-sequence mapping
+
+It supports various graph encoders including GGNNs, GCNs, SequenceGNNs and other variations of [neural graph message passing](https://arxiv.org/pdf/1704.01212.pdf).
+
+The library's design and usage patterns are inspired by [OpenNMT](https://github.com/OpenNMT/OpenNMT-tf), and it uses TensorFlow's [Dataset](https://www.tensorflow.org/programmers_guide/datasets) and [Estimator](https://www.tensorflow.org/programmers_guide/estimators) APIs.
+
+## Installation
+
+OpenGNN requires 
+
+* Python (>= 3.5)
+* TensorFlow (>= 1.10, < 2.0)
+
+To install the library as well as the command-line entry points, run
+
+``` pip install -e .```
+
+## Getting Started
+
+To experiment with the library, you can use one of the datasets provided in the [data](/data) folder.
+For example, to experiment with the chemical dataset, first install the `rdkit` library,
+which can be obtained by running `conda install -c rdkit rdkit`.
+Then, in the [data/chem](/data/chem) folder, run `python get_data.py` to download the dataset.
+
+After getting the data, generate node and edge vocabularies for it using
+```bash
+ognn-build-vocab --field_name node_labels --save_vocab node.vocab \
+                 molecules_graphs_train.jsonl
+ognn-build-vocab --no_pad_token --field_name edges --string_index 0 --save_vocab edge.vocab \
+                 molecules_graphs_train.jsonl
+```
+
+### Command Line
+
+The main entry point to the library is the `ognn-main` command
+
+```bash
+ognn-main <run_type> --model_type <model> --config <config_file.yml>
+```
+
+Currently there are two run types: `train_and_eval` and `infer`.
+
+For example, to train a model on the previously extracted chemical data
+(again inside [data/chem](/data/chem)) using a predefined model from the
+catalog, run
+
+```bash
+ognn-main train_and_eval --model_type chemModel --config config.yml
+```
+
+You can also define your own model in a custom Python script that exposes a `model` function.
+For example, we can train with the custom model defined in `model.py` using
+
+```bash
+ognn-main train_and_eval --model model.py --config config.yml
+```
+
+While the training script doesn't log the training to the standard output, 
+we can monitor training by using tensorboard on the model directory defined in
+[data/chem/config.yml](data/chem/config.yml).
+
+After training, we can perform inference on the validation file by running
+
+```bash
+ognn-main infer --model_type chemModel --config config.yml \
+                --features_file molecules_graphs_valid.jsonl \
+                --prediction_file molecules_predicted_valid.jsonl
+```
+
+
+Examples of other config files can be found in the [data](/data) folder.
+
+### Library
+
+The library can also be easily integrated into your own code.
+The following example shows how to create a GGNN Encoder to encode a batch of random graphs.
+
+```python
+import tensorflow as tf
+import opengnn as ognn
+
+tf.enable_eager_execution()
+
+# build a batch of graphs with random initial features
+edges = tf.SparseTensor(
+    indices=[
+        [0, 0, 0, 1], [0, 0, 1, 2],
+        [1, 0, 0, 0],
+        [2, 0, 1, 0], [2, 0, 2, 1], [2, 0, 3, 2], [2, 0, 4, 3]],
+    values=[1, 1, 1, 1, 1, 1, 1],
+    dense_shape=[3, 1, 5, 5])
+node_features = tf.random_uniform((3, 5, 256))
+graph_sizes = [3, 1, 5]
+
+encoder = ognn.encoders.GGNNEncoder(1, 256)
+outputs, state = encoder(
+    edges,
+    node_features,
+    graph_sizes)
+
+print(outputs)
+```
+
+Graphs are represented by a sparse adjacency matrix with dimensionality 
+`num_edge_types x num_nodes x num_nodes` and an initial distributed representation for each node.
+
+As with sequences, when batching we need to pad the graphs to the maximum number of nodes of any graph in the batch.
+
+
+## Acknowledgments
+The design of the library and implementations are based on 
+* [OpenNMT-tf](https://github.com/OpenNMT/OpenNMT-tf)
+* [Gated Graph Neural Networks](https://github.com/Microsoft/gated-graph-neural-network-samples)
+
+Since most of the code adapted from OpenNMT-tf is spread across multiple files, the OpenNMT license
+is located in the [base folder](/OPENNMT.LICENSE) rather than in the headers of the files.
+
+## Reference
+
+If you use this library in your own research, please cite
+
+```
+@inproceedings{
+    pfernandes2018structsumm,
+    title="Structured Neural Summarization",
+    author={Patrick Fernandes and Miltiadis Allamanis and Marc Brockschmidt },
+    booktitle={Proceedings of the 7th International Conference on Learning Representations (ICLR)},
+    year={2019},
+    url={https://arxiv.org/abs/1811.01824},
+}
+```
+
+
+
+
+
+
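Note on the README's Library section: the batched `tf.SparseTensor` in that example can be assembled from per-graph edge lists. The sketch below uses plain Python only and is not part of OpenGNN. The helper name `batch_edges` is hypothetical, and reading the last two index positions as (source, target) is an assumption, since the README does not state the orientation.

```python
def batch_edges(graphs, num_edge_types):
    """Build sparse indices for a padded batch of graphs.

    `graphs` is a list of (num_nodes, edges) pairs, where `edges` is a list
    of (edge_type, source, target) triples. Hypothetical helper, for
    illustration only.
    """
    # Every graph is padded to the size of the largest graph in the batch.
    max_nodes = max(num_nodes for num_nodes, _ in graphs)
    indices = []
    for graph_idx, (num_nodes, edges) in enumerate(graphs):
        for edge_type, source, target in edges:
            if not 0 <= edge_type < num_edge_types:
                raise ValueError("edge type out of range: %r" % (edge_type,))
            if not (0 <= source < num_nodes and 0 <= target < num_nodes):
                raise ValueError("edge endpoint outside graph %d" % graph_idx)
            indices.append([graph_idx, edge_type, source, target])
    dense_shape = [len(graphs), num_edge_types, max_nodes, max_nodes]
    graph_sizes = [num_nodes for num_nodes, _ in graphs]
    return indices, dense_shape, graph_sizes


# The three graphs from the README example: a 3-node chain, a single node
# with a self-loop, and a 5-node chain, all with one edge type.
example_graphs = [
    (3, [(0, 0, 1), (0, 1, 2)]),
    (1, [(0, 0, 0)]),
    (5, [(0, 1, 0), (0, 2, 1), (0, 3, 2), (0, 4, 3)]),
]
indices, dense_shape, graph_sizes = batch_edges(example_graphs, num_edge_types=1)
```

The returned `indices` and `dense_shape` can be passed to `tf.SparseTensor` together with a `values` list of ones, and `graph_sizes` tells the encoder which node positions are padding.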
diff --git a/data/chem/config.yml b/data/chem/config.yml
@@ -0,0 +1,28 @@
+# Example parameters (this does not cover every parameter)
+
+model_dir: model_dir
+
+data:
+  train_graphs_file: molecules_graphs_train.jsonl
+  train_labels_file: molecules_labels_train.jsonl
+
+  eval_graphs_file: molecules_graphs_valid.jsonl
+  eval_labels_file: molecules_labels_valid.jsonl
+
+  node_vocabulary: node.vocab
+  edge_vocabulary: edge.vocab
+
+
+params:
+  learning_rate: 0.001
+  param_init: 0.1
+  clip_gradients: 1.
+  maximum_iterations: 250
+
+train:
+  batch_size: 64
+  bucket_width: 1
+  train_steps: 1000000
+  maximum_features_size: 200
+  maximum_labels_size: 50
+  sample_buffer_size: 10000
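Note on the `train` block of `config.yml`: the commit does not document `bucket_width` or `maximum_features_size`. The sketch below shows one plausible reading, borrowed from how similarly named options behave in OpenNMT-tf: examples larger than the maximum are dropped, and the rest are grouped by size so that each batch needs little padding. Both function names are hypothetical and the semantics are an assumption, not OpenGNN's confirmed behavior.

```python
from collections import defaultdict


def filter_and_bucket(graph_sizes, maximum_features_size=200, bucket_width=1):
    """Group example indices by size bucket (assumed semantics)."""
    if bucket_width < 1:
        raise ValueError("bucket_width must be at least 1")
    buckets = defaultdict(list)
    for example_idx, size in enumerate(graph_sizes):
        if size > maximum_features_size:
            continue  # assumed: oversized graphs are skipped during training
        buckets[size // bucket_width].append(example_idx)
    return dict(buckets)


def make_batches(buckets, batch_size):
    """Split every bucket into batches of at most `batch_size` examples."""
    batches = []
    for bucket_id in sorted(buckets):
        members = buckets[bucket_id]
        for start in range(0, len(members), batch_size):
            batches.append(members[start:start + batch_size])
    return batches


sizes = [3, 5, 3, 250, 4, 5]  # number of nodes per graph (made-up values)
buckets = filter_and_bucket(sizes, maximum_features_size=200, bucket_width=1)
batches = make_batches(buckets, batch_size=2)
```

Under this reading, `bucket_width: 1` means only graphs with exactly the same number of nodes share a batch, so no padding is needed, at the cost of some batches being smaller than `batch_size`.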
diff --git a/data/chem/get_data.py b/data/chem/get_data.py
@@ -0,0 +1,104 @@
+import os
+from rdkit import Chem
+import glob
+import json
+import numpy as np
+
+if not os.path.exists('data'):
+    os.mkdir('data')
+    print('made directory ./data/')
+
+download_path = os.path.join('data', 'dsgdb9nsd.xyz.tar.bz2')
+if not os.path.exists(download_path):
+    print('downloading data to %s ...' % download_path)
+    source = 'https://ndownloader.figshare.com/files/3195389'
+    os.system('wget -O %s %s' % (download_path, source))
+    print('finished downloading')
+
+unzip_path = os.path.join('data', 'qm9_raw')
+if not os.path.exists(unzip_path):
+    print('extracting data to %s ...' % unzip_path)
+    os.mkdir(unzip_path)
+    os.system('tar xvjf %s -C %s' % (download_path, unzip_path))
+    print('finished extracting')
+
+
+def preprocess():
+    index_of_mu = 4
+
+    def read_xyz(file_path):
+        with open(file_path, 'r') as f:
+            lines = f.readlines()
+            smiles = lines[-2].split('\t')[0]
+            properties = lines[1].split('\t')
+            mu = float(properties[index_of_mu])
+        return {'smiles': smiles, 'mu': mu}
+
+    print('loading train/validation split')
+    with open('valid_idx.json', 'r') as f:
+        valid_idx = json.load(f)['valid_idxs']
+    valid_files = {os.path.join(unzip_path, 'dsgdb9nsd_%s.xyz' % i)
+                   for i in valid_idx}
+
+    print('reading data...')
+    raw_data = {'train': [], 'valid': []}
+    all_files = glob.glob(os.path.join(unzip_path, '*.xyz'))
+    for file_idx, file_path in enumerate(all_files):
+        if file_idx % 100 == 0:
+            print('%.1f %%    \r' %
+                  (file_idx / float(len(all_files)) * 100), end="")
+        if file_path not in valid_files:
+            raw_data['train'].append(read_xyz(file_path))
+        else:
+            raw_data['valid'].append(read_xyz(file_path))
+    all_mu = [mol['mu'] for mol in raw_data['train']]
+    mean_mu = np.mean(all_mu)
+    std_mu = np.std(all_mu)
+
+    def normalize_mu(mu):
+        return (mu - mean_mu) / std_mu
+
+    def onehot(idx, len):
+        z = [0 for _ in range(len)]
+        z[idx] = 1
+        return z
+
+    bond_dict = {'SINGLE': 0, 'DOUBLE': 1, 'TRIPLE': 2, "AROMATIC": 3}
+
+    def to_graph(smiles):
+        mol = Chem.MolFromSmiles(smiles)
+        mol = Chem.AddHs(mol)
+        edges = []
+        nodes = []
+        for bond in mol.GetBonds():
+            edges.append((str(bond.GetBondType()),
+                          bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
+        for atom in mol.GetAtoms():
+            nodes.append(atom.GetSymbol())
+        return nodes, edges
+
+    print('parsing smiles as graphs...')
+    processed_graphs = {'train': [], 'valid': []}
+    processed_labels = {'train': [], 'valid': []}
+    for section in ['train', 'valid']:
+        for i, (smiles, mu) in enumerate([(mol['smiles'], mol['mu']) for mol in raw_data[section]]):
+            if i % 100 == 0:
+                print('%s: %.1f %%      \r' %
+                      (section, 100*i/float(len(raw_data[section]))), end="")
+            nodes, edges = to_graph(smiles)
+            processed_graphs[section].append({
+                'edges': edges,
+                'node_labels': nodes
+            })
+            processed_labels[section].append([normalize_mu(mu)])
+
+        print('%s: 100 %%      ' % (section))
+        with open('molecules_graphs_%s.jsonl' % section, 'w') as f:
+            for graph in processed_graphs[section]:
+                f.write(json.dumps(graph) + "\n")
+        with open('molecules_labels_%s.jsonl' % section, 'w') as f:
+            for label in processed_labels[section]:
+                f.write(json.dumps(label) + "\n")
+
+
+preprocess()
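Note on `get_data.py`: the labels written to `molecules_labels_*.jsonl` are standardized with the mean and standard deviation of the training split, and the script does not save those two statistics. A model trained on these files therefore predicts in normalized units. The sketch below shows the transform and its inverse using only the standard library. The function names are hypothetical, and `statistics.pstdev` is used because it matches the population standard deviation that `np.std` computes by default.

```python
import statistics


def fit_normalizer(train_values):
    """Return (mean, std) computed on the training targets only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)  # population std, like np.std
    if std == 0:
        raise ValueError("targets are constant; cannot normalize")
    return mean, std


def normalize(value, mean, std):
    return (value - mean) / std


def denormalize(value, mean, std):
    """Map a normalized prediction back to the original scale."""
    return value * std + mean


train_mu = [1.0, 2.0, 3.0, 4.0]  # made-up target values
mean_mu, std_mu = fit_normalizer(train_mu)
restored = denormalize(normalize(3.2, mean_mu, std_mu), mean_mu, std_mu)
```

To report predictions on the original scale, `mean_mu` and `std_mu` would have to be persisted alongside the generated files, for example in a small JSON file.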
diff --git a/data/chem/model.py b/data/chem/model.py
@@ -0,0 +1,15 @@
+import opengnn as ognn
+
+
+def model():
+    return ognn.models.GraphRegressor(
+        source_inputter=ognn.inputters.GraphEmbedder(
+            edge_vocabulary_file_key="edge_vocabulary",
+            node_embedder=ognn.inputters.TokenEmbedder(
+                vocabulary_file_key="node_vocabulary",
+                embedding_size=64)),
+        target_inputter=ognn.inputters.FeaturesInputter(),
+        encoder=ognn.encoders.GGNNEncoder(
+            num_timesteps=[2, 2],
+            node_feature_size=64),
+        name="chemModelCustom")
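Note on `model.py`: the inputters look up `node_vocabulary` and `edge_vocabulary` in the config, and those files come from `ognn-build-vocab`. The sketch below illustrates what collecting such a vocabulary from the JSONL records could look like. It is not the tool's implementation: `build_vocab` is a hypothetical function, the frequency-based ordering is an assumption, and padding tokens (see `--no_pad_token`) are ignored. The two sample records are hand-written in the shape that `get_data.py` emits.

```python
import json
from collections import Counter


def build_vocab(jsonl_lines, field_name, string_index=None):
    """Collect the distinct tokens of one field, most frequent first.

    When `string_index` is given, each item of the field is a sequence and
    the token is taken from that position (e.g. the bond type of an edge).
    """
    counts = Counter()
    for line in jsonl_lines:
        record = json.loads(line)
        for item in record[field_name]:
            token = item if string_index is None else item[string_index]
            counts[token] += 1
    # Sort by descending count, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [token for token, _ in ordered]


sample_lines = [
    '{"edges": [["SINGLE", 0, 1], ["SINGLE", 0, 2]], '
    '"node_labels": ["O", "H", "H"]}',
    '{"edges": [["DOUBLE", 0, 1], ["SINGLE", 0, 2], ["SINGLE", 0, 3]], '
    '"node_labels": ["C", "O", "H", "H"]}',
]
node_vocab = build_vocab(sample_lines, "node_labels")
edge_vocab = build_vocab(sample_lines, "edges", string_index=0)
```

Here `string_index=0` mirrors the `--string_index 0` flag in the README, which selects the bond-type string from each `(type, begin, end)` edge triple.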