We use pipenv to manage dependencies and the virtual environment:
$ pipenv install
$ pipenv shell
(mac-graph-sjOzWQ6Y) $
You can watch the model predict values from the hold-back data:
$ python -m macgraph.predict --name my_dataset --model-version 0ds9f0s
predicted_label: shabby
actual_label: derilict
src: How <space> clean <space> is <space> 3 ? <unk> <eos> <eos>
-------
predicted_label: small
actual_label: medium-sized
src: How <space> big <space> is <space> 4 ? <unk> <eos> <eos>
-------
predicted_label: medium-sized
actual_label: tiny
src: How <space> big <space> is <space> 7 ? <unk> <eos> <eos>
-------
predicted_label: True
actual_label: True
src: Does <space> 1 <space> have <space> rail <space> connections ? <unk>
-------
predicted_label: True
actual_label: False
src: Does <space> 0 <space> have <space> rail <space> connections ? <unk>
-------
predicted_label: victorian
actual_label: victorian
src: What <space> architectural <space> style <space> is <space> 1 ? <unk>
TODO: Get it predicting from your typed input
To train the model, you need training data.
If you want to skip this step, you can download the pre-built data from our public dataset. This repo is a work in progress, so the format is still in flux.
The underlying data (a Graph-Question-Answer YAML from CLEVR-graph) must be pre-processed for training and evaluation. The YAML is transformed into TensorFlow records, and split into train-evaluate-predict tranches.
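Conceptually, the tranche split looks like the sketch below. This is illustrative only, not the actual build code: the real ratios and record format may differ (an 80/10/10 split is assumed here).

```python
import random

def split_tranches(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle records deterministically, then slice into three tranches.

    `ratios` is an assumed (train, eval, predict) split, not the repo's
    actual configuration.
    """
    records = list(records)
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train = int(n * ratios[0])
    n_eval = int(n * ratios[1])
    return {
        "train": records[:n_train],
        "eval": records[n_train:n_train + n_eval],
        "predict": records[n_train + n_eval:],
    }
```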
First generate a gqa.yaml with the command:
clevr-graph$ python -m gqa.generate --count 50000 --int-names
Then copy it into mac-graph:
clevr-graph$ cp data/gqa-some-id.yaml ../mac-graph/input_data/raw/my_dataset.yaml
Then build (that is, pre-process into a vocab table and tfrecords) the data:
mac-graph$ python -m macgraph.input.build --name my_dataset
Optional arguments:
- `--limit N` will only read N records from the YAML and only output a total of N TFRecords (split across the three tranches)
- `--type-string-prefix StationProperty` will filter to just the questions whose type string begins with "StationProperty"
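The two flags compose roughly as below. This is a hedged sketch, not the actual build code: the record key `type_string` and the order of operations (filter first, then cap the count) are assumptions.

```python
import itertools

def select_records(records, type_string_prefix=None, limit=None):
    """Mimic --type-string-prefix and --limit: filter, then cap the count."""
    if type_string_prefix is not None:
        # `type_string` is an assumed field name for the question type.
        records = (r for r in records
                   if r["type_string"].startswith(type_string_prefix))
    if limit is not None:
        records = itertools.islice(records, limit)
    return list(records)
```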
Let's build a model. (Note: this requires the training data built in the previous section.)
General advice is to have at least 40,000 training records (e.g. build them from 50,000 GQA triples).
mac-graph$ python -m macgraph.train --name my_dataset
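The 40,000-from-50,000 advice above follows from the train tranche getting roughly 80% of the records (an assumed split, not a figure from the repo). A quick back-of-envelope helper:

```python
import math

def gqa_count_for_train_size(train_records, train_fraction=0.8):
    """How many GQA triples to generate for a target number of training
    records, assuming `train_fraction` of records land in the train tranche."""
    return math.ceil(train_records / train_fraction)
```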