Code for the paper:
Structured Attention Networks
Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush
ICLR 2017
- Python: h5py, numpy
- Lua: nn, nngraph, cutorch, cunn
We additionally require a custom cuda-mod package, which implements some custom CUDA functions for the linear-chain CRF. This can be installed via
git clone https://github.com/harvardnlp/cuda-mod
cd cuda-mod && luarocks install rocks/cuda-mod-1.0-0.rockspec
The structured attention layers described in the paper can be found under the folder models/. Specifically:
- CRF.lua: Segmentation attention layer (i.e. linear-chain CRF)
- EisnerCRF.lua: Syntactic attention layer (i.e. first-order graph-based dependency parser)
These layers are modular and can be plugged into other deep models. We use them in place of standard simple (softmax) attention layers for neural machine translation, natural language inference, and question answering (see below).
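To illustrate what a linear-chain CRF attention layer computes, here is a minimal pure-Python sketch of the forward-backward algorithm, which yields the per-position marginals that play the role of attention weights. It is not the code in CRF.lua (which relies on the custom CUDA functions); the function names, the shared pairwise table, and the example potentials are all assumptions made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def chain_marginals(unary, pairwise):
    """Marginals of a linear-chain CRF via forward-backward (log space).

    unary[i][s]    : log-potential of state s at position i
    pairwise[a][b] : log-potential of moving from state a to state b
                     (assumed shared across positions in this sketch)
    Returns marg where marg[i][s] = p(z_i = s).
    """
    n, k = len(unary), len(unary[0])

    # Forward pass: alpha[i][b] scores all prefixes ending in state b at i.
    alpha = [list(unary[0])]
    for i in range(1, n):
        alpha.append([
            unary[i][b]
            + logsumexp([alpha[i - 1][a] + pairwise[a][b] for a in range(k)])
            for b in range(k)
        ])

    # Backward pass: beta[i][a] scores all suffixes after position i.
    beta = [[0.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = [
            logsumexp([pairwise[a][b] + unary[i + 1][b] + beta[i + 1][b]
                       for b in range(k)])
            for a in range(k)
        ]

    log_z = logsumexp(alpha[-1])  # log partition function
    return [[math.exp(alpha[i][s] + beta[i][s] - log_z) for s in range(k)]
            for i in range(n)]


# Hypothetical example: binary states (0 = not selected, 1 = selected)
# with a pairwise table that favours staying in the same state.
example_unary = [[0.0, 1.0], [0.0, -1.0], [0.0, 0.5]]
example_pairwise = [[1.0, -1.0], [-1.0, 1.0]]
attention = [m[1] for m in chain_marginals(example_unary, example_pairwise)]
```

With an all-zero pairwise table the positions become independent and each marginal reduces to a sigmoid of the unary score difference, which is a convenient sanity check.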
The Japanese-English data used for the paper can be downloaded by following the instructions at http://lotus.kuee.kyoto-u.ac.jp/ASPEC
To preprocess the data, run
python preprocess-nmt.py --srcfile path-to-source-train --targetfile path-to-target-train \
    --srcvalfile path-to-source-val --targetvalfile path-to-target-val --outputfile data/nmt
See the preprocess-nmt.py file for other arguments like maximum sequence length, vocabulary size, batch size, etc.
Baseline simple (i.e. softmax) attention model
th train-nmt.lua -data_file path-to-train -val_data_file path-to-val -attn softmax -savefile nmt-simple
Sigmoid attention
th train-nmt.lua -data_file path-to-train -val_data_file path-to-val -attn sigmoid -savefile nmt-sigmoid
Structured attention (i.e. segmentation attention)
th train-nmt.lua -data_file path-to-train -val_data_file path-to-val -attn crf -savefile nmt-struct
Here path-to-train and path-to-val are the *.hdf5 files from running preprocess-nmt.py. You can add -gpuid 1 to use the (first) GPU, and change the argument to -savefile if you wish to save to a different path.
Note: structured attention only works with the GPU.
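For intuition on how the two unstructured -attn options differ, the sketch below contrasts softmax and sigmoid attention weights over the same scores. It is a plain-Python illustration under the usual definitions, not code taken from train-nmt.lua; the function names and scores are hypothetical.

```python
import math


def softmax_attention(scores):
    """Weights compete across positions and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_attention(scores):
    """Each position is gated independently; weights need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


scores = [2.0, 0.0, -1.0]  # hypothetical alignment scores
soft = softmax_attention(scores)
gate = sigmoid_attention(scores)
```

The structured (crf) option differs from both in that positions are coupled through pairwise potentials instead of being normalized jointly or treated independently.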
th predict-nmt.lua -src_file path-to-source-test -targ_file path-to-target-test \
    -src_dict path-to-source-dict -targ_dict path-to-target-dict -output_file pred.txt
-src_dict and -targ_dict are the *.dict files created from running preprocess-nmt.py. The argument to -targ_file is optional. The code will output predictions to pred.txt, and you can again add -gpuid 1 to use the GPU.
Evaluation is done with the multi-bleu.perl script from Moses.
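As a sketch of how this evaluation step might be scripted, the helper below wraps the usual multi-bleu.perl invocation, which takes the reference file as an argument and reads hypotheses from standard input. The helper names and the paths in the commented usage line are hypothetical and not part of this repository.

```python
import subprocess


def bleu_command(script_path, reference_path):
    """Build the argv for: perl multi-bleu.perl reference < hypotheses."""
    return ["perl", script_path, reference_path]


def run_bleu(script_path, reference_path, hypothesis_path):
    """Run multi-bleu.perl and return its stdout (the BLEU summary line)."""
    with open(hypothesis_path, "rb") as hyp:
        result = subprocess.run(
            bleu_command(script_path, reference_path),
            stdin=hyp,              # hypotheses are read from stdin
            capture_output=True,
            text=True,
            check=True,             # raise if the script exits non-zero
        )
    return result.stdout.strip()


# Hypothetical usage, assuming these files exist locally:
# print(run_bleu("multi-bleu.perl", "path-to-target-test", "pred.txt"))
```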
The Stanford Natural Language Inference (SNLI) dataset can be downloaded from http://nlp.stanford.edu/projects/snli/
Pre-trained GloVe embeddings can be downloaded from http://nlp.stanford.edu/projects/glove/
First we need to process the SNLI data:
python process-snli.py --data_folder path-to-snli-folder --out_folder path-to-output-folder
Then run:
python preprocess-entail.py --srcfile path-to-sent1-train --targetfile path-to-sent2-train \
    --labelfile path-to-label-train --srcvalfile path-to-sent1-val --targetvalfile path-to-sent2-val \
    --labelvalfile path-to-label-val --srctestfile path-to-sent1-test --targettestfile path-to-sent2-test \
    --labeltestfile path-to-label-test --outputfile data/entail --glove path-to-glove
Here path-to-sent1-train is the path to the src-train.txt file created from running process-snli.py (and path-to-sent2-train = targ-train.txt, path-to-label-train = label-train.txt, etc.)
preprocess-entail.py will create the data hdf5 files. Vocabulary is based on the pretrained GloVe embeddings, with path-to-glove being the path to the pretrained GloVe word vectors (i.e. the glove.840B.300d.txt file). sent1 is the premise and sent2 is the hypothesis.
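To make the idea of a GloVe-derived vocabulary concrete, here is a minimal sketch of reading the text format (one token per line followed by its vector components, separated by spaces). It is not the logic in preprocess-entail.py; the function names, the 0-based indexing, and the absence of special tokens are assumptions. Splitting from the right guards against tokens that themselves contain spaces, which some GloVe releases are reported to include.

```python
def parse_glove_line(line, dim):
    """Split one GloVe text line into (token, vector).

    The last `dim` fields are the vector; everything before them is the
    token, so tokens with embedded spaces are kept intact.
    """
    parts = line.rstrip("\n").split(" ")
    word = " ".join(parts[:-dim])
    vector = [float(x) for x in parts[-dim:]]
    return word, vector


def build_vocab(lines, dim, max_words=None):
    """Map each token to an index in order of first appearance.

    Indexing here is 0-based and purely illustrative.
    """
    word_to_idx = {}
    for line in lines:
        if max_words is not None and len(word_to_idx) >= max_words:
            break
        word, _ = parse_glove_line(line, dim)
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)
    return word_to_idx


# Hypothetical 3-dimensional sample; the real file uses 300 dimensions.
sample = ["the 0.1 0.2 0.3\n", ". . . 1.0 2.0 3.0\n", "cat 0.4 0.5 0.6\n"]
vocab = build_vocab(sample, dim=3)
```

For the real file, the lines would come from iterating over an open file handle, with dim set to 300.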
Now run:
python get_pretrain_vecs.py --glove path-to-glove --outputfile data/glove.hdf5 \
    --dictionary path-to-dict
path-to-dict is the *.word.dict file created from running preprocess-entail.py.
Baseline model (i.e. no intra-sentence attention)
th train-entail.lua -attn none -data_file path-to-train -val_data_file path-to-val \
    -test_data_file path-to-test -pre_word_vecs path-to-word-vecs -savefile entail-baseline
Simple attention (i.e. softmax attention)
th train-entail.lua -attn simple -data_file path-to-train -val_data_file path-to-val \
    -test_data_file path-to-test -pre_word_vecs path-to-word-vecs -savefile entail-simple
Structured attention (i.e. syntactic attention)
th train-entail.lua -attn struct -data_file path-to-train -val_data_file path-to-val \
    -test_data_file path-to-test -pre_word_vecs path-to-word-vecs -savefile entail-struct
Here path-to-word-vecs is the hdf5 file created from running get_pretrain_vecs.py, and path-to-train, path-to-val, and path-to-test are the *.hdf5 files created from running preprocess-entail.py. You can add -gpuid 1 to use the (first) GPU, and change the argument to -savefile if you wish to save to a different path.
The baseline model essentially replicates "A Decomposable Attention Model for Natural Language Inference" (Parikh et al., EMNLP 2016). The differences are that we use a hidden layer size of 300 (they use 200), a batch size of 32 (they use 4), and train for 100 epochs (they train for 400 epochs with asynchronous SGD).
See train-entail.lua (or the paper) for hyperparameters and more training options.
The bAbI dataset can be downloaded in all versions from https://research.fb.com/projects/babi/. Alternatively, a copy of v1.0, which this code was tested on, is available at https://github.com/harvardnlp/MemN2N/tree/master/babi_data/en. The latter is the 1k set, where each task includes 1,000 questions.
First run:
python preprocess-qa.py -dir input-data-path -vocabsize max-vocabulary-size
This will create the data hdf5 files. Vocabulary is based on the input data, and will be written to word_to_idx.csv.
For the baseline model, see our MemN2N implementation.
To train structured attention with binary-potential CRF, run:
th train-qa.lua -datafile data-file.hdf5 -classifier classifier-type
Here data-file.hdf5 is the hdf5 file created from running preprocess-qa.py and the classifier is either binarycrf or unarycrf. You can add -cuda to use the (first) GPU, and add -save -saveminacc number if you wish to save the model (only if the accuracy on the test set is at least the specified value). To train with Position Encoding or Temporal Encoding (as described in End-To-End Memory Networks, Sukhbaatar et al., NIPS 2015), use -pe and -te respectively. Note that some default parameters (such as embedding size, max history, etc.) are different from those used in the MemN2N paper. In addition, this code implements a 2-step CRF, which is tested only on bAbI tasks with 2 supporting facts (although it should in theory work for all tasks).
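For readers unfamiliar with Position Encoding, the sketch below computes the weighting from Sukhbaatar et al., l_kj = (1 - j/J) - (k/d)(1 - 2j/J), where J is the number of words in the sentence, d is the embedding size, and j and k are 1-based. This is an illustration of the published formula, not the code behind the -pe flag; the function names are hypothetical and the implementation in train-qa.lua may differ in details.

```python
def position_encoding(sentence_len, embed_dim):
    """Return weights[j-1][k-1] = l_kj for 1-based word j and dimension k."""
    J, d = sentence_len, embed_dim
    return [
        [(1 - j / J) - (k / d) * (1 - 2 * j / J) for k in range(1, d + 1)]
        for j in range(1, J + 1)
    ]


def encode_sentence(word_vectors):
    """Sum word embeddings after element-wise position weighting."""
    J, d = len(word_vectors), len(word_vectors[0])
    weights = position_encoding(J, d)
    return [
        sum(weights[j][k] * word_vectors[j][k] for j in range(J))
        for k in range(d)
    ]


# Hypothetical two-word sentence with 2-dimensional embeddings.
memory = encode_sentence([[1.0, 1.0], [1.0, 1.0]])
```

Unlike a plain bag-of-words sum, this weighting makes the sentence representation depend on word order.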
See train-qa.lua (or the paper) for hyperparameters and more training options.
MIT