This repository contains the code necessary to reproduce the results in the paper:
Document Modeling with External Attention for Sentence Extraction, Shashi Narayan, Ronald Cardenas, Nikos Papasarantopoulos, Shay B. Cohen, Mirella Lapata, Jiangsheng Yu and Yi Chang, ACL 2018, Melbourne, Australia.
To train XNet+ (Title + Caption), run:
python document_summarizer_gpu2.py --max_title_length 1 --max_image_length 10 --train_dir --model_to_load 8 --exp_mode train
from extractive_summ/.
- Datasets and Resources
a) NewsQA
Download the combined dataset from: https://datasets.maluuba.com/NewsQA/dl
Download splitting scripts from NewsQA repo: https://github.com/Maluuba/newsqa
b) SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
c) WikiQA: https://www.microsoft.com/en-us/download/details.aspx?id=52419
d) MS MARCO: http://www.msmarco.org/dataset.aspx
e) 1 billion words benchmark: http://www.statmt.org/lm-benchmark/
- Preprocessing
First, train word embeddings on the 1 billion words (1BW) benchmark using word2vec and place the resulting files in answer_selection/datasets/word_emb.
Generate the score files (IDF, ISF, word counts) for each dataset by running
python reformat_corpus.py
from answer_selection/datasets//
The preprocessed files will be placed in the folder: answer_selection/datasets/preprocessed_data/
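For orientation, the kind of score tables mentioned above (IDF, ISF, word counts) can be sketched as follows. This is a minimal illustration, not the code in reformat_corpus.py: the function name score_tables is hypothetical, ISF is assumed here to mean inverse sentence frequency, and the unsmoothed log(N / frequency) form is an assumption that may differ from what the repository computes.

```python
import math
from collections import Counter


def score_tables(documents):
    """Compute word counts, IDF and ISF for a tokenized corpus.

    documents: list of documents, each a list of sentences,
    each sentence a list of tokens.
    NOTE: hypothetical helper for illustration; the unsmoothed
    log(N / frequency) definition is an assumption.
    """
    word_counts = Counter()  # total occurrences of each word
    doc_freq = Counter()     # number of documents containing each word
    sent_freq = Counter()    # number of sentences containing each word
    n_sents = 0

    for doc in documents:
        doc_vocab = set()
        for sent in doc:
            n_sents += 1
            word_counts.update(sent)
            sent_freq.update(set(sent))  # count each word once per sentence
            doc_vocab.update(sent)
        doc_freq.update(doc_vocab)       # count each word once per document

    n_docs = len(documents)
    idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}
    isf = {w: math.log(n_sents / sf) for w, sf in sent_freq.items()}
    return word_counts, idf, isf


# Toy corpus: two documents, three sentences in total.
docs = [
    [["the", "cat", "sat"], ["the", "dog", "ran"]],
    [["a", "cat", "slept"]],
]
word_counts, idf, isf = score_tables(docs)
```

In this toy corpus, "cat" appears in both documents, so its IDF is zero, while "dog" appears in only one of the three sentences and receives the highest ISF.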
- Training
To train a model, run the run_ scripts in the corresponding model folder.
- Evaluation
To evaluate a model, run the eval_ scripts in the corresponding model folder.