To train and test our acceptance classifier on the ICLR 2017 section, please use the following command, starting with ./code
as your working directory:
cd ./accept_classify/
The script
first calls
to generate features (~35 minutes on ICLR_2017 dataset) then calls
to train and evaluate the classifier (~5 seconds on ICLR_2017 dataset).
Here is a brief description of the scripts involved.
creates (hand-authored and lexical) features for the baseline classifiers and saves them under the dataset folder in each split. This code loads review/paper text and outputs feature vectors in ./{train,dev,test}/dataset/
. You should specify the type of lexical encoder (e.g., w2v, bow, None) and whether to use hand-authored features.
trains a linear classifier using cross-validation and selects the best model on the dev set. This code loads the feature vectors from the previous step and outputs accuracies on train/dev/test.
contains the different embedding vectorizers and an embedding loader.
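To illustrate what a "bow" lexical encoder produces, here is a minimal bag-of-words sketch. It is not the repository's featurizer: the function names `build_vocab` and `bow_vector`, the whitespace tokenization, and the `max_size` cutoff are all assumptions made for this example.

```python
from collections import Counter


def build_vocab(texts, max_size=1000):
    """Map the most frequent lowercase tokens to column indices (hypothetical helper)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}


def bow_vector(text, vocab):
    """Return a count vector for `text`; out-of-vocabulary tokens are ignored."""
    vec = [0] * len(vocab)
    for tok in text.lower().split():
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec


# Toy stand-ins for review/paper text; the vocabulary is built on training text only.
train_texts = ["we propose a new model", "a new dataset of reviews"]
vocab = build_vocab(train_texts)
features = [bow_vector(t, vocab) for t in train_texts]
```

Building the vocabulary from the training split alone and reusing it for dev/test keeps the feature columns aligned across splits.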
To train and test our aspect predictor, please use the following command:
cd ./accept_classify/
python "../../data/iclr_2017" {"all","review","paper"} {"dan","rnn","cnn"} {0,1,2,3,4,5,6,7,8}
Here is a brief description of each file.
- "" contains three prediction models: RNN, DAN, and CNN
- "" contains some utility functions for loading data
- "" contains configurations for each prediction model
- "" trains a classifier for predicting the review score of each aspect (e.g., recommendation, clarity)
- "" aggregates annotated scores (i.e., annotation_full.tsv) into the ICLR_2017 reviews.
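The training command above takes one value from each of three option sets. As a rough sketch, the full grid of argument combinations can be enumerated as follows. The names `INPUT_TYPES`, `MODELS`, `LAST_ARG_VALUES`, and `argument_grid` are made up for this example, and because the text does not name the final 0-8 argument it is treated here as an opaque index.

```python
from itertools import product

DATA_DIR = "../../data/iclr_2017"
INPUT_TYPES = ["all", "review", "paper"]   # second argument of the command
MODELS = ["dan", "rnn", "cnn"]             # third argument of the command
LAST_ARG_VALUES = range(9)                 # fourth argument, 0 through 8 (meaning not stated here)


def argument_grid():
    """Return every (data_dir, input_type, model, index) combination as strings."""
    return [
        (DATA_DIR, inp, model, str(k))
        for inp, model, k in product(INPUT_TYPES, MODELS, LAST_ARG_VALUES)
    ]


runs = argument_grid()
print(len(runs), "argument combinations")
```

Each tuple could be appended to the training command to run one configuration at a time.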
All of our datasets except NIPS are already preprocessed. For crawling and preprocessing the NIPS data, please follow the instructions under ./data/nips_2013-2017/. All other crawlers are available upon request.
If you would like to crawl the raw dataset and reproduce the data configuration used in the paper, please use the following command:
python ../../data/arxiv/{arxiv.cs.cl_2007-2017,acl_2017,...}
Please make sure that the pdfs/ and reviews/ directories exist and contain the raw PDFs and reviews. Note that each review should be a JSON file following the code/model/ class. The script then randomly splits them into train/dev/test (0.9/0.05/0.05) under {train/dev/test}/{pdfs/reviews} and runs science-parse on the PDFs, writing the output to {train/dev/test}/{parsed_pdfs}.
Also, download the science parser from here and place the science-parse-cli-assembly-1.2.9-SNAPSHOT.jar file under ./code/lib/.
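The random 0.9/0.05/0.05 split described above can be sketched as follows. This is an illustration only, not the repository's splitting code: the function name `split_ids`, the fixed `seed`, and the `paper_N` identifiers are assumptions.

```python
import random


def split_ids(ids, ratios=(0.9, 0.05, 0.05), seed=0):
    """Shuffle `ids` and cut them into train/dev/test lists according to `ratios`."""
    ids = sorted(ids)                      # fixed starting order so the seed fully determines the split
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_dev = round(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],     # remainder goes to test
    }


# Hypothetical paper identifiers; in practice these would come from the pdfs/ directory.
splits = split_ids([f"paper_{i}" for i in range(100)])
print({name: len(members) for name, members in splits.items()})
```

Splitting on paper identifiers, rather than on individual files, keeps each paper's PDF and its reviews in the same split.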