This is the repository for the paper Controlled Language Generation for Language Learning Items, presented at the EMNLP 2022 Industry Track. The code is based heavily on Hugging Face's sequence-to-sequence Trainer examples.
Scripts were tested with Python 3.9 and transformers 4.6.1; no other dependencies should be required.
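As a minimal setup sketch (the environment name is illustrative and any environment manager works; torch is assumed as the transformers backend):

python -m venv c2s-env
source c2s-env/bin/activate
pip install torch transformers==4.6.1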
The data is provided as JSON Lines files, with each line containing the fields needed for concept-to-sequence generation with control. The data files are tracked with Git LFS, so Git LFS must be installed to fetch them.
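For example, after cloning you can pull the LFS-tracked files and inspect a single record (json.tool just pretty-prints the first line; the path matches the training file used below):

git lfs pull
head -n 1 data/concept2seq_train.jsonl | python -m json.tool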
To train, call the concept2seq.py script with --mode train, along with the required parameters, as in the example below. The "extras" parameter adds the control signal: it can be "srl", "wsd", or "cefr".
# Set a root directory
r=/home/nlp-text/dynamic/kstowe/github/concept-control-gen/
data_json=${r}/data/concept2seq_train.jsonl
# Substitute in your python
/home/conda/kstowe/envs/pretrain/bin/python $r/concept2seq.py \
--mode train \
--data_dir $data_json \
--output_dir $r/models/c2s_test \
--epochs 3 \
--batch_size 32 \
--model_path facebook/bart-base
# Add --extras srl (or wsd / cefr) to train with control
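To train with control enabled, pass --extras explicitly. A sketch with CEFR-level control, using the same parameters as above (the output directory name is illustrative; adjust paths to your environment):

/home/conda/kstowe/envs/pretrain/bin/python $r/concept2seq.py \
--mode train \
--data_dir $data_json \
--output_dir $r/models/c2s_cefr \
--epochs 3 \
--batch_size 32 \
--model_path facebook/bart-base \
--extras cefr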
Prediction works similarly: call the script with --mode test and the corresponding parameters.
# Set a nice root
r=/home/nlp-text/dynamic/kstowe/github/concept-control-gen/
/home/conda/kstowe/envs/pretrain/bin/python $r/concept2seq.py \
--mode test \
--output_path $r/outputs/test.txt \
--test_path ${r}/data/concept2seq_test.jsonl \
--model_path kevincstowe/concept2seq
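If the run succeeds, the generated sequences should be written to the file given by --output_path, and can be previewed directly (assuming plain-text output, one sequence per line):

head -n 5 $r/outputs/test.txt
wc -l $r/outputs/test.txt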