Source code for our ACL 2024 paper 'Decoder-only Streaming Transformer for Simultaneous Translation'
Highlights
- Our method utilizes the Decoder-only architecture to determine the policy and translation concurrently.
- Our method alleviates the training and inference costs associated with using a Decode-only architecture.
- Our method attains the state-of-the-art performance on evaluation datasets.
-
Python version = 3.8
-
PyTorch version = 2.1
-
Install our library:
git clone https://github.com/ictnlp/DST
cd DST
pip install --editable ./
We use the data of IWSLT15 English-Vietnamese (download here), WMT16 English-Romanian (download here) and WMT15 German-English (download here).
For WMT15 German-English, we tokenize the corpus via mosesdecoder/scripts/tokenizer/normalize-punctuation.perl and apply BPE with 32K merge operations via subword_nmt/apply_bpe.py.
Then, we process the data into the fairseq format, adding --joined-dictionary
for all tasks:
SRC=source_language
TGT=target_language
TRAIN_DATA=path_to_training_data
VALID_DATA=path_to_valid_data
TEST_DATA=path_to_test_data
DATA=path_to_processed_data
# add --joined-dictionary for WMT15 German-English
fairseq-preprocess --source-lang ${SRC} --target-lang ${TGT} \
--trainpref ${TRAIN_DATA} --validpref ${VALID_DATA} \
--testpref ${TEST_DATA}\
--destdir ${DATA}
Our approach involves full-sentence pre-training and simultaneous translation fine-tuning. The pre-training phase will utilize a full-sentence cross-entropy loss, while the simultaneous translation phase will employ the method described in our paper. The scripts of the training stage can be found in train_scripts.
Considering the amount of training data for the WMT15 German-English task, we provide the corresponding models [baidu].
In this phase, we train the Decoder-only model with the following full-sentence training loss to obtain good initial parameters .
export CUDA_VISIBLE_DEVICES=0,1,2,3
TGT_FILE=model_path
MODELFILE=checkpoints/${TGT_FILE}
DATAFILE=data_file
python train.py --ddp-backend=no_c10d ${DATAFILE} --arch transformer_lm \
--task translation \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--dropout 0.3 \
--decoder-attention-heads 8 \
--decoder-layers 16 \
--criterion decoder_only \
--report-accuracy \
--label-smoothing 0.1 \
--save-dir ${MODELFILE} \
--max-tokens 8192 --update-freq 1 \
--skip-invalid-size-inputs-valid-test \
--keep-best-checkpoints 10 \
--best-checkpoint-metric loss \
--fp16 \
--max-target-positions 1024 \
--tokens-per-sample 1024 \
--is-pretrain \
--log-interval 100
In this phase, we train the Decoder-only model to learn to proficiently translate within streaming scenarios.
export CUDA_VISIBLE_DEVICES=0,1,2,3
TRAIN_THRESHOLD=0.45
CURR_STEPS=50000
TGT_FILE=ckpt_${TRAIN_THRESHOLD}_${CURR_STEPS}
MODELFILE=checkpoints/${TGT_FILE}
DATAFILE=data_file
python train.py --ddp-backend=no_c10d ${DATAFILE} --arch transformer_lm \
--task translation \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--dropout 0.3 \
--decoder-attention-heads 8 \
--decoder-layers 16 \
--criterion dst_loss \
--report-accuracy \
--label-smoothing 0.1 \
--save-dir ${MODELFILE} \
--max-tokens 2048 --update-freq 4 \
--skip-invalid-size-inputs-valid-test \
--keep-best-checkpoints 10 \
--best-checkpoint-metric loss \
--max-target-positions 1024 \
--tokens-per-sample 1024 \
--reset-optimizer \
--reset-lr-scheduler \
--reset-meters \
--reset-dataloader \
--train-threshold ${TRAIN_THRESHOLD} \
--curr-steps ${CURR_STEPS} \
--log-interval 100
We evaluate the model with the following command. It can be found in inference_scripts.
export CUDA_VISIBLE_DEVICES=0
TGT_FILE=model_path
MODELFILE=checkpoints/${TGT_FILE}
DATAFILE=data_file
THRESHOLD=infer_threshold
REFERENCE=ref.txt
python fairseq_cli/generate.py ${DATAFILE} --path ${MODELFILE} --batch-size 100 --beam 1 --remove-bpe --threshold ${THRESHOLD} --preRead 2 --soft-attention > decoding.txt
grep ^H decoding.txt | cut -f1,3- | cut -c3- | sort -k1n | cut -f2- > translation.txt
./multi-bleu.perl -lc ${ref.txt} < translation.txt
If you have any questions, please feel free to submit an issue or contact [email protected].
If our work is useful for you, please cite as:
@inproceedings{dst,
title = "Decoder-only Streaming Transformer for Simultaneous Translation",
author = "Guo, Shoutao and
Zhang, Shaolei and
Feng, Yang",
booktitle = "Proceedings of the 62th Annual Meeting of the Association for Computational Linguistics (Long Papers)",
month = Aug,
year = "2024",
publisher = "Association for Computational Linguistics",
address = "Bangkok, Thailand",
}