
# Multitarget TED Talks

This recipe builds systems from the TED Talks corpus at http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/. About 20 languages are available, so we can try many different language pairs.

## 1. Setup

First, download the data. It's about 575MB compressed and 1.8GB uncompressed.

```bash
sh ./0_download_data.sh
```
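When the script finishes, the corpus should be unpacked into a `multitarget-ted/` directory alongside this recipe (the later steps reference it as `../multitarget-ted/`). A quick, optional sanity check:

```bash
# Optional: confirm the download unpacked; expect roughly 1.8G.
du -sh multitarget-ted
```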

Then, set up the task for the language you are interested in. Let's do Chinese (zh) for now.
The following command creates a new working directory (zh-en) and populates it with several hyperparameter files:

```bash
sh ./1_setup_task.sh zh
cd zh-en
ls
```

You should see files like `rs1.hpm`, one of the hyperparameter files we will run with. This file specifies a BPE symbol size of 30k for source and target, 512-dim word embeddings, and 512-dim LSTM hidden units in a 1-layer seq2seq network. Further, the checkpoint frequency is 4000 updates, and all model information will be saved in `./rs1`.
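Hyperparameter files in this recipe are plain bash variable definitions that the downstream scripts source. Here is a minimal sketch of the kind of settings `rs1.hpm` encodes, matching the values above; the variable names are illustrative, so open the generated `rs1.hpm` for the authoritative ones:

```bash
# Illustrative sketch only -- check rs1.hpm itself for the real variable names.
modeldir=./rs1              # where checkpoints and logs are saved
bpe_symbols_src=30000       # BPE vocabulary size, source side
bpe_symbols_trg=30000       # BPE vocabulary size, target side
num_embed=512               # word embedding dimension
rnn_num_hidden=512          # LSTM hidden units per layer
num_layers=1                # seq2seq depth
checkpoint_frequency=4000   # save a checkpoint every 4000 updates
```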

## 2. Preprocessing and Training

First, make sure we are in the correct working directory.

```bash
pwd
```

All hyperparameter files and instructions below assume we are in `$rootdir/egs/ted/zh-en`, where `$rootdir` is the location of the sockeye-recipes installation.

Now, we can preprocess the tokenized training and dev data using BPE.

```bash
../../../scripts/preprocess-bpe.sh rs1.hpm
```

The resulting BPE vocabulary file (for English) is `data-bpe/train.bpe-30000.en.bpe_vocab`, and the segmented training file is `data-bpe/train.bpe-30000.en`. For Chinese, replace `en` by `zh`. These are the files we train on.
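To sanity-check the segmentation, peek at the first couple of lines of the BPE output. Assuming the standard subword-nmt convention, rare words are split into subwords joined by the `@@` continuation marker (the sample line in the comment is illustrative, not actual corpus text):

```bash
head -n 2 data-bpe/train.bpe-30000.en
# Illustrative shape of a BPE-segmented line:
#   this is a pl@@ atform for spreading ideas
```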

To train, we will use qsub and a GPU (on a GeForce GTX 1080 Ti, this should take about 4 hours):

```bash
qsub -S /bin/bash -V -cwd -q gpu.q -l gpu=1,h_rt=12:00:00,num_proc=2 -j y ../../../scripts/train.sh -p rs1.hpm -e sockeye_gpu
```

Alternatively, if using a local CPU:

```bash
../../../scripts/train.sh -p rs1.hpm -e sockeye_cpu
```
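Since the qsub job runs in the background, it helps to check on it while it trains. A hedged sketch, assuming checkpoints and the training log land in the model directory `rs1/` (the exact file names depend on `train.sh` and Sockeye's defaults):

```bash
qstat          # is the job still queued or running?
ls rs1/        # checkpoints should accumulate here as training proceeds
tail rs1/log   # assumed log location; adjust if train.sh logs elsewhere
```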

## 3. Evaluation

Again, make sure we are in the correct working directory (`$rootdir/egs/ted/zh-en`).

```bash
pwd
```

The test set we want to translate is `../multitarget-ted/en-zh/tok/ted_test1_en-zh.tok.zh`. We translate it using rs1 via qsub on a GPU (this should take 10 minutes or less):

```bash
qsub -S /bin/bash -V -cwd -q gpu.q -l gpu=1,h_rt=00:30:00 -j y ../../../scripts/translate.sh -p rs1.hpm -i ../multitarget-ted/en-zh/tok/ted_test1_en-zh.tok.zh -o rs1/ted_test1_en-zh.tok.en.1best -e sockeye_gpu
```

Alternatively, to translate using a local CPU:

```bash
../../../scripts/translate.sh -p rs1.hpm -i ../multitarget-ted/en-zh/tok/ted_test1_en-zh.tok.zh -o rs1/ted_test1_en-zh.tok.en.1best -e sockeye_cpu
```

When this is finished, we have the translations in `rs1/ted_test1_en-zh.tok.en.1best`. We can now compute the BLEU score:

```bash
../../../tools/multi-bleu.perl ../multitarget-ted/en-zh/tok/ted_test1_en-zh.tok.en < rs1/ted_test1_en-zh.tok.en.1best
```

This should give a BLEU score of around 10.58.
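`multi-bleu.perl` reports a single summary line; its general shape is sketched below, where everything except the overall score is a placeholder, since the n-gram precisions and lengths vary from run to run:

```
BLEU = 10.58, <1-gram>/<2-gram>/<3-gram>/<4-gram> (BP=<brevity penalty>, ratio=<...>, hyp_len=<...>, ref_len=<...>)
```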

## 4. Train systems for different language pairs

Let's repeat the above steps (1-3) on Arabic (ar). First, return to the parent directory (`$rootdir/egs/ted/`) and set up the task.

```bash
cd ../
sh ./1_setup_task.sh ar
cd ar-en
```

Now, we preprocess and train:

```bash
../../../scripts/preprocess-bpe.sh rs1.hpm
qsub -S /bin/bash -V -cwd -q gpu.q -l gpu=1,h_rt=12:00:00,num_proc=2 -j y ../../../scripts/train.sh -p rs1.hpm -e sockeye_gpu
```

Finally, we translate and measure BLEU:

```bash
qsub -S /bin/bash -V -cwd -q gpu.q -l gpu=1,h_rt=00:30:00 -j y ../../../scripts/translate.sh -p rs1.hpm -i ../multitarget-ted/en-ar/tok/ted_test1_en-ar.tok.ar -o rs1/ted_test1_en-ar.tok.en.1best -e sockeye_gpu
../../../tools/multi-bleu.perl ../multitarget-ted/en-ar/tok/ted_test1_en-ar.tok.en < rs1/ted_test1_en-ar.tok.en.1best
```

The BLEU score should be around 20.52.

For different source languages, just replace `ar` with the desired language code, e.g. `de`; a loop over all of them is sketched below.
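To build several systems in one go, the per-language steps can be wrapped in a loop, run from the parent directory `$rootdir/egs/ted/`. A minimal sketch, assuming the same qsub setup as above and the language codes from the results table below:

```bash
# Illustrative loop over the language pairs evaluated below.
for src in ar de fa ko ru zh; do
    sh ./1_setup_task.sh $src
    (
        cd ${src}-en
        ../../../scripts/preprocess-bpe.sh rs1.hpm
        qsub -S /bin/bash -V -cwd -q gpu.q -l gpu=1,h_rt=12:00:00,num_proc=2 -j y \
            ../../../scripts/train.sh -p rs1.hpm -e sockeye_gpu
    )
done
```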

The test set BLEU scores of the various tasks are:

| Task  | rs1   |
|-------|-------|
| ar-en | 20.52 |
| de-en | 25.65 |
| fa-en | 14.62 |
| ko-en | 8.45  |
| ru-en | 16.18 |
| zh-en | 10.58 |

(TODO: the BLEU numbers need to be updated to reflect various changes. They are currently here just for reference.)