This is a PyTorch implementation for the ACL 2022 main conference paper STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation.
Let's first take a look at training an En-De model as an example.
- Clone this repository:
```
git clone git@github.com:ictnlp/STEMM.git
cd STEMM/
```
- Install Montreal Forced Aligner (MFA) following the official guidance. Please also download the pretrained acoustic models and dictionary for MFA (one possible setup is sketched below).
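A minimal sketch of the MFA setup, assuming a conda-based installation of MFA 2.x; the model and dictionary names used here (`english_us_arpa`) are assumptions and may differ across MFA versions, so check the official documentation:

```
# install MFA in its own conda environment (assumed MFA 2.x)
conda create -n aligner -c conda-forge montreal-forced-aligner
conda activate aligner
# download a pretrained English acoustic model and dictionary (names may differ across MFA versions)
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa
```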
- Please make sure you have installed PyTorch, and then install fairseq and other packages as follows:
```
pip install --editable ./
python3 setup.py install --user
python3 setup.py build_ext --inplace
pip install inflect sentencepiece soundfile textgrid pandas
```
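You can optionally check that the installation succeeded, for example:

```
# quick sanity check that PyTorch and fairseq are importable
python3 -c "import torch, fairseq; print(torch.__version__, fairseq.__version__)"
```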
- First make a directory to store the dataset:
```
TGT_LANG=de
MUSTC_ROOT=data/mustc/
mkdir -p $MUSTC_ROOT
```
- Download the MuST-C v1.0 archive `MUSTC_v1.0_en-de.tar.gz` to the `$MUSTC_ROOT` path, and uncompress it:
```
cd $MUSTC_ROOT
tar -xzvf MUSTC_v1.0_en-de.tar.gz
```
- Return to the root directory and run the preprocess script `preprocess.sh`, which will perform forced alignment and organize the raw data and alignment information into `.tsv` format for later use:
```
sh preprocess.sh $TGT_LANG
```
- Finally, the `$MUSTC_ROOT` directory should look like this:
```
.
├── en-de
│   ├── config_raw.yaml
│   ├── data
│   ├── dev_raw_seg_plus.tsv
│   ├── docs
│   ├── segment
│   ├── spm_unigram10000_raw.model
│   ├── spm_unigram10000_raw.txt
│   ├── spm_unigram10000_raw.vocab
│   ├── train_raw_seg_plus.tsv
│   ├── tst-COMMON_raw_seg_plus.tsv
│   └── tst-HE_raw_seg_plus.tsv
└── MUSTC_v1.0_en-de.tar.gz
```
If you want to use an external MT corpus, please first pretrain an MT model on this corpus following these steps:
- Perform BPE on the external corpus with the sentencepiece model learned on MuST-C. As mentioned in our paper, we use WMT for En-De, En-Fr, En-Ru, En-Es, En-Ro, and OPUS100 for En-Pt, En-It, En-Nl as external corpora. You can download them from the internet and put them in the `data/ext_en${TGT_LANG}/` directory. Run the following command, replacing `$input_file` with the path of the raw text, to perform BPE. You should apply BPE to the texts in both the source and target languages of all subsets (train/valid/test), for example with a loop like the one sketched after the command:
```
python3 data/scripts/apply_spm.py --input-file $input_file --output-file $output_file --model data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.model
```
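A minimal sketch of such a loop, assuming the external corpus is stored as plain-text files named `{train,valid,test}.{en,$TGT_LANG}` under `data/ext_en${TGT_LANG}/` (adjust the file names to your layout, and make sure the output names match the prefixes you pass to `fairseq-preprocess` in the next step):

```
# apply the MuST-C sentencepiece model to every split and language (assumed file layout)
for split in train valid test; do
    for lang in en $TGT_LANG; do
        python3 data/scripts/apply_spm.py \
            --input-file data/ext_en${TGT_LANG}/${split}.${lang} \
            --output-file data/ext_en${TGT_LANG}/${split}.spm.${lang} \
            --model data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.model
    done
done
```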
- Use the `fairseq-preprocess` command to convert the BPE texts into fairseq binary format. Make sure to use the sentencepiece dictionary learned on MuST-C:
```
spm_dict=data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.txt
fairseq-preprocess --source-lang en --target-lang $TGT_LANG --trainpref data/ext_en${TGT_LANG}/train --validpref data/ext_en${TGT_LANG}/valid --testpref data/ext_en${TGT_LANG}/test --destdir data/ext_en${TGT_LANG}/binary --joined-dictionary --srcdict $spm_dict --tgtdict $spm_dict --workers=20 --nwordssrc 10000 --nwordstgt 10000
```
- Train the model using the following command:
```
sh pretrain_mt_ext.sh $TGT_LANG
```
- Run the following script to pretrain the MT module. The argument `--load-pretrained-mt-encoder-decoder-from` indicates the path of the MT model pretrained on the external corpus, obtained in the last step.
```
sh pretrain_mt.sh $TGT_LANG
```
- To ensure consistent performance, we have released our checkpoints of pretrained MT modules. You can download them and directly use them to initialize the MT module in our model for the following experiments.
- Download the pretrained wav2vec 2.0 model from the official link, and put it in the `checkpoints/` directory (a possible download command is sketched below).
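One way to fetch it, assuming the base (no fine-tuning) wav2vec 2.0 checkpoint from the fairseq release is the required variant; verify against the training script before use:

```
mkdir -p checkpoints/
# wav2vec 2.0 base checkpoint released with fairseq (assumed variant)
wget -P checkpoints/ https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt
```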
- Just run the training script:
```
sh train.sh $TGT_LANG
```
- Run the following script to average the last 10 checkpoints and evaluate on the `tst-COMMON` set:
```
sh test.sh mustc_en${TGT_LANG}_stmm_self_learning $TGT_LANG
```
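`test.sh` handles the averaging for you; if you want to average checkpoints manually, fairseq ships `scripts/average_checkpoints.py`. A minimal sketch, assuming the checkpoints are saved under `checkpoints/mustc_en${TGT_LANG}_stmm_self_learning/` (path assumed from the experiment name above):

```
# average the last 10 epoch checkpoints into a single model file
python3 scripts/average_checkpoints.py \
    --inputs checkpoints/mustc_en${TGT_LANG}_stmm_self_learning \
    --num-epoch-checkpoints 10 \
    --output checkpoints/mustc_en${TGT_LANG}_stmm_self_learning/avg_last_10_checkpoint.pt
```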
- We have also released the checkpoints of our trained models. You can download and evaluate them directly.
If this repository is useful for you, please cite as:
```
@inproceedings{fang-etal-2022-STEMM,
    title = {STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation},
    author = {Fang, Qingkai and Ye, Rong and Li, Lei and Feng, Yang and Wang, Mingxuan},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
    year = {2022},
}
```
If you have any questions, feel free to contact me at [email protected].