Implementation of Random Noising (Xie et al., 2018) on Fairseq.
This is a data augmentation method for grammatical error correction (GEC).
First, please install fairseq, either from PyPI or from source. I use v0.12.2.
pip install fairseq==v0.12.2
git clone -b v0.12.2
cd fairseq
pip install -e .
If you want to use fairseq v0.10.2, please use the code at commit 1db9c58.
Then, please install xiebt, either directly with pip or from a local clone.
pip install git+
git clone
cd xiebt
pip install -e .
First, train an error-generation model. You can obtain one by training a GEC model in the reverse direction (target -> source), so that it maps corrected sentences to erroneous ones.
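As a rough sketch of the data side of this step, the two sides of a parallel GEC corpus can be swapped before training. The helper name `reverse_corpus` and the file layout are assumptions, not part of xiebt: it expects `PREFIX.src` to hold erroneous sentences and `PREFIX.trg` to hold corrected ones, and writes the swapped pair so that `src` is the clean side, matching the `--source-lang src` setting used for generation.

```shell
#!/bin/sh
# Hypothetical helper (not part of xiebt or fairseq).
# $1 = input prefix:  $1.src = erroneous text, $1.trg = corrected text
# $2 = output prefix: $2.src = corrected text, $2.trg = erroneous text
reverse_corpus() {
    in_prefix=$1
    out_prefix=$2
    # The two sides must have the same number of lines to stay aligned.
    n_src=$(wc -l < "$in_prefix.src" | tr -d '[:space:]')
    n_trg=$(wc -l < "$in_prefix.trg" | tr -d '[:space:]')
    if [ "$n_src" -ne "$n_trg" ]; then
        echo "line counts differ: $n_src vs $n_trg" >&2
        return 1
    fi
    mkdir -p "$(dirname "$out_prefix")"
    # Swap the sides: the corrected text becomes the new source.
    cp "$in_prefix.trg" "$out_prefix.src"
    cp "$in_prefix.src" "$out_prefix.trg"
}
```

For example, `reverse_corpus gec/train reversed/train` would produce files that can be preprocessed and trained with fairseq in the usual way to obtain the error-generation model.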
Next, run fairseq-preprocess as follows.
fairseq-preprocess \
--source-lang src \
--target-lang trg \
--testpref monolingual_data.txt \
--srcdict dict.src.txt
cp data-bin/dict.src.txt data-bin/dict.trg.txt # fairseq also expects dict.trg.txt. This is fairseq's fault.
Then, run xiebt-generate on the preprocessed data.
xiebt-generate \
data-bin \
--path \
--seed 12345 \
--beam 4 \
--max-tokens 6000 \
--beta-random 8.0
As for --beta-random, Kiyono et al. (2019) use 6.0 and Koyama et al. (2021) use 8.0. The example above is therefore not a recommended back-translation setting. If you want to know the optimal setting for back-translation, you have to search for it yourself.
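One possible way to organize such a search is to generate one noised corpus per candidate value and compare the GEC models trained on each. The sketch below is only an illustration: `sweep_beta`, the model path argument, and the output file names are hypothetical, and it assumes that xiebt-generate writes its output to standard output. The flags are the ones from the example above.

```shell
#!/bin/sh
# Hypothetical sweep over --beta-random (not part of xiebt).
# $1 = path to the error-generation model (placeholder)
# remaining arguments = candidate beta-random values
sweep_beta() {
    model=$1
    shift
    for beta in "$@"; do
        # RUN is empty by default, so xiebt-generate really runs.
        # Set RUN=echo for a dry run that records the command instead.
        ${RUN:-} xiebt-generate data-bin \
            --path "$model" \
            --seed 12345 \
            --beam 4 \
            --max-tokens 6000 \
            --beta-random "$beta" > "noised.beta-$beta.txt"
    done
}
```

For example, `sweep_beta model.pt 4.0 6.0 8.0` would write one output file per value. Which value works best depends on your data and can only be judged from the downstream GEC results.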