Preprocessed data for the Workshop on Statistical Machine Translation (WMT), collected from other sources
When reimplementing NMT models, I found that the WMT14/15/16/17 data offered on the WMT homepages is raw, and that it is hard to find processed data that exactly matches what a given paper used. I created this repository to collect the processed WMT data I have come across and that I am confident meets the requirements of the papers I have read.
The data is provided at: http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/
The data is used in these papers:
- Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
- Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
The data is provided at: https://nlp.stanford.edu/projects/nmt/
The data is used in these papers:
- (exactly the data) Thang Luong, Hieu Pham, Christopher D. Manning: Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015: 1412-1421
- (similar to the data) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.
- (similar to the data) Stephan Peitz, Joern Wuebker, Markus Freitag, and Hermann Ney. The RWTH Aachen German-English Machine Translation System for WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014.
- (similar to the data) Tao Lei, Yu Zhang: Training RNNs as Fast as CNNs. CoRR abs/1709.02755 (2017)
WMT 2014 English-to-German data, updated with the News Commentary v10 data
The data is provided at: https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz
The data is used in this tutorial for OpenNMT: http://forum.opennmt.net/t/training-english-german-wmt15-nmt-engine/29
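
The archive linked above can be fetched and unpacked with a few lines of Python. This is only a minimal sketch: the helper names `archive_name` and `download_and_extract` and the `data/wmt15-de-en` target directory are made up for illustration and are not part of this repository or of OpenNMT, and the URL is simply copied from this README, so it may no longer be reachable.

```python
import tarfile
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# URL copied from this README; its availability is an assumption.
WMT15_DE_EN_URL = "https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz"


def archive_name(url: str) -> str:
    """Return the file name at the end of a URL path (hypothetical helper)."""
    return PurePosixPath(urlparse(url).path).name


def download_and_extract(url: str, dest_dir: str) -> Path:
    """Download a .tgz archive into dest_dir, unless it is already there, and unpack it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / archive_name(url)
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with tarfile.open(archive, "r:gz") as tar:
        # Use the safer "data" extraction filter when this Python version has it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest
```

Calling `download_and_extract(WMT15_DE_EN_URL, "data/wmt15-de-en")` would then place the archive and its contents under `data/wmt15-de-en`. Because an existing archive file is reused, delete any partially downloaded file before retrying.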
The WMT17 homepage provides preprocessed data: http://data.statmt.org/wmt17/translation-task/preprocessed/