Train Opus-MT models

This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. Note that the make targets require a specific environment and currently only work well on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install

Look into lib/env.mk and adjust any settings that you need in your environment. For CSC users: adjust lib/env/puhti.mk and lib/env/mahti.mk to match your setup (especially the locations where Marian-NMT and other tools are installed and the CSC project that you are using).
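Before editing the environment files, it can help to see which tools are already visible in your shell. The following is only a rough sketch: the function name missing_tools is hypothetical, the tool list is an assumption based on the tools named in this README, and the check only looks at PATH, so tools configured through explicit locations in lib/env.mk will not be detected.

```shell
# Hypothetical helper (not part of OPUS-MT-train): report commands that are
# not found on PATH and return non-zero if at least one is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      rc=1
    fi
  done
  return $rc
}

# Assumed tool names; extend or change the list to match your setup.
missing_tools marian pigz jq terashuf || echo "adjust lib/env.mk or your PATH before training"
```

A tool reported as missing here is not necessarily a problem, as long as its location is set correctly in the environment files.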

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
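The three commands above repeat the same language settings, so one way to avoid typos is to define the languages once and generate the commands in a loop. This is a minimal sketch, not part of the package: the helper name opusmt_cmd is hypothetical, and the script only prints the commands (a dry run) rather than starting any training.

```shell
#!/bin/sh
# Language settings taken from the example above.
SRCLANGS="fi et"
TRGLANGS="da sv en"

# Hypothetical helper: print the make command for one pipeline step.
# $1 is the make target (train, eval or release).
opusmt_cmd() {
  printf 'make SRCLANGS="%s" TRGLANGS="%s" %s\n' "$SRCLANGS" "$TRGLANGS" "$1"
}

# Dry run: print the commands in the order they should be executed.
for step in train eval release; do
  opusmt_cmd "$step"
done
```

To execute the steps instead of printing them, the loop body could be replaced with `eval "$(opusmt_cmd "$step")" || break`, which stops at the first step that fails.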

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please cite the following papers if you use OPUS-MT software and models:

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
}

Acknowledgements

None of this would be possible without all the great open source software including

GNU/Linux tools
Marian-NMT
eflomal

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the University of Helsinki and CSC (IT Center for Science), the funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to the open collection of parallel corpora OPUS.

