Data Selection in NMT

This repository, organized according to the FAIR principles, accompanies the experiments described in "Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts".

The paper was accepted on Dec 6, 2021 and published in Feb 2022.

You can read the paper on arXiv, ResearchGate, or the publisher's website.

Abstract

Continuously-growing data volumes lead to larger generic models. Specific use-cases are usually left out, since generic models tend to perform poorly in domain-specific cases. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora, for the task of machine translation. The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set. We then select the top K sentences with the highest similarity score to train a new machine translation system tuned to the specific in-domain data. Our experimental results show that models trained on this in-domain data outperform models trained on generic or a mixture of generic and domain data. That is, our method selects high-quality domain-specific training instances at low computational cost and data size.
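The ranking step described in the abstract can be sketched as follows. This is only an illustration, not the code used in the paper: it assumes sentence embeddings have already been computed, and it assumes each generic sentence is scored by its highest cosine similarity to any in-domain sentence. The function name `select_top_k` and the toy vectors are hypothetical.

```python
import numpy as np

def select_top_k(generic_vecs, domain_vecs, k):
    """Return indices of the k generic sentences closest to the in-domain set.

    Assumption: a generic sentence's score is its maximum cosine
    similarity to any in-domain sentence embedding.
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    g = generic_vecs / np.linalg.norm(generic_vecs, axis=1, keepdims=True)
    d = domain_vecs / np.linalg.norm(domain_vecs, axis=1, keepdims=True)
    sims = g @ d.T                # shape: (n_generic, n_domain)
    scores = sims.max(axis=1)     # best match per generic sentence
    order = np.argsort(-scores, kind="stable")  # highest score first
    return order[:k].tolist(), scores

# Toy 2-d "embeddings" standing in for real sentence vectors.
generic = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
domain = np.array([[1.0, 0.1]])
top, scores = select_top_k(generic, domain, k=2)
print(top)
```

The selected indices would then be used to pull the corresponding source and target sentences out of the parallel generic corpus.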

Data Selection Tool

We also developed a Python tool that streamlines the process of selecting domain-specific data from generic corpora and training a domain-specific machine translation model. Our tool is particularly useful in scenarios where there is a dearth of domain-specific data or only monolingual data is available. Moreover, our tool is flexible and can handle varying sizes of domain-specific data. To learn more about this tool, please visit our GitHub repository at https://github.com/JoyeBright/DataSelection-NMT/tree/main/Tools_DS.

Our Pre-trained models on Hugging Face

System	Link	System	Link
Top1	Download	Top1	Download
Top2+Top1	Download	Top2	Download
Top3+Top2+...	Download	Top3	Download
Top4+Top3+...	Download	Top4	Download
Top5+Top4+...	Download	Top5	Download
Top6+Top5+...	Download	Top6	Download

Note: Git LFS bandwidth for a personal account is limited to 1 GB/month. If you are unable to download the models, follow this link.

How to use

Note: we ported the best checkpoints of our trained models to Hugging Face (HF). Because the models were trained with OpenNMT-py, they cannot be used directly for inference on HF. To work around this, we use CTranslate2, an inference engine for Transformer models.

Follow the steps below to translate your sentences:

1. Install the Python package:

pip install --upgrade pip
pip install ctranslate2

2. Download the models from our HF repository. You can do this manually or use the following Python script:

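import requests

url = "Download Link"      # replace with the model's download URL
model_path = "Model Path"  # replace with the local destination path
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop on HTTP errors instead of saving an error page
with open(model_path, "wb") as f:
    f.write(r.content)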

3. Convert the downloaded model:

ct2-opennmt-py-converter --model_path model_path --output_dir output_directory

4. Translate tokenized inputs:

Note: the inputs should be tokenized with SentencePiece. You can also use the tokenized version of the IWSLT test sets.

import ctranslate2
translator = ctranslate2.Translator("output_directory/")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])

or

import ctranslate2
translator = ctranslate2.Translator("output_directory/")
# batch_type is either "examples" or "tokens"
translator.translate_file(input_file, output_file, batch_type="examples")

To customize the CTranslate2 calls, see the CTranslate2 API documentation.
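For a quick look at a single translation without writing files, the output pieces can be joined back into text by hand. The sketch below assumes the default SentencePiece word-boundary marker "▁"; the helper name `join_pieces` is hypothetical, and this shortcut does not replace decoding with the actual SentencePiece model.

```python
def join_pieces(pieces):
    """Join SentencePiece pieces into plain text.

    Assumption: the default '▁' marker denotes the start of a word.
    """
    return "".join(pieces).replace("▁", " ").strip()

# The same pieces used as input in the translate_batch example.
print(join_pieces(["▁H", "ello", "▁world", "!"]))
```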

5. Detokenize the outputs:

Note: you need to detokenize the output with the same SentencePiece model that was used in step 4.

tools/detokenize.perl -no-escape -l fr \
< output_file \
> output_file.detok

6. Remove the @@ tokens:

sed -E 's/(@@)|(@@ )|(@@ ?$)//g' output_file.detok \
> output_file.detok.postprocessed

Use grep to check that the @@ tokens were removed (no output means none remain):

grep @@ output_file.detok.postprocessed
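If you prefer to stay in Python, the marker removal and the check in step 6 can be approximated as below. This is a minimal sketch rather than part of the original pipeline; the function names are hypothetical, and the pattern removes each "@@" together with one following space, if present.

```python
import re

def remove_bpe_markers(line):
    """Remove '@@' markers, along with a single following space if present."""
    return re.sub(r"@@ ?", "", line)

def has_markers(lines):
    """Return True if any line still contains '@@' (the grep check)."""
    return any("@@" in line for line in lines)

sample = ["new@@ s@@ paper is good@@", "no markers here"]
cleaned = [remove_bpe_markers(line) for line in sample]
print(cleaned)
print(has_markers(cleaned))
```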

Authors

Javad Pourmostafa - Email, Website
Dimitar Shterionov - Email, Website
Pieter Spronck - Email, Website

Cite the paper

If you find this repository helpful, feel free to cite our publication:

@article{PourmostafaRoshanSharami_Shterionov_Spronck_2021,
title={Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts},
volume={11},
url={https://www.clinjournal.org/clinj/article/view/137},
journal={Computational Linguistics in the Netherlands Journal},
author={Pourmostafa Roshan Sharami, Javad and Shterionov, Dimitar and Spronck, Pieter},
year={2021},
month={Dec.},
pages={213--230} }
