Skip to content

Latest commit

 

History

History
70 lines (52 loc) · 5.51 KB

README.md

File metadata and controls

70 lines (52 loc) · 5.51 KB

Biomed-NMT-EngPol

Repository containing source code, data and results for paper "A comparison of data filtering techniques for English-Polish neural machine translation in the biomedical domain", started as final project for Advanced Natural Language Processing and Deep Learning, course at IT University of Copenhagen and further developed for publication.

Experiments were run on LUMI supercomputer and full report can be found here: paper_report.pdf.

Authors

The authors of the projects are the following:

Title

"A comparison of data filtering techniques for English-Polish neural machine translation in the biomedical domain"

Abstract

Large Language Models (LLMs) have become the state-of-the-art in Neural Machine Translation (NMT), often trained on massive bilingual parallel corpora scraped from the web, which contain low-quality entries and redundant information, leading to overall significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness largely varies based on specific language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques—LASER, MUSE, and LaBSE—on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created varying dataset sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric on the Khresmoi dataset, having quality of translations assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend the use of LASER, as it consistently outperforms the other methods and provides the most fluent and natural-sounding translations.

Methods & Results

We employed three widely used multilingual language-agnostic embedding models — LASER, MUSE, LaBSE — to filter and subset the medical-domain corpus using cosine similarity. Based on these scores we kept 20% ($\sim$ 150k pairs) and 60% ($\sim$ 420k pairs) of the best scoring sentences for each method. These six (two sizes per method) filtered datasets were used to create fine-tuned mBART50 models.

To allow us to compare the effect the effect of filtering, a baseline model was trained on the full unfiltered dataset (Base-all), and further models were trained on three randomly selected 20% and 60% subsets of the data (Base-20% and Base-60%) that is, three models trained for every sample size whose evaluation performance will be averaged.

Best performing model can be downloaded in this link.

Evaluation was conducted on an independent dataset using SacreBLEU method for the BLEU metric. Results:

Repository structure overview

These files are present in root folder:

The following folders are present in the repo:

Contains all light-weight files used in the project, like evaluaiton results and filtered ids from original corpus.

This folder contains all scripts, logs and slurm jobs executed on LUMI supercomputer during experimentation.

The following scripts are worth mentioning:

  • tokenize_dataset.py: script to generate tokenized splits from different sources: train, validation and test sets.
  • train.py: script to execute training of mBART50
  • evaluate.py: script to execute evaluation on test set

This folder hosts Python scripts and notebooks run locally used for several tasks of the project.