Adaptive-MT-LLM-Fine-tuning

Code and data for the paper Fine-tuning Large Language Models for Adaptive Machine Translation

The paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose large language model (LLM), for adaptive machine translation (MT). Fine-tuning uses a combination of zero-shot and one-shot translation prompts within the medical domain. Zero-shot prompts represent regular translation without any context, while one-shot prompts augment the new source with a similar translation pair, i.e. a fuzzy match, to improve adherence to the terminology and style of the domain. The primary objective is to enhance the real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt translations to the required domain at inference time. Our experiments demonstrate that, with a relatively small dataset of 20,000 segments that incorporate a mix of zero-shot and one-shot prompts, fine-tuning significantly enhances Mistral's in-context learning ability, especially for real-time adaptive MT.

Dependencies

You can install the latest versions of the libraries used; if you run into issues, fall back to the versions listed in the requirements file.

pip3 install -r requirements.txt

Data (training and test)

The original dataset is a mix of medical datasets from OPUS, namely ELRC, EMEA, SciELO, and TICO-19.

Training data (small)

  • Fine-tuning data - small [ES][EN]: Data for actual fine-tuning: 10,000 translation pairs
  • Context Dataset [ES][EN]: Data for fuzzy match retrieval for training: 50,000 translation pairs
  • Retrieved data: Data after retrieval for training: 10,000 entries (format: {score} ||| {fuzzy_src_sent} ||| {new_src_sent} ||| {fuzzy_tgt_sent})

Test Data

  • Test dataset [ES][EN]: Data used for actual inference/translation: 10,000 translation pairs
  • Context Dataset [ES][EN]: Data for fuzzy match retrieval for testing: 50,000 translation pairs
  • Retrieved data: Data after retrieval for testing: 10,000 entries (format: {score} ||| {fuzzy_src_sent} ||| {new_src_sent} ||| {fuzzy_tgt_sent})
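Both retrieved-data files use the four-field line format shown above. As a minimal sketch of reading one such line, the helper below splits on the `|||` delimiter; the names `FuzzyMatch` and `parse_retrieved_line` are hypothetical and do not come from the repository, and the example sentence is made up.

```python
from typing import NamedTuple


class FuzzyMatch(NamedTuple):
    """One retrieved entry (hypothetical container, not from the repository)."""
    score: float
    fuzzy_src: str
    new_src: str
    fuzzy_tgt: str


def parse_retrieved_line(line: str) -> FuzzyMatch:
    """Split a '{score} ||| {fuzzy_src} ||| {new_src} ||| {fuzzy_tgt}' line."""
    parts = [part.strip() for part in line.rstrip("\n").split("|||")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, found {len(parts)}: {line!r}")
    score, fuzzy_src, new_src, fuzzy_tgt = parts
    return FuzzyMatch(float(score), fuzzy_src, new_src, fuzzy_tgt)


# Made-up example line, not taken from the dataset.
example = (
    "0.87 ||| Tome una tableta al día. ||| Tome dos tabletas al día. "
    "||| Take one tablet a day."
)
match = parse_retrieved_line(example)
print(match.score, match.new_src)
```

This assumes no sentence contains the literal `|||` sequence; a line that does would raise a `ValueError` here rather than being silently misparsed.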

Data Processing

Pre-processing of the original OPUS datasets mainly removes duplicates and overly long sentences. The code for data pre-processing is in Data-Processing-Adaptive-MT.ipynb
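The two filters mentioned here can be sketched as follows. This is only an illustration, not the notebook's code: the function name `filter_pairs`, the word-count threshold of 100, whitespace tokenization, and the removal of empty segments are all assumptions.

```python
def filter_pairs(pairs, max_words=100):
    """Drop empty, overly long, and duplicate (source, target) pairs.

    max_words is an assumed threshold; the notebook may use another
    value or a different length criterion.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # assumption: empty segments are discarded too
        if len(src.split()) > max_words or len(tgt.split()) > max_words:
            continue  # too long on either side
        if (src, tgt) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Made-up toy data.
toy_pairs = [
    ("Tome una tableta.", "Take one tablet."),
    ("Tome una tableta.", "Take one tablet."),   # duplicate
    ("uno dos tres cuatro cinco", "one two three four five"),  # too long for max_words=3
]
cleaned = filter_pairs(toy_pairs, max_words=3)
print(cleaned)
```

The first occurrence of each pair is kept, so the original order of the data is preserved.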

Fuzzy Match Retrieval

We use Sentence-Transformers with a multilingual model, namely Microsoft’s “Multilingual-MiniLM-L12-H384”, to generate the embeddings for the datasets. For indexing, we use Faiss. Then we retrieve fuzzy matches through semantic search. You can find more details about the retrieval process in our paper. The code for this fuzzy match retrieval process is in Retrieve-Fuzzy-Matches-Faiss-Adaptive-MT.ipynb
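To make the idea of semantic search concrete, here is a dependency-free sketch that ranks context segments by cosine similarity to a query embedding. It is a brute-force stand-in, not the Sentence-Transformers and Faiss pipeline from the notebook: the three-dimensional vectors are made up, and the function names are hypothetical. An inner-product Faiss index over L2-normalized embeddings would compute the same scores, although whether the notebook uses that index type is not stated here.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, index, k=1):
    """Return the k best (cosine_similarity, position) pairs, best first."""
    q = normalize(query)
    scored = []
    for pos, vec in enumerate(index):
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), pos))
    scored.sort(reverse=True)
    return scored[:k]


# Toy embeddings standing in for the encoded context dataset.
context_embeddings = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
# Toy embedding standing in for the encoded new source sentence.
query_embedding = [0.5, 0.9, 0.0]

best_score, best_pos = top_k(query_embedding, context_embeddings, k=1)[0]
print(best_pos, round(best_score, 4))
```

The position returned for the best match is what would be used to look up the fuzzy source sentence and its translation in the context dataset.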

Fine-tuning Mistral 7B

We used QLoRA for efficient fine-tuning with 4-bit quantization, with Hugging Face Transformers. You can find more details in the paper and in the notebook Mistral-Fine-Tuning-Adaptive-MT.ipynb

Prompts are created in this notebook using the create_prompt() function. If one_shot=False it creates a zero-shot translation prompt; otherwise, it creates a one-shot translation prompt. Please check out the notebook itself for actual examples.
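For readers without the notebook at hand, the difference between the two prompt types can be sketched as below. This is not the notebook's `create_prompt()`: the name `build_prompt`, its signature, the `Spanish:`/`English:` labels, and the exact line layout are assumptions for illustration, and the example sentences are made up.

```python
def build_prompt(new_src, fuzzy_src=None, fuzzy_tgt=None, one_shot=False,
                 src_lang="Spanish", tgt_lang="English"):
    """Build a zero-shot or one-shot translation prompt (hypothetical layout)."""
    lines = []
    if one_shot:
        if fuzzy_src is None or fuzzy_tgt is None:
            raise ValueError("one-shot prompts need a fuzzy source and target")
        # The fuzzy match comes first, as a worked example for the model.
        lines.append(f"{src_lang}: {fuzzy_src}")
        lines.append(f"{tgt_lang}: {fuzzy_tgt}")
    # The new source follows, with the target left open for the model.
    lines.append(f"{src_lang}: {new_src}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


zero_shot_prompt = build_prompt("Tome dos tabletas al día.")
one_shot_prompt = build_prompt(
    "Tome dos tabletas al día.",
    fuzzy_src="Tome una tableta al día.",
    fuzzy_tgt="Take one tablet a day.",
    one_shot=True,
)
print(one_shot_prompt)
```

During fine-tuning the reference translation would be appended after the final label, while at inference time the prompt ends there and the model generates the translation.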

Inference

Conversion to the CTranslate2 format

  • Mistral 7B (baseline): To convert Mistral baseline (before fine-tuning) to the CTranslate2 format:
ct2-transformers-converter --model mistralai/Mistral-7B-v0.1 --quantization int8 --output_dir ct2-mistral-7B-v0.1
  • NLLB-200 distilled 600M: To convert NLLB-200 to the CTranslate2 format:
ct2-transformers-converter --model facebook/nllb-200-distilled-600M --quantization int8 --output_dir ct2/nllb-200-distilled-600M-int8

Tokenizers

!wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model

Translation

Evaluation

Evaluation uses the BLEU, chrF++, TER, and COMET metrics. The code is available in Evaluation-Adaptive-MT.ipynb. The full evaluation scores are in the Results section of the paper, and a detailed version is in Evaluation-Scores-Adaptive-MT.csv

Questions

If you have questions, please feel free to contact me.

Citations

  1. Fine-tuning Large Language Models for Adaptive Machine Translation
@ARTICLE{Moslem2023-Finetuning-LLM-AdaptiveMT,
  title         = "{Fine-tuning Large Language Models for Adaptive Machine Translation}",
  author        = "Moslem, Yasmin and Haque, Rejwanul and Way, Andy",
  month         =  dec,
  year          =  2023,
  url           = "http://arxiv.org/abs/2312.12740",
  archivePrefix = "arXiv",
  primaryClass  = "cs.CL",
  eprint        = "2312.12740"
}
  2. Adaptive Machine Translation with Large Language Models
@INPROCEEDINGS{Moslem2023-AdaptiveMT,
  title     = "{Adaptive Machine Translation with Large Language Models}",
  booktitle = "{Proceedings of the 24th Annual Conference of the European Association
               for Machine Translation}",
  author    = "Moslem, Yasmin and Haque, Rejwanul and Kelleher, John D and Way, Andy",
  publisher = "European Association for Machine Translation",
  pages     = "227--237",
  month     =  jun,
  year      =  2023,
  url       = "https://aclanthology.org/2023.eamt-1.22",
  address   = "Tampere, Finland"
}