
Instruction-Tuned LLMs Succeed in Document-Level MT Without Fine-Tuning—But BLEU Turns a Blind Eye

Resources & Datasets

Paper · WMT22

Models

Vicuna-7B · Vicuna-7B-16K · Vicuna-13B · Vicuna-13B-16K · Mistral-7B

1. Introduction

This repository contains the code and data for our paper, "Instruction-Tuned LLMs Succeed in Document-Level MT Without Fine-Tuning—But BLEU Turns a Blind Eye". Our work explores the ability of instruction-tuned large language models (LLMs) to handle document-level machine translation (docMT) without specialized document-level training. We assess whether instruction-tuned LLMs can translate entire documents in a single pass, producing coherent, context-aware translations that go beyond sentence-level methods.
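
As a minimal sketch (pure Python; the prompt wording is hypothetical, not the exact prompts used in the paper), the two prompting regimes we compare can be built like this:

def doc_level_prompt(document: str, tgt_lang: str = "German") -> str:
    """Single pass: one prompt carrying the full document, so the
    model sees all cross-sentence context at once."""
    return (
        f"Translate the following document into {tgt_lang}, "
        f"preserving coherence across sentences.\n\n{document}"
    )

def sentence_level_prompts(sentences: list[str], tgt_lang: str = "German") -> list[str]:
    """Baseline: one prompt per sentence; outputs are merged afterwards,
    so no cross-sentence context is available during translation."""
    return [
        f"Translate the following sentence into {tgt_lang}.\n\n{s}"
        for s in sentences
    ]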

In contrast to prior studies focusing on sentence-by-sentence translation, we demonstrate that LLMs prompted to translate entire documents at once deliver higher-quality outputs, preserving document-level context and improving coherence. However, traditional n-gram metrics like BLEU fail to reflect this advantage, often favoring sentence-based translations. To address this evaluation gap, we propose an LLM-as-a-judge paradigm, where GPT-4 assesses translations based on coherence, accuracy, and fluency, offering a more nuanced and human-like evaluation.
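
The judging step can be sketched as follows (assuming the openai Python client; the evaluation prompt below is illustrative rather than the paper's exact wording):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_translation(source: str, translation: str) -> str:
    """Ask GPT-4 to rate a document translation on the three axes
    used in the paper: coherence, accuracy, and fluency."""
    prompt = (
        "You are a professional translation evaluator. Rate the following "
        "document-level translation on coherence, accuracy, and fluency, "
        "each on a 1-5 scale, and briefly justify each score.\n\n"
        f"Source document:\n{source}\n\nTranslation:\n{translation}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content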

2. Key Contributions

  • LLM-as-a-Judge Paradigm: We design tailored prompts for GPT-4 to assess document-level translation, capturing aspects of fluency, coherence, and accuracy that traditional metrics overlook.
  • Entire-Document Translation vs. Sentence-Merged Translation: Our experiments show that translating an entire document in one pass yields more coherent and accurate results than translating sentences independently and then merging them, even without fine-tuning for docMT.
  • Evaluation Insights: We recommend against using BLEU for docMT: it fails to capture discourse-level coherence and can produce misleading results in document-level evaluations (see the sketch after this list).
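
To make the BLEU caveat concrete, here is a minimal sketch (assuming sacrebleu is installed; the sentences are toy data, not from our test sets) of the two common ways BLEU is computed for docMT, neither of which rewards discourse-level choices:

import sacrebleu

# Toy example: hypotheses and references, one string per sentence.
hyps = ["He saw the bank.", "Then he crossed the river."]
refs = ["He saw the riverbank.", "Then he crossed the river."]

# Sentence-level corpus BLEU: each sentence scored as its own segment.
sent_bleu = sacrebleu.corpus_bleu(hyps, [refs])

# "Document-level" BLEU as often computed: concatenate, then score once.
doc_bleu = sacrebleu.corpus_bleu([" ".join(hyps)], [[" ".join(refs)]])

print(f"sentence-level BLEU:   {sent_bleu.score:.1f}")
print(f"concatenated-doc BLEU: {doc_bleu.score:.1f}")

Both views match surface n-grams against a single reference, so a translation that correctly resolves a document-level ambiguity can still score lower than one that does not.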

3. Citation

@article{sun2024instruction,
  title={Instruction-Tuned LLMs Succeed in Document-Level MT Without Fine-Tuning--But BLEU Turns a Blind Eye},
  author={Sun, Yirong and Zhu, Dawei and Chen, Yanjun and Xiao, Erjia and Chen, Xinghao and Shen, Xiaoyu},
  journal={arXiv preprint arXiv:2410.20941},
  year={2024}
}

4. Contact

For questions or collaborations, please contact us at [email protected].
