[Paper] [Data] [Model] | ACL 2024, Oral
This work proposes a generative paradigm for translation tasks that leverages LLMs to generate higher-quality translations based on the N-best hypotheses decoded from a foundation model (e.g., SeamlessM4T-Large-V2). We also release the HypoTranslate dataset to support LLM finetuning, which contains over 592K pairs of N-best hypotheses and ground-truth translations in 11 languages. Experiments show that our GenTranslate significantly outperforms the state-of-the-art SeamlessM4T-Large-V2 on various speech and machine translation benchmarks.
Our code is built on lit-gpt; please refer to the official tutorial to set up the conda environment. Then, install the required packages with the following command:

```bash
pip install -r requirements.txt
```
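Putting the environment setup together, a minimal end-to-end sketch is given below. The environment name and Python version are illustrative assumptions, not requirements of this repo; the lit-gpt tutorial remains the authoritative reference.

```bash
# Illustrative setup sketch -- environment name and Python version are assumptions.
conda create -n gentranslate python=3.10 -y
conda activate gentranslate
pip install -r requirements.txt
```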
- For LLMs, please refer to the tutorial for download and conversion, which supports many mainstream LLMs like LLaMA-2 (we use `Llama-2-7b-hf` and `Llama-2-13b-hf` in this work); a download-and-convert sketch is given after this list.
- For well-trained adapter checkpoints, please refer to our HuggingFace repo.
We have released our HypoTranslate dataset at HuggingFace.
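To fetch the dataset locally, one option is the `huggingface_hub` CLI; the repository id and target directory below are placeholders, so substitute the id shown on our HuggingFace page.

```bash
# Placeholder repo id -- replace <hypotranslate-repo-id> with the id listed on our HuggingFace page.
huggingface-cli download <hypotranslate-repo-id> --repo-type dataset --local-dir data/HypoTranslate
```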
We provide a finetuning script `finetune.sh`. Please first open it and specify the following settings:

- `<your-conda-env>`: your conda environment name;
- `dataset`: training data source;
- `srclang`: source language code;
- `tgtlang`: target language code;
- `task`: task id (options: `st`, `mt`);
- `seamless_size`: SeamlessM4T size (options: `large`);
- `data_dir`: data directory where the `.pt` files are placed;
- `llm_dir`: LLaMA checkpoint directory (options: `Llama-2-7b-hf`, `Llama-2-13b-hf`);

NOTE: please use `Llama-2-7b-hf` for x-en and `Llama-2-13b-hf` for en-x. An illustrative set of values is sketched after this note.
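As a minimal sketch, the settings inside `finetune.sh` might be filled in as follows; the concrete values (data source, language pair, paths) are illustrative assumptions, not the only valid choices.

```bash
# Illustrative values only -- adapt to your own data source, language pair, and paths.
dataset=fleurs                        # training data source
srclang=fr                            # source language code
tgtlang=en                            # target language code
task=st                               # st (speech translation) or mt (machine translation)
seamless_size=large                   # SeamlessM4T size
data_dir=data/fleurs_fr_en            # directory containing the .pt files
llm_dir=checkpoints/Llama-2-7b-hf     # Llama-2-7b-hf for x-en (per the note above)
```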
Then, you can start finetuning with:

```bash
bash finetune.sh
```

The trained adapter weights will be saved in `runs/gentrans_{dataset}_{srclang}_{tgtlang}_{task}_{seamless_size}/`.
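For example, with the illustrative settings sketched above, the adapter checkpoint (a `.pth` file, later passed to `infer.sh` as `adapter_path`) would be found under a directory like:

```bash
# Illustrative path derived from the example settings above.
ls runs/gentrans_fleurs_fr_en_st_large/
```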
We provide an inference script `infer.sh`. Please first open it and specify the following settings:

- `<your-conda-env>`: your conda environment name;
- `dataset`: test data source;
- `srclang`: source language code;
- `tgtlang`: target language code;
- `task`: task id (options: `st`, `mt`);
- `seamless_size`: SeamlessM4T size (options: `large`, `largev2`);
- `data_dir`: data directory where the `.pt` files are placed;
- `llm_dir`: LLaMA checkpoint directory (options: `Llama-2-7b-hf`, `Llama-2-13b-hf`);
- `adapter_path`: path to the well-trained adapter checkpoint (`.pth` file);

NOTE: please use `Llama-2-7b-hf` for x-en and `Llama-2-13b-hf` for en-x. An illustrative set of values is sketched after this note.
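As with finetuning, a minimal sketch of the settings inside `infer.sh` is given below; the values and the adapter checkpoint filename are illustrative assumptions and should be adapted to your own test set and checkpoints.

```bash
# Illustrative values only -- adapt to your own test set, language pair, and checkpoint paths.
dataset=fleurs                                              # test data source
srclang=fr                                                  # source language code
tgtlang=en                                                  # target language code
task=st                                                     # st or mt
seamless_size=largev2                                       # large or largev2
data_dir=data/fleurs_fr_en                                  # directory containing the .pt files
llm_dir=checkpoints/Llama-2-7b-hf                           # Llama-2-7b-hf for x-en (per the note above)
adapter_path=runs/gentrans_fleurs_fr_en_st_large/<adapter-checkpoint>.pth   # placeholder filename for the trained adapter
```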
Now, you can run inference on your specified language pair by:

```bash
bash infer.sh
```

You will see the BLEU results of GenTranslate on your specified test set.
If you find this work useful or relevant to your research, please kindly cite the paper below. Thank you.
```bibtex
@inproceedings{hu2024gentranslate,
  title = "GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators",
  author = "Hu, Yuchen and Chen, Chen and Yang, Chao-Han Huck and Li, Ruizhe and Zhang, Dong and Chen, Zhehuai and Chng, Eng Siong",
  booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  publisher = "Association for Computational Linguistics",
  year = "2024"
}
```