# LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

The training code is released in xtuner, and more details will be added in the near future. Thank you for your attention!

## Introduction

We introduce LLaST, a framework for building high-performance speech-to-text translation systems based on large language models. We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes an LLM-based speech translation architecture, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. It demonstrates superior performance on the CoVoST-2 benchmark and shows exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of LLM-based speech translation frameworks.
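To make the dual-LoRA idea concrete: separate low-rank adapters are attached to the speech encoder and to the LLM, so each side is tuned with its own LoRA parameters while the pretrained weights stay frozen. The snippet below is a minimal sketch using Hugging Face `peft` and `transformers`; the model names, ranks, and target modules are illustrative assumptions, not the exact LLaST configuration.

```python
# Minimal dual-LoRA sketch (illustrative, not the exact LLaST setup):
# one LoRA adapter on the Whisper encoder, another on the LLM.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, WhisperModel

speech_encoder = WhisperModel.from_pretrained("openai/whisper-large-v2").encoder
llm = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Two independent adapters: each module can be tuned at its own capacity.
encoder_lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llm_lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")

speech_encoder = get_peft_model(speech_encoder, encoder_lora)
llm = get_peft_model(llm, llm_lora)
```

Only the LoRA weights on the two sides are updated during training; the pretrained encoder and LLM remain frozen, which keeps the dual-adapter setup cheap to optimize.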

## Model List

| Model | Speech Encoder | LLM | HuggingFace | ModelScope |
| --- | --- | --- | --- | --- |
| LLaST-2B | Whisper-Large | TinyLlama | TBD | TBD |
| LLaST-8B | Whisper-Large | Llama2-7B-Instruct | TBD | TBD |

## Training LLaST

### Data Preparation

- Download data from CommonVoice.

- Prepare the tsv data as follows:

  ```text
  covost2/tsv
  ├── covost_v2.de_en.dev.tsv
  ├── covost_v2.de_en.test.tsv
  ```

- Prepare the multilingual data as follows:

  ```text
  covost/audio
  ├── de
  ├── en
  ├── es
  ├── fr
  ├── it
  ├── ja
  └── zh-CN
  ```

- Prepare the audio data as follows (a resampling sketch follows this list):

  ```text
  covost2/audio/fr/clips_16k
  ├── common_voice_fr_20241860.wav
  ├── common_voice_fr_20241864.wav
  ├── common_voice_fr_20241868.wav
  ├── common_voice_fr_20241872.wav
  └── common_voice_fr_20241875.wav
  ```
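The `clips_16k` directory name indicates the CommonVoice mp3 clips are converted to 16 kHz wav files before training, which matches what Whisper-style encoders expect as input. Below is a minimal, hypothetical preprocessing sketch for that conversion using `torchaudio`; the source/destination paths are assumptions based on the trees above, and the actual preprocessing script is not part of this README.

```python
# Hypothetical preprocessing sketch: convert CommonVoice mp3 clips to
# 16 kHz mono wav files under clips_16k/, matching the layout above.
# Requires torchaudio with an mp3-capable backend (e.g. ffmpeg).
from pathlib import Path

import torchaudio
import torchaudio.functional as F

SRC = Path("covost2/audio/fr/clips")      # original CommonVoice mp3 clips (assumed)
DST = Path("covost2/audio/fr/clips_16k")  # 16 kHz wav output, as in the tree above
TARGET_SR = 16_000

DST.mkdir(parents=True, exist_ok=True)
for mp3 in sorted(SRC.glob("*.mp3")):
    wav, sr = torchaudio.load(str(mp3))      # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)      # downmix to mono
    if sr != TARGET_SR:
        wav = F.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
    torchaudio.save(str(DST / (mp3.stem + ".wav")), wav, TARGET_SR)
```

For the tsv side, the CoVoST-2 files produced by the official covost scripts are tab-separated (typically with `path`, `sentence`, `translation`, and `client_id` columns), so `pandas.read_csv(..., sep="\t")` is enough to inspect them.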

### Training with XTuner

1. Install xtuner:

   ```shell
   git clone git@github.com:ChenX17/xtuner.git
   cd xtuner
   git checkout add_llast
   ```
2. Start training:

   ```shell
   export XTUNER_DATASET_TIMEOUT=120
   export HF_EVALUATE_OFFLINE=1
   export HF_DATASETS_OFFLINE=1
   export TRANSFORMERS_OFFLINE=1
   python xtuner/tools/train.py workspace/configs/llast_2b_tinyllama_chat.py --deepspeed deepspeed_zero2
   ```

## Evaluation

```shell
export HF_EVALUATE_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
python xtuner/tools/test.py workspace/configs/llast_2b_tinyllama_chat.py --checkpoint work_dir/xxxx/epoch_1.pth/mp_rank_00_model_states.pt --launcher slurm
```

## Citation

```bibtex
@inproceedings{chen2024llast,
  title     = {LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models},
  author    = {Chen, Xi and Zhang, Songyang and Bai, Qibing and Chen, Kai and Nakamura, Satoshi},
  booktitle = {Findings of the Association for Computational Linguistics (ACL)},
  year      = {2024}
}
```

## Acknowledgement