# LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

The training code is released in xtuner, and more details will be added in the near future. Thank you for your attention!

## Introduction

We introduce LLaST, a framework for building high-performance speech-to-text translation systems based on large language models. We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes an LLM-based speech translation architecture, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. It demonstrates superior performance on the CoVoST-2 benchmark and shows exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of LLM-based speech translation frameworks.
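To make the dual-LoRA idea concrete: separate low-rank adapters are attached to the speech encoder and to the LLM, so each side is tuned with its own LoRA parameters while the pretrained weights stay frozen. The snippet below is a minimal sketch using Hugging Face `peft` and `transformers`; the model names, ranks, and target modules are illustrative assumptions, not the exact LLaST configuration.

```python
# Minimal dual-LoRA sketch (illustrative, not the exact LLaST setup):
# one LoRA adapter on the Whisper encoder, another on the LLM.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, WhisperModel

speech_encoder = WhisperModel.from_pretrained("openai/whisper-large-v2").encoder
llm = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Two independent adapters: each module can be tuned at its own capacity.
encoder_lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llm_lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")

speech_encoder = get_peft_model(speech_encoder, encoder_lora)
llm = get_peft_model(llm, llm_lora)
```

Only the LoRA weights on the two sides are updated during training; the pretrained encoder and LLM remain frozen, which keeps the dual-adapter setup cheap to optimize.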

## Model List

| Model | Speech Encoder | LLM | HuggingFace | ModelScope |
| --- | --- | --- | --- | --- |
| LLaST-2B | Whisper-Large | TinyLlama | TBD | TBD |
| LLaST-8B | Whisper-Large | Llama2-7B-Instruct | TBD | TBD |

## Training LLaST

### Data Preparation

- Download data from CommonVoice.

- Prepare the tsv data as follows:

  ```text
  covost2/tsv
  ├── covost_v2.de_en.dev.tsv
  ├── covost_v2.de_en.test.tsv
  ```

- Prepare the multilingual data as follows:

  ```text
  covost/audio
  ├── de
  ├── en
  ├── es
  ├── fr
  ├── it
  ├── ja
  └── zh-CN
  ```

- Prepare the audio data as follows (a resampling sketch follows this list):

  ```text
  covost2/audio/fr/clips_16k
  ├── common_voice_fr_20241860.wav
  ├── common_voice_fr_20241864.wav
  ├── common_voice_fr_20241868.wav
  ├── common_voice_fr_20241872.wav
  └── common_voice_fr_20241875.wav
  ```
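The `clips_16k` directory name indicates the CommonVoice mp3 clips are converted to 16 kHz wav files before training, which matches what Whisper-style encoders expect as input. Below is a minimal, hypothetical preprocessing sketch for that conversion using `torchaudio`; the source/destination paths are assumptions based on the trees above, and the actual preprocessing script is not part of this README.

```python
# Hypothetical preprocessing sketch: convert CommonVoice mp3 clips to
# 16 kHz mono wav files under clips_16k/, matching the layout above.
# Requires torchaudio with an mp3-capable backend (e.g. ffmpeg).
from pathlib import Path

import torchaudio
import torchaudio.functional as F

SRC = Path("covost2/audio/fr/clips")      # original CommonVoice mp3 clips (assumed)
DST = Path("covost2/audio/fr/clips_16k")  # 16 kHz wav output, as in the tree above
TARGET_SR = 16_000

DST.mkdir(parents=True, exist_ok=True)
for mp3 in sorted(SRC.glob("*.mp3")):
    wav, sr = torchaudio.load(str(mp3))      # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)      # downmix to mono
    if sr != TARGET_SR:
        wav = F.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
    torchaudio.save(str(DST / (mp3.stem + ".wav")), wav, TARGET_SR)
```

For the tsv side, the CoVoST-2 files produced by the official covost scripts are tab-separated (typically with `path`, `sentence`, `translation`, and `client_id` columns), so `pandas.read_csv(..., sep="\t")` is enough to inspect them.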

### Training with XTuner

1. Install xtuner:

   ```shell
   git clone git@github.com:ChenX17/xtuner.git
   cd xtuner
   git checkout add_llast
   ```
2. Start training:

   ```shell
   export XTUNER_DATASET_TIMEOUT=120
   export HF_EVALUATE_OFFLINE=1
   export HF_DATASETS_OFFLINE=1
   export TRANSFORMERS_OFFLINE=1
   python xtuner/tools/train.py workspace/configs/llast_2b_tinyllama_chat.py --deepspeed deepspeed_zero2
   ```

## Evaluation

```shell
export HF_EVALUATE_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
python xtuner/tools/test.py workspace/configs/llast_2b_tinyllama_chat.py --checkpoint work_dir/xxxx/epoch_1.pth/mp_rank_00_model_states.pt --launcher slurm
```

## Citation

```bibtex
@inproceedings{chen2024llast,
  title     = {LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models},
  author    = {Chen, Xi and Zhang, Songyang and Bai, Qibing and Chen, Kai and Nakamura, Satoshi},
  booktitle = {Findings of the Association for Computational Linguistics (ACL)},
  year      = {2024}
}
```

## Acknowledgement