1. Kaldi (for the data-preparation scripts): GitHub link
2. ESPnet 0.10.4
3. Modify the ESPnet installation path in the `path.sh` file
```bash
git clone -b 5.4 https://github.com/kaldi-asr/kaldi.git kaldi
cd kaldi/tools; make
cd ../src; ./configure; make depend; make
```
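If the build succeeded, the core binaries should be in place; a quick sanity check (run from the directory containing `kaldi/`, using standard Kaldi source-tree locations):

```bash
# These should both exist and be executable after the build
ls -l kaldi/src/featbin/compute-mfcc-feats kaldi/src/bin/ali-to-phones
```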
```bash
git clone -b v.0.10.4 https://github.com/espnet/espnet.git
cd espnet/tools  # change to the tools folder
ln -s {kaldi_root}  # create a link to Kaldi, e.g. ln -s /home00/lijing/kaldi
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh  # install conda
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda create -y -n new_envo python=3.8.16  # create an environment
conda activate new_envo
```
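With the environment active, ESPnet's dependencies still need to be built inside it. One common route via ESPnet's `tools` directory is sketched below; `setup_python.sh` and the version arguments are assumptions, so check the scripts shipped in your `espnet/tools` checkout:

```bash
cd espnet/tools
# Point ESPnet at the python of the activated conda environment
./setup_python.sh "$(command -v python3)"
# Build ESPnet and its dependencies; pick torch/CUDA versions matching your machine
make TH_VERSION=1.10.1 CUDA_VERSION=11.3
```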
Clone or download this repository and set it as your working directory. Open the `path.sh` file and change the ESPnet directory, e.g. `MAIN_ROOT=/mnt/data/lj/oriange/espnet-master`.
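For reference, the relevant part of a typical ESPnet2 `path.sh` looks roughly like the following; only `MAIN_ROOT` needs editing, and the remaining lines are a sketch of the standard template, not necessarily this repository's exact file:

```bash
MAIN_ROOT=/mnt/data/lj/oriange/espnet-master  # point this at your ESPnet checkout

export PATH=$MAIN_ROOT/utils:$PATH
export PYTHONPATH=$MAIN_ROOT:$PYTHONPATH
export LC_ALL=C

# Standard ESPnet2 recipes also source the python environment set up in tools/:
. $MAIN_ROOT/tools/activate_python.sh
```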
Install Whisper:

```bash
pip install git+https://github.com/openai/whisper.git
```
Load the pretrained models:

```python
import whisper

# Download/cache whichever model sizes you need
whisper.load_model("base.en")
whisper.load_model("small.en")
whisper.load_model("medium.en")
whisper.load_model("large-v2")
```
If you are not interested in Kaldi, or in generating the alignment information yourself, you can skip this data-preparation step and proceed to the next one: we provide the alignment information in [dump](https://pan.baidu.com/s/1ZbTqaC5E8eOzDtEHQg8EKg) (extraction code: 7777). Download the archive first, then put the decompressed files directly under the `pronunciation_whisper-main` main directory.
Download the speechocean762 dataset from speechocean762. Use your own Kaldi ASR model or a public one (e.g., the Librispeech ASR Chain Model we used) and run the Kaldi GOP recipe following its instructions. After the run finishes, you should see `ali_test` and `ali_train` under the `exp` directory; these contain the generated alignment information. Use the following commands to extract the alignments for the training and test sets:
```bash
kaldi_path=your_kaldi_path
cd ${kaldi_path}/egs/speechocean/exp/ali_test
zcat ali-phone.{1..25}.gz > ali-test-phone.txt
cd ${kaldi_path}/egs/speechocean/exp/ali_train
zcat ali-phone.{1..25}.gz > ali-train-phone.txt
```
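Each line of the merged files holds one utterance: an utterance ID followed by per-frame integer phone IDs (the usual `ali-to-phones` text output). A minimal reading sketch, assuming that format:

```python
def read_alignments(path):
    """Map utterance ID -> list of per-frame integer phone IDs."""
    alignments = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:  # skip blank lines
                alignments[fields[0]] = [int(p) for p in fields[1:]]
    return alignments

train_ali = read_alignments("ali-train-phone.txt")
print(len(train_ali), "training utterances")
```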
The other files can remain unchanged and be used directly.
1. Before running, copy the corresponding espnet2 files into the matching directories of your `espnet/espnet2` installation:
```bash
espnet_path=your_espnet_path
cd pronunciation_whisper-main
cp -r espnet2/asr/encoder/whisper_encoder_gop.py ${espnet_path}/espnet2/asr/encoder
cp -r espnet2/asr/espnet_gop_multitask_model_whisper_adapter.py ${espnet_path}/espnet2/asr
cp -r espnet2/bin/gop_whisper_adapter.py ${espnet_path}/espnet2/bin
cp -r espnet2/tasks/gop_whisper_adapter.py ${espnet_path}/espnet2/tasks
cp -r espnet2/train/trainer_gop.py ${espnet_path}/espnet2/train
```
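A quick check that every file landed where ESPnet expects it (the paths are exactly those used above):

```bash
for f in asr/encoder/whisper_encoder_gop.py \
         asr/espnet_gop_multitask_model_whisper_adapter.py \
         bin/gop_whisper_adapter.py \
         tasks/gop_whisper_adapter.py \
         train/trainer_gop.py; do
  test -f "${espnet_path}/espnet2/${f}" && echo "ok: ${f}" || echo "MISSING: ${f}"
done
```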
Run the following script:

```bash
bash run_lm_multi_whisper_three_adapter.sh
```

Results, the best model, and training logs will be saved under the `exp_whisper/` directory.