Visual Speech Recognition for Multiple Languages

Authors

Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic.

Update

2023-07-26: We have released our training recipe for real-time AV-ASR, see here.

2023-06-16: We have released our training recipe for AutoAVSR, see here.

2023-03-27: We have released our AutoAVSR models for LRS3, see here.

Introduction

This is the repository for Visual Speech Recognition for Multiple Languages, the successor to End-to-End Audio-Visual Speech Recognition with Conformers. With this repository, you can achieve WERs of 19.1%, 1.0%, and 0.9% for visual, audio, and audio-visual speech recognition (VSR, ASR, and AV-ASR), respectively, on LRS3.

Tutorial

We provide a Colab tutorial showing how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, or extract visual speech features.

Demo

Demo videos: English -> Mandarin -> Spanish, and French -> Portuguese -> Italian.

Preparation

  1. Clone the repository and enter it:
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
  2. Set up the environment:
conda create -y -n autoavsr python=3.8
conda activate autoavsr
  3. Install PyTorch, torchvision, and torchaudio by following the instructions here, then install the remaining packages:
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
  4. Download and extract a pre-trained model and/or language model from the model zoo to:
  • ./benchmarks/${dataset}/models

  • ./benchmarks/${dataset}/language_models

  5. [For VSR and AV-ASR] Install the RetinaFace or MediaPipe face tracker; example commands follow below.
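
For example, assuming the LRS3 benchmark, the target directories in step 4 can be created before extracting the downloaded archives (the archive name below is a placeholder, not an actual file name):

mkdir -p ./benchmarks/LRS3/models ./benchmarks/LRS3/language_models
unzip <downloaded_model>.zip -d ./benchmarks/LRS3/models

For step 5, the MediaPipe tracker can be installed from PyPI (RetinaFace ships with its own installation instructions):

pip install mediapipe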

Benchmark evaluation

python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
  • [config_filename] is the model configuration path, located in ./configs.

  • [labels_filename] is the labels path, located in ${lipreading_root}/benchmarks/${dataset}/labels.

  • [data_dir] and [landmarks_dir] are the directories of the original dataset and the corresponding landmarks.

  • gpu_idx=-1 can be added to switch from cuda:0 to cpu, as in the example below.
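
As an illustration, a CPU-only evaluation on the LRS3 benchmark might look like the following; the config and label file names here are placeholders, so substitute the actual files under ./configs and ./benchmarks/LRS3/labels:

python eval.py config_filename=./configs/<lrs3_config>.ini \
               labels_filename=./benchmarks/LRS3/labels/<test_set_labels>.csv \
               data_dir=<path_to_lrs3> \
               landmarks_dir=<path_to_lrs3_landmarks> \
               gpu_idx=-1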

Speech prediction

python infer.py config_filename=[config_filename] data_filename=[data_filename]
  • data_filename is the path to the audio/video file.

  • detector=mediapipe can be added to switch from RetinaFace to the MediaPipe tracker, as in the example below.
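
For example, to transcribe a single video with the MediaPipe tracker (the config and video paths are placeholders):

python infer.py config_filename=./configs/<config>.ini \
                data_filename=<path_to_video>.mp4 \
                detector=mediapipe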

Mouth ROI cropping

python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
  • dst_filename is the path where the cropped mouth video will be saved; an example follows below.
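
For example (both paths are placeholders):

python crop_mouth.py data_filename=<path_to_video>.mp4 dst_filename=<path_to_cropped_video>.mp4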

Model zoo

Overview

We support a number of datasets for speech recognition:

AutoAVSR models

Lip Reading Sentences 3 (LRS3)

| Components      | WER  | url                                   | size (MB) |
|-----------------|------|---------------------------------------|-----------|
| Visual-only     | 19.1 | GoogleDrive or BaiduDrive (key: dqsy) | 891       |
| Audio-only      | 1.0  | GoogleDrive or BaiduDrive (key: dvf2) | 860       |
| Audio-visual    | 0.9  | GoogleDrive or BaiduDrive (key: sai5) | 1540      |
| Language models | -    | GoogleDrive or BaiduDrive (key: t9ep) | 191       |
| Landmarks       | -    | GoogleDrive or BaiduDrive (key: mi3c) | 18577     |

VSR for multiple languages models

Lip Reading Sentences 2 (LRS2)

| Components      | WER  | url                                   | size (MB) |
|-----------------|------|---------------------------------------|-----------|
| Visual-only     | 26.1 | GoogleDrive or BaiduDrive (key: 48l1) | 186       |
| Language models | -    | GoogleDrive or BaiduDrive (key: 59u2) | 180       |
| Landmarks       | -    | GoogleDrive or BaiduDrive (key: 53rc) | 9358      |

Lip Reading Sentences 3 (LRS3)

| Components      | WER  | url                                   | size (MB) |
|-----------------|------|---------------------------------------|-----------|
| Visual-only     | 32.3 | GoogleDrive or BaiduDrive (key: 1b1s) | 186       |
| Language models | -    | GoogleDrive or BaiduDrive (key: 59u2) | 180       |
| Landmarks       | -    | GoogleDrive or BaiduDrive (key: mi3c) | 18577     |

Chinese Mandarin Lip Reading (CMLR)

| Components      | CER | url                                   | size (MB) |
|-----------------|-----|---------------------------------------|-----------|
| Visual-only     | 8.0 | GoogleDrive or BaiduDrive (key: 7eq1) | 195       |
| Language models | -   | GoogleDrive or BaiduDrive (key: k8iv) | 187       |
| Landmarks       | -   | GoogleDrive or BaiduDrive (key: 1ret) | 3721      |

CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)

| Components                  | WER  | url                                   | size (MB) |
|-----------------------------|------|---------------------------------------|-----------|
| Visual-only (Spanish)       | 44.5 | GoogleDrive or BaiduDrive (key: m35h) | 186       |
| Visual-only (Portuguese)    | 51.4 | GoogleDrive or BaiduDrive (key: wk2h) | 186       |
| Visual-only (French)        | 58.6 | GoogleDrive or BaiduDrive (key: t1hf) | 186       |
| Language model (Spanish)    | -    | GoogleDrive or BaiduDrive (key: 0mii) | 180       |
| Language model (Portuguese) | -    | GoogleDrive or BaiduDrive (key: l6ag) | 179       |
| Language model (French)     | -    | GoogleDrive or BaiduDrive (key: 6tan) | 179       |
| Landmarks                   | -    | GoogleDrive or BaiduDrive (key: vsic) | 3040      |

GRID

| Components               | WER | url                                   | size (MB) |
|--------------------------|-----|---------------------------------------|-----------|
| Visual-only (Overlapped) | 1.2 | GoogleDrive or BaiduDrive (key: d8d2) | 186       |
| Visual-only (Unseen)     | 4.8 | GoogleDrive or BaiduDrive (key: ttsh) | 186       |
| Landmarks                | -   | GoogleDrive or BaiduDrive (key: 16l9) | 1141      |

You can include data_ext=.mpg in your command line to match the video file extension in the GRID dataset.
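
For example, appended to the benchmark evaluation command:

python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir] \
               data_ext=.mpg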

Lombard GRID

| Components                        | WER | url                                   | size (MB) |
|-----------------------------------|-----|---------------------------------------|-----------|
| Visual-only (Unseen, Front Plain) | 4.9 | GoogleDrive or BaiduDrive (key: 38ds) | 186       |
| Visual-only (Unseen, Side Plain)  | 8.0 | GoogleDrive or BaiduDrive (key: k6m0) | 186       |
| Landmarks                         | -   | GoogleDrive or BaiduDrive (key: cusv) | 309       |

You can include data_ext=.mov in your command line to match the video file extension in the Lombard GRID dataset.

TCD-TIMIT

| Components               | WER  | url                                   | size (MB) |
|--------------------------|------|---------------------------------------|-----------|
| Visual-only (Overlapped) | 16.9 | GoogleDrive or BaiduDrive (key: jh65) | 186       |
| Visual-only (Unseen)     | 21.8 | GoogleDrive or BaiduDrive (key: n2gr) | 186       |
| Language models          | -    | GoogleDrive or BaiduDrive (key: 59u2) | 180       |
| Landmarks                | -    | GoogleDrive or BaiduDrive (key: bnm8) | 930       |

Citation

If you use the AutoAVSR models or training code, please consider citing the following paper:

@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels}, 
  year={2023},
}

If you use the VSR models for multiple languages, please consider citing the following paper:

@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}

License

The code may be used only for comparative or benchmarking purposes. Code supplied under the License may be used for non-commercial purposes only.

Contact

Pingchuan Ma (pingchuan.ma16[at]imperial.ac.uk)