Simplified Chinese | English
By Kai Li [1], Runxuan Yang [1], Fuchun Sun [1], Xiaolin Hu [1,2]
[1] Tsinghua University, [2] Chinese Institute for Brain Research
This repository is the official implementation of IIANet, accepted to ICML 2024 (Poster).
- We propose IIANet, an attention-based cross-modal speech separation network that extensively uses intra-attention (IntraA) and inter-attention (InterA) mechanisms within and across the audio and visual modalities.
- Compared with existing CNN- and Transformer-based methods, IIANet achieves significantly better separation quality on three audio-visual speech separation datasets while greatly reducing computational complexity and memory usage.
- A faster version, IIANet-fast, surpasses CTCNet by 1.1 dB on the challenging LRS2 dataset with only 11% of CTCNet's MACs.
- Qualitative evaluations on real-world YouTube videos show that IIANet produces higher-quality separated speech than other separation models.
- Clone the repository:
git clone https://github.com/JusperLee/IIANet.git
cd IIANet/
- Create and activate the conda environment:
conda create -n iianet python=3.8
conda activate iianet
- Install PyTorch and torchvision following the official instructions. The code requires python>=3.8, pytorch>=1.11, and torchvision>=0.13.
- Install other dependencies:
pip install -r requirements.txt
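After installation, you can optionally run a quick sanity check to confirm that the versions above are met and that a GPU is visible to PyTorch (a minimal sketch, not part of the repository):

```python
# Optional sanity check: verify the installed versions and CUDA availability.
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expect >= 1.11
print("torchvision:", torchvision.__version__)  # expect >= 0.13
print("CUDA available:", torch.cuda.is_available())
```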
We evaluate IIANet and its faster variant, IIANet-fast, on three datasets: LRS2, LRS3, and VoxCeleb2. The results show that IIANet achieves significantly better speech separation quality than existing methods while maintaining high efficiency [1].
| Method | Dataset | SI-SNRi (dB) | SDRi (dB) | PESQ | Params (M) | MACs (G) | GPU Inference Time | Download |
|---|---|---|---|---|---|---|---|---|
| IIANet | LRS2 | 16.0 | 16.2 | 3.23 | 3.1 | 18.6 | 110.11 ms | Config/Model |
| IIANet | LRS3 | 18.3 | 18.5 | 3.28 | 3.1 | 18.6 | 110.11 ms | Config/Model |
| IIANet | VoxCeleb2 | 13.6 | 14.3 | 3.12 | 3.1 | 18.6 | 110.11 ms | Config/Model |
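SI-SNRi and SDRi report the improvement (in dB) of the separated speech over the unprocessed mixture. For reference, below is a minimal NumPy sketch of scale-invariant SNR; it only illustrates the metric and is not the repository's evaluation code:

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR (dB) between an estimated signal and its reference."""
    estimate = estimate - estimate.mean()  # remove DC offset
    target = target - target.mean()
    # Project the estimate onto the target to obtain the "signal" component.
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

# SI-SNRi is the improvement over the raw mixture:
# si_snr(separated, clean) - si_snr(mixture, clean)
```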
For single-video inference, please refer to inference.py.
# Inference on a single video
# You can modify the video path in inference.py
python inference.py
Before starting training, please modify the parameter configurations in configs.
A simple example of training configuration:
data_config:
  train_dir: DataPreProcess/LRS2/tr
  valid_dir: DataPreProcess/LRS2/cv
  test_dir: DataPreProcess/LRS2/tt
  n_src: 1
  sample_rate: 16000
  segment: 2.0
  normalize_audio: false
  batch_size: 3
  num_workers: 24
  pin_memory: true
  persistent_workers: false
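Since the configuration is plain YAML, it can also be inspected programmatically. A minimal sketch, assuming PyYAML is installed and using the config file from the training commands below:

```python
# Minimal sketch: load and inspect the training configuration with PyYAML.
import yaml

with open("configs/LRS2-IIANet.yml") as f:
    conf = yaml.safe_load(f)

data_conf = conf["data_config"]
print(data_conf["train_dir"], data_conf["sample_rate"], data_conf["batch_size"])
```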
Use the following commands to start training:
python train.py --conf_dir configs/LRS2-IIANet.yml
python train.py --conf_dir configs/LRS3-IIANet.yml
python train.py --conf_dir configs/Vox2-IIANet.yml
To evaluate a model on one or more GPUs, set the CUDA_VISIBLE_DEVICES environment variable and specify the dataset, model, and checkpoint (an example of GPU selection follows the commands below):
python test.py --conf_dir checkpoints/lrs2/conf.yml
python test.py --conf_dir checkpoints/lrs3/conf.yml
python test.py --conf_dir checkpoints/vox2/conf.yml
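For example, to run all three evaluations on a single GPU, CUDA_VISIBLE_DEVICES can be set before launching test.py. The sketch below is illustrative; only the checkpoint paths come from the commands above:

```python
# Illustrative wrapper: evaluate each checkpoint on GPU 0 by setting
# CUDA_VISIBLE_DEVICES in the environment passed to test.py.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
for conf_dir in ["checkpoints/lrs2/conf.yml",
                 "checkpoints/lrs3/conf.yml",
                 "checkpoints/vox2/conf.yml"]:
    subprocess.run(["python", "test.py", "--conf_dir", conf_dir], env=env, check=True)
```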
- Validate the effectiveness and robustness of IIANet on larger-scale datasets such as AVSpeech.
- Further optimize the architecture and training strategies of IIANet to improve speech separation quality while reducing computational costs.
- Explore applications of IIANet in other multimodal tasks, such as speech enhancement and speaker recognition.
If you find our work helpful, please consider citing:
@inproceedings{lee2024iianet,
title={IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation},
author={Kai Li and Runxuan Yang and Fuchun Sun and Xiaolin Hu},
booktitle={International Conference on Machine Learning},
year={2024}
}