Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness
Homepage | Arxiv | Demo Video
The code is tested on NVIDIA GeForce RTX 4090 and CUDA Version: 12.2. The environment is as follows:
conda env create -f environment_priormdm.yml
conda activate PriorMDM
pip install bvh librosa essentia pydub praat-parselmouth torchgeometry moviepy matplotlib==3.1.3
pip install smplx[all]
pip install git+https://github.com/openai/CLIP.git
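After installation, a quick import check can catch packages that failed to build. The following is a minimal sketch and not part of the repository. The import names in the list are assumptions based on how these packages are usually imported (for example, praat-parselmouth as parselmouth and the CLIP package as clip).

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above; adjust if yours differ.
REQUIRED_MODULES = [
    "bvh", "librosa", "essentia", "pydub", "parselmouth",
    "torchgeometry", "moviepy", "matplotlib", "smplx", "clip",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Run it inside the activated PriorMDM environment; any name it prints needs to be reinstalled before sampling or training.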
Download the pre-trained model from Google Disk or Baidu Disk and place it in the ./save folder.
python -m sample.double_take --save_dir '' --guidacnce_param 1 --model_name model001000000 --BEAT_wav_feat ./datasets/BEAT/my_wav_feat/ --HUMANML3D_text_feat ./datasets/SMPLX/HumanML3D/v3_HUMANML3D_txt_feat/ --clip_model_path ./data/clip --vis_mode customized_controls
Then you can find the generated video in the ./save/my_v3_0/model001000000 folder.
Video result
positions_vis1_1.0_customized_controls.mp4
You can use the following command to generate the video with audio:
python -m sample.double_take --save_dir '' --guidacnce_param 1 --model_name model001000000 --BEAT_wav_feat ./datasets/BEAT/my_wav_feat/ --HUMANML3D_text_feat ./datasets/SMPLX/HumanML3D/v3_HUMANML3D_txt_feat/ --clip_model_path ./data/clip --vis_mode vis_controls
python -m process.merge_mp4_audio --video_file ./save/my_v3_0/model001000000/positions_vis1_1.0_vis_controls.mp4
Video result with audio
positions_vis1_1.0_vis_controls-with-audio.mp4
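The two commands above can be chained from a small script. The sketch below is a hypothetical wrapper, not part of the repository. It copies the flags exactly as they appear above, and it assumes the output video follows the pattern positions_vis1_1.0_<vis_mode>.mp4, which is inferred only from the two file names shown in this README.

```python
import shlex
import subprocess


def build_commands(vis_mode="vis_controls",
                   model_name="model001000000",
                   run_dir="./save/my_v3_0"):
    """Return the sampling command and the audio-merge command as argument lists."""
    sample_cmd = [
        "python", "-m", "sample.double_take",
        "--save_dir", "",
        "--guidacnce_param", "1",  # flag spelled exactly as in the command above
        "--model_name", model_name,
        "--BEAT_wav_feat", "./datasets/BEAT/my_wav_feat/",
        "--HUMANML3D_text_feat", "./datasets/SMPLX/HumanML3D/v3_HUMANML3D_txt_feat/",
        "--clip_model_path", "./data/clip",
        "--vis_mode", vis_mode,
    ]
    # Assumed naming pattern for the generated video (see the note above).
    video_file = f"{run_dir}/{model_name}/positions_vis1_1.0_{vis_mode}.mp4"
    merge_cmd = ["python", "-m", "process.merge_mp4_audio", "--video_file", video_file]
    return sample_cmd, merge_cmd


def run_pipeline(dry_run=True):
    """Print both commands; execute them in order only when dry_run is False."""
    for cmd in build_commands():
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
```

Calling run_pipeline() only prints the commands; run_pipeline(dry_run=False) executes them from the repository root.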
(Optional) You can use human_body_prior and mdm_motion2smpl.py to generate SMPLX motion (without hands/fingers) from the generated file (note that you need to modify mdm_motion2smpl.py, and that the human_body_prior environment is tested on NVIDIA GeForce RTX 2080 Ti and CUDA Version: 12.2):
python ../human_body_prior/tutorials/mdm_motion2smpl.py --input ./save/my_v3_0/model001000000/result_rec_1.0.npy --output ./save/my_v3_0/model001000000/result_rec_1.0_smplx.npz
You can then use Blender to view the SMPLX motion.
Video result converted to SMPLX
0001-0690.mp4
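Before opening the converted file in Blender, it can help to check what it contains. The sketch below lists every array stored in an .npz archive with its shape and dtype. It makes no assumption about which keys mdm_motion2smpl.py writes, and the helper name summarize_npz is hypothetical.

```python
import numpy as np


def summarize_npz(path):
    """Map each array name in an .npz archive to its (shape, dtype string)."""
    # allow_pickle stays False (NumPy's default), so object arrays raise an
    # error instead of being unpickled.
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

For example, summarize_npz("./save/my_v3_0/model001000000/result_rec_1.0_smplx.npz") returns a dictionary that you can print to verify the frame count and parameter dimensions.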
We provide the Text2Motion text and mapping files in the ./prepare folder.
Download the Text2Motion motion files in SMPLX format from AMASS and place them in the ./datasets/SMPLX/ folder:
python -m prepare.prepare --smplx_folder ./datasets/SMPLX/
cd prepare
unzip texts.zip
python map_index.py --smplx_folder ./datasets/SMPLX/ --processed_motion_path ./datasets/SMPLX/HumanML3D/motion_data/processed/ --processed_text_path ./datasets/SMPLX/HumanML3D/text_data/processed
The total number of text-motion (SMPLX) pairs after processing is 13248.
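To compare your own result against this number, you can count the pairs after processing. The sketch below is an assumption-laden check, not the repository's method: it assumes each processed motion file and its text file share the same file stem and that both folders are flat.

```python
from pathlib import Path


def count_pairs(motion_dir, text_dir):
    """Count file stems that appear in both the motion folder and the text folder."""
    motion_stems = {p.stem for p in Path(motion_dir).iterdir() if p.is_file()}
    text_stems = {p.stem for p in Path(text_dir).iterdir() if p.is_file()}
    return len(motion_stems & text_stems)
```

Calling count_pairs on ./datasets/SMPLX/HumanML3D/motion_data/processed/ and ./datasets/SMPLX/HumanML3D/text_data/processed should give 13248 if those assumptions hold for your files.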
Download the updated BEAT dataset from here.
cd ../process
python BEAT2smplx.py --source_BEAT_path ../datasets/BEAT/beat_english_v0.2.1/ --save_BEAT_smplx_path ../datasets/BEAT/my_smplx
Download WavLM Large and put it in the ./data/wavlm_cache/ folder.
Download SMPL-X Model from here or from 2. Quick start.
# Adjust the orientation of the motion and downsample AMASS dataset
python process_amass.py --source_HumanML3D_motion ../datasets/SMPLX/HumanML3D/motion_data/processed --processed_motion ../datasets/SMPLX/HumanML3D/processed_motion/ --index_path ../prepare/index.csv
# Extract audio/text features and downsample BEAT dataset, split the dataset into train/val/test
bash process_dataset.sh "prepare" "../datasets/BEAT" "../datasets/SMPLX/HumanML3D" "../data/wavlm_cache/WavLM-Large.pt" "../data/clip" "../data/prcocessed_data"
# Convert the motion format of the SMPLX to position, and extract the motion features
bash process_SMPLX.sh "../support_data/dowloads/models/" '../datasets/BEAT/my_downsample' "../datasets/SMPLX/HumanML3D/"
# Generate h5 file and calculate the statistics of the motion
bash process_dataset.sh "generate_h5_file" "../datasets/BEAT" "../datasets/SMPLX/HumanML3D" "../data/wavlm_cache/WavLM-Large.pt" "../data/clip" "../data/prcocessed_data"
After this step, you should get v3_train.h5, v3_mean.npy and v3_std.npy in the ./data/prcocessed_data folder.
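Mean and standard deviation files like these are typically used for z-score normalization of motion features. The sketch below shows that idea under stated assumptions: that v3_mean.npy and v3_std.npy hold per-feature statistics that broadcast against a (frames, features) array, and that normalization is (x - mean) / std. The eps term and all function names are illustrative and not taken from the repository.

```python
from pathlib import Path

import numpy as np


def load_statistics(stats_dir):
    """Load v3_mean.npy and v3_std.npy from the statistics folder."""
    stats_dir = Path(stats_dir)
    mean = np.load(stats_dir / "v3_mean.npy")
    std = np.load(stats_dir / "v3_std.npy")
    return mean, std


def normalize(motion, mean, std, eps=1e-8):
    """Z-score the features; eps (an assumed safeguard) avoids division by zero."""
    return (motion - mean) / (std + eps)


def denormalize(normalized, mean, std, eps=1e-8):
    """Invert normalize() to recover features in their original scale."""
    return normalized * (std + eps) + mean
```

With mean, std = load_statistics("./data/prcocessed_data"), a round trip through normalize and denormalize should return the original features up to floating-point error.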
cd ..
python -m train.train_mdm --save_dir save/my_v3_0 --overwrite --batch_size 256 --n_frames 180 --n_seed 0 --h5file_path ./data/prcocessed_data/v3_train.h5 --statistics_path ./data/prcocessed_data
Then you will get the model in the ./save/my_v3_0 folder.
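To pick the most recent checkpoint for sampling, you can sort the saved files by step number. The sketch below assumes checkpoints are named like model001000000.pt; the model<steps> part matches the --model_name value used in this README, while the .pt extension is an assumption.

```python
import re
from pathlib import Path

# Assumed checkpoint naming: "model" + zero-padded step count + ".pt".
CHECKPOINT_PATTERN = re.compile(r"model(\d+)\.pt")


def latest_checkpoint(save_dir):
    """Return the checkpoint path with the highest step number, or None if none exist."""
    best_step, best_path = -1, None
    for path in Path(save_dir).iterdir():
        match = CHECKPOINT_PATTERN.fullmatch(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The stem of the returned path (for example, latest_checkpoint("./save/my_v3_0").stem) can then be passed as --model_name.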
We noticed that the generated results sometimes show sudden changes in orientation. This may be related to the diversity of character motions in HUMANML3D, and might be mitigated by data preprocessing or a better motion representation.
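One way to locate such frames in a generated sequence is to flag large frame-to-frame changes in the root heading. The sketch below is a hypothetical diagnostic, not something the repository provides. It assumes you have already extracted a per-frame heading angle in radians from the generated motion, and the default threshold of pi/2 is an arbitrary illustrative value.

```python
import math


def wrapped_difference(a, b):
    """Smallest signed angle (radians) that rotates heading a onto heading b."""
    d = b - a
    return math.atan2(math.sin(d), math.cos(d))


def find_orientation_jumps(yaw, threshold=math.pi / 2):
    """Return frame indices whose heading differs from the previous frame by more than threshold."""
    return [
        i for i in range(1, len(yaw))
        if abs(wrapped_difference(yaw[i - 1], yaw[i])) > threshold
    ]
```

Wrapping the difference keeps a heading that crosses from just below pi to just above -pi from being reported as a jump.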
If you find this code useful in your research, please cite:
@inproceedings{
yang2024Freetalker,
title={Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness},
author={Sicheng Yang and Zunnan Xu and Haiwei Xue and Yongkang Cheng and Shaoli Huang and Mingming Gong and Zhiyong Wu},
booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2024},
}
If you have any problems, please raise an issue or contact me at [email protected].