Official implementation of the WACV 2024 paper "CVTHead: One-shot Controllable Head Avatar with Vertex-feature Transformer".
Efficient and controllable head avatar generation from a single image with point-based neural rendering.
Reconstructing personalized animatable head avatars has significant implications for AR/VR. Existing methods that achieve explicit face control with 3D Morphable Models (3DMM) typically rely on multi-view images or videos of a single subject, which makes the reconstruction process complex. Additionally, the traditional rendering pipeline is time-consuming, limiting real-time animation possibilities. In this paper, we introduce CVTHead, a novel approach that generates controllable neural head avatars from a single reference image using point-based neural rendering. CVTHead treats the sparse vertices of the mesh as the point set and employs the proposed Vertex-feature Transformer to learn local feature descriptors for each vertex. This enables the modeling of long-range dependencies among all the vertices. Experimental results on the VoxCeleb dataset demonstrate that CVTHead achieves performance comparable to state-of-the-art graphics-based methods. Moreover, it enables efficient rendering of novel human heads with various expressions, head poses, and camera views. These attributes can be explicitly controlled using the coefficients of 3DMMs, facilitating versatile and realistic animation in real-time scenarios.
conda create -n cvthead python=3.9
conda activate cvthead
pip install -r requirements.txt
cd data/
bash fetch_data.sh
cd ..
Please go to data/README.md for more details.
Download our pre-trained model cvthead.pt from Google Drive and put it under the data/ folder.
Here is a demo of using CVTHead for cross-identity face reenactment
python inference.py --src_pth examples/1.png --drv_pth examples/2.png --out_pth examples/output.png --ckpt_pth data/cvthead.pt
Here is a demo for face generation under the control of FLAME coefficients
python inference.py --src_pth examples/1.png --out_pth examples --ckpt_pth data/cvthead.pt --flame
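To run the demos over several images, the two command lines above can be assembled programmatically. The following is a minimal sketch that assumes only the flags shown above (`--src_pth`, `--drv_pth`, `--out_pth`, `--ckpt_pth`, `--flame`); the helper names `build_inference_cmd` and `run` are hypothetical and not part of this repository.

```python
import subprocess
import sys
from typing import List, Optional


def build_inference_cmd(src: str, out: str, ckpt: str = "data/cvthead.pt",
                        drv: Optional[str] = None, flame: bool = False) -> List[str]:
    """Assemble an inference.py command line (hypothetical helper).

    Only flags that appear in the demo commands are used.
    """
    cmd = [sys.executable, "inference.py", "--src_pth", src]
    if drv is not None:
        # Driving image for cross-identity reenactment.
        cmd += ["--drv_pth", drv]
    cmd += ["--out_pth", out, "--ckpt_pth", ckpt]
    if flame:
        # Generation controlled by FLAME coefficients.
        cmd.append("--flame")
    return cmd


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command, and launch it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Reenactment demo, then FLAME-controlled generation (dry run only).
    run(build_inference_cmd("examples/1.png", "examples/output.png",
                            drv="examples/2.png"))
    run(build_inference_cmd("examples/1.png", "examples", flame=True))
```

With `dry_run=True` the sketch only prints the commands, which makes it easy to check the paths before launching anything.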
- Download videos. Please refer to video-preprocessing to download the videos. In this preprocessing step, we use the original bounding boxes provided with VoxCeleb1, rather than the union of a third-party detector's bounding box and the provided one.
Note: some videos may no longer be available on YouTube. Due to copyright and privacy concerns, I cannot share these face videos.
- Obtain per-frame landmarks with face_alignment.
The data organization should look like this:
--VoxCeleb1
---- vox_video
------ train
-------- xxxx.mp4
-------- ......
------ test
-------- id10280#NXjT3732Ekg#001093#001192.mp4
-------- xxxx.mp4
-------- ......
---- vox_lmks_meta
------ train
-------- xxxx.pkl
-------- ......
------ test
-------- id10280#NXjT3732Ekg#001093#001192.pkl
-------- xxxx.pkl
-------- ......
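Before training, it can help to confirm that every video has a landmark file with the same base name, as the layout above implies. The following is a minimal sketch under that assumption; the function name `find_missing_landmarks` is hypothetical, and the contents of the `.pkl` files are not inspected.

```python
from pathlib import Path
from typing import Dict, List


def find_missing_landmarks(root: str) -> Dict[str, List[str]]:
    """Return, per split, the video base names with no matching .pkl file.

    Assumes <root>/vox_video/<split>/*.mp4 pairs with
    <root>/vox_lmks_meta/<split>/*.pkl by file name.
    """
    root_dir = Path(root)
    missing: Dict[str, List[str]] = {}
    for split in ("train", "test"):
        video_dir = root_dir / "vox_video" / split
        lmk_dir = root_dir / "vox_lmks_meta" / split
        # A missing directory is treated as an empty one.
        videos = {p.stem for p in video_dir.glob("*.mp4")} if video_dir.is_dir() else set()
        lmks = {p.stem for p in lmk_dir.glob("*.pkl")} if lmk_dir.is_dir() else set()
        missing[split] = sorted(videos - lmks)
    return missing


if __name__ == "__main__":
    for split, names in find_missing_landmarks("VoxCeleb1").items():
        print(f"{split}: {len(names)} video(s) without landmarks")
```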
- We find that splitting the training into two separate stages yields more stable training curves. In the first stage, we train the model without the adversarial loss. In the second stage, we continue training the model with all losses described in the paper for a few epochs.
torchrun --standalone --nnodes 1 --nproc_per_node 2 main_stage1.py --config configs/vox1.yaml
torchrun --standalone --nnodes 1 --nproc_per_node 2 main_stage2.py --config configs/vox1.yaml
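The two stages use the same config and run one after the other. The following is a minimal launcher sketch that assumes only the commands shown above; the names `stage_commands`, `run_stages`, and `num_gpus` are hypothetical. `--nproc_per_node` is typically set to the number of GPUs on the machine. How stage 2 picks up the stage-1 weights is determined by the repository's scripts and config, not by this sketch.

```python
import subprocess
from typing import List


def stage_commands(config: str = "configs/vox1.yaml",
                   num_gpus: int = 2) -> List[List[str]]:
    """Build the torchrun command for each training stage, in order."""
    cmds = []
    for script in ("main_stage1.py", "main_stage2.py"):
        cmds.append([
            "torchrun", "--standalone", "--nnodes", "1",
            "--nproc_per_node", str(num_gpus),
            script, "--config", config,
        ])
    return cmds


def run_stages(cmds: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; launch it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops before stage 2 if stage 1 fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_stages(stage_commands(num_gpus=2))
```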
ROME
DECA
Spiralnet++
face-parsing.PyTorch
face-alignment
If you found this code helpful, please consider citing:
@article{ma2023cvthead,
title={CVTHead: One-shot Controllable Head Avatar with Vertex-feature Transformer},
author={Ma, Haoyu and Zhang, Tong and Sun, Shanlin and Yan, Xiangyi and Han, Kun and Xie, Xiaohui},
journal={IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2024}
}