[NeurIPS 2024] On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
This repository is the official implementation of MM-Det [NeurIPS 2024 Poster].
- Install basic packages:

```bash
conda create -n MM_Det python=3.10
conda activate MM_Det
pip install -r requirements.txt
cd LLaVA
pip install -e .
```
- For training, install additional packages:

```bash
cd LLaVA
pip install --upgrade pip
pip install -e ".[train]"
pip install flash-attn==2.5.8 --no-build-isolation
```
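As a quick sanity check that the installs succeeded (a minimal sketch; it assumes the LLaVA repository's package imports as `llava` and that `flash-attn` built correctly):

```bash
# Both imports should succeed inside the MM_Det environment.
python -c "import llava; print('llava OK')"
python -c "import flash_attn; print('flash-attn OK')"
```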
We release Diffusion Video Forensics (DVF) as the benchmark for forgery video detection.

The full version of DVF can be downloaded via the following links: BaiduNetDisk (Code: 296c), Google Drive.

We also release a tiny version of DVF for a quickstart, in which each dataset contains 10 videos, with each video no longer than 100 frames. This tiny version can be downloaded via BaiduNetDisk (Code: 77x3). We also provide the corresponding reconstruction dataset and MM representations for evaluation in the above link. More information for evaluation can be found here.
We provide the weights for our fine-tuned large multi-modal model, which is based on llava-v1.5-Vicuna-7b from LLaVA. The overall weights for MM-Det without the LMM can be obtained via weights at MM-Det/current_model.pth. Please download the weights and put them under ./weights/.
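For reference, a minimal sketch of placing the checkpoint (assuming it keeps its released path MM-Det/current_model.pth):

```bash
mkdir -p ./weights/MM-Det
# Put the downloaded checkpoint here:
#   ./weights/MM-Det/current_model.pth
```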
For the full version of DVF, we provide a pre-computed reconstruction dataset at BaiduNetDisk (Code: l8h4).

For the full version of DVF, we also provide a pre-computed dataset of cached MM forgery representations (MMFR) at BaiduNetDisk (Code: m6uy). Since the representation is fixed during training and inference, it is recommended to cache it before the overall training to reduce the time cost.
For evaluation on the tiny version of DVF, put all files of the tiny version into ./data. The data structure is organized as follows:

```
-- data
| -- DVF_tiny
| -- DVF_recons_tiny           # $RECONSTRUCTION_DATASET_ROOT
| -- mm_representations_tiny   # $MM_REPRESENTATION_ROOT
```
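With this layout, the two roots referenced by the evaluation scripts below can be set directly (a sketch for the tiny version; for the full version, substitute DVF_recons and mm_representations):

```bash
export RECONSTRUCTION_DATASET_ROOT=./data/DVF_recons_tiny
export MM_REPRESENTATION_ROOT=./data/mm_representations_tiny
```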
For evaluation on the full version of DVF, download the data from Reconstruction Dataset and Multi-Modal Forgery Representation, then put them into ./data. The data structure is organized as follows:

```
-- data
| -- DVF
| -- DVF_recons           # $RECONSTRUCTION_DATASET_ROOT
| -- mm_representations   # $MM_REPRESENTATION_ROOT
```
For evaluation on a customized dataset, details of data preparation can be found in dataset/readme.md.
Make sure the pre-trained weights are organized under ./weights. In launch-test.sh, set $RECONSTRUCTION_DATASET_ROOT and $MM_REPRESENTATION_ROOT to the data roots described in Data Structure. --cache-mm is recommended to save the computational and memory cost of the LMM branch. Then run launch-test.sh to test on the 7 datasets respectively.
```bash
python test.py \
    --classes videocrafter1 zeroscope opensora sora pika stablediffusion stablevideo \
    --ckpt ./weights/MM-Det/current_model.pth \
    --data-root $RECONSTRUCTION_DATASET_ROOT \
    --cache-mm \
    --mm-root $MM_REPRESENTATION_ROOT \
    --sample-size -1    # when sample-size > 0, only [sample-size] videos are evaluated per dataset (partial evaluation)
```
Since the entire evaluation is time-consuming, sample-size can be specified (e.g., 1000) to reduce the time cost by running inference on only a limited number (1,000) of videos per dataset. To run the entire evaluation, set sample-size to -1.
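For example, a quick partial run that keeps the arguments above but evaluates only 1,000 videos per dataset:

```bash
python test.py \
    --classes videocrafter1 zeroscope opensora sora pika stablediffusion stablevideo \
    --ckpt ./weights/MM-Det/current_model.pth \
    --data-root $RECONSTRUCTION_DATASET_ROOT \
    --cache-mm \
    --mm-root $MM_REPRESENTATION_ROOT \
    --sample-size 1000
```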
Known Issues:
- Based on the feedback we have received, we noticed a deviation in the training process when fine-tuning the large language model, which makes it difficult to fully reproduce our reported results in some cases. We are resolving this issue and will share the updated training scripts soon. For now, we provide the inference interface first.
We express our sincere appreciation to the following projects.
- LLaVA
- pytorch-image-models
- pytorch-vqvae
- Stable Diffusion
- VideoCrafter1
- Zeroscope
- OpenSora
- Stable Video Diffusion
```bibtex
@inproceedings{on-learning-multi-modal-forgery-representation-for-diffusion-generated-video-detection,
  author    = {Xiufeng Song and Xiao Guo and Jiache Zhang and Qirui Li and Lei Bai and Xiaoming Liu and Guangtao Zhai and Xiaohong Liu},
  title     = {On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection},
  booktitle = {Proceedings of the Thirty-eighth Conference on Neural Information Processing Systems},
  address   = {Vancouver, Canada},
  month     = {December},
  year      = {2024},
}
```