A curated (continually updated) list of Text-to-Video studies, based on our survey paper: From Sora What We Can See: A Survey of Text-to-Video Generation. In this survey, we comprehensively review existing work in the Text-to-Video field, using OpenAI’s Sora as a guiding thread, and summarize 24 datasets and 9 evaluation metrics. We also discuss the open problems in this research area and in Sora itself, and draw on Sora's strengths and the characteristics of related fields to suggest future research directions. If our work inspires you, feel free to cite our paper and star our repo.
This project is curated and maintained by Rui Sun and Yumin Zhang.
@article{sun2024sora,
title={From Sora What We Can See: A Survey of Text-to-Video Generation},
author={Sun, Rui and Zhang, Yumin and Shah, Tejal and Sun, Jiahao and Zhang, Shuoying and Li, Wenqi and Duan, Haoran and Wei, Bo and Ranjan, Rajiv},
journal={arXiv preprint arXiv:2405.10674},
year={2024}
}
Topics of this repo cover: Text-to-Seq-Image, Text-to-Video.
- LivePhoto: Real Image Animation with Text-guided Motion Control
Team: HKU, Alibaba Group, Ant Group.
Xi Chen, Zhiheng Liu, Mengting Chen, et al., Hengshuang Zhao
arXiv, 2023.12 [Paper], [PDF], [Code], [Demo (Video)], [Home Page]
- Scalable Diffusion Models with Transformers
Sequential Images
Team: UC Berkeley, NYU.
William Peebles, Saining Xie
ICCV'23(Oral), arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
-
Zero-Shot Video Editing through Adaptive Sliding Score Distillation
Video Editing
Team: Nanjing University.
Lianghan Zhu, Yanqi Bao, Jing Huo, et al., Yang Gao
arXiv, 2024.06 [Paper], [PDF], [Home Page] -
CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion
Team: University of Science and Technology of China.
Xingrui Wang, Xin Li, Zhibo Chen
arXiv, 2024.06 [Paper], [PDF], [Home Page] -
VideoTetris: Towards Compositional Text-to-Video Generation
Team: Peking University.
Ye Tian, Ling Yang, Haotian Yang, et al., Bin Cui
arXiv, 2024.06 [Paper], [PDF], [Code], [Home Page] -
Searching Priors Makes Text-to-Video Synthesis Better
Team: Zhejiang University.
Haoran Cheng, Liang Peng, Linxuan Xia, et al., Boxi Wu
arXiv, 2024.06 [Paper], [PDF], [Home Page] -
Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
3DGS Task
Team: KAIST, ByteDance.
Inkyu Shin, Qihang Yu, Xiaohui Shen, et al., Liang-Chieh Chen
arXiv, 2024.06 [Paper], [PDF], [Home Page] -
ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
Team: Tsinghua University.
Tianchen Zhao, Tongcheng Fang, Enshu Liu, et al., Yu Wang
arXiv, 2024.06 [Paper], [PDF], [Home Page] -
FIFO-Diffusion: Generating Infinite Videos from Text without Training
Team: Computer Vision Laboratory, ECE & IPAI, Seoul National University
Jihwan Kim, Junoh Kang, Jinyoung Cho, Bohyung Han
arXiv, 2024.05 [Paper], [PDF], [Code], [Home Page] -
TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
Team: UCLA, Google.
Hritik Bansal, Yonatan Bitton, Michal Yarom, et al., Kai-Wei Chang
arXiv, 2024.05 [Paper], [PDF], [Code], [Dataset], [Pretrained Model], [Home Page] -
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Robotics
Team: Tsinghua University.
Jialong Wu, Shaofeng Yin, Ningya Feng, et al., Mingsheng Long
arXiv, 2024.05 [Paper], [PDF], [Code], [Home Page] -
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Team: Peking University, University of Rochester.
Shenghai Yuan, Jinfa Huang, Yujun Shi, et al., Li Yuan, Jiebo Luo
arXiv, 2024.04 [Paper], [PDF], [Code], [Home Page] -
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
Team: Snap Inc, University of Trento.
Willi Menapace, Aliaksandr Siarohin, et al., Sergey Tulyakov
arXiv, 2024.02 [Paper], [PDF], [Home Page] -
Video generation models as world simulators
Team: Sora, OpenAI.
Tim Brooks, Bill Peebles, Connor Holmes, et al., Aditya Ramesh
online page, 2024.02 [Paper], [Home Page] -
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
Team: University of Waterloo.
Weiming Ren, Harry Yang, Ge Zhang, et al., Wenhu Chen
arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] -
World Model on Million-Length Video And Language With RingAttention
Long Video
Team: UC Berkeley.
Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] -
360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
Team: Peking University.
Qian Wang, Weiqi Li, Chong Mou, et al., Jian Zhang
arXiv, 2024.01 [Paper], [PDF], [Code], [Home Page] -
MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
Team: ByteDance Inc.
Weimin Wang, Jiawei Liu, Zhijie Lin, et al., Jiashi Feng
arXiv, 2024.01 [Paper], [PDF], [Home Page] -
UniVG: Towards UNIfied-modal Video Generation
Team: Baidu Inc.
Ludan Ruan, Lei Tian, Chuanwei Huang, et al., Xinyan Xiao
arXiv, 2024.01 [Paper], [PDF], [Home Page] -
VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
Team: HiDream.ai Inc.
Fuchen Long, Zhaofan Qiu, Ting Yao and Tao Mei
arXiv, 2024.01 [Paper], [PDF], [Home Page] -
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Team: Tencent AI Lab.
Haoxin Chen, Yong Zhang, Xiaodong Cun, et al., Ying Shan
arXiv, 2024.01 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] -
Lumiere: A Space-Time Diffusion Model for Video Generation
Team: Google Research, Weizmann Institute, Tel-Aviv University, Technion.
Omer Bar-Tal, Hila Chefer, Omer Tov, et al., Inbar Mosseri
arXiv, 2024.01 [Paper], [PDF], [Home Page] -
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Team: Fudan University, Alibaba Group, HUST, Zhejiang University.
Yujie Wei, Shiwei Zhang, Zhiwu Qing, et al., Hongming Shan
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] -
VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Team: Peking University, Microsoft Research.
Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
arXiv, 2023.05 [Paper], [PDF] -
TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
Training-free
Team: Victoria University of Wellington, NVIDIA
Wan-Duo Kurt Ma, J.P. Lewis, W. Bastiaan Kleijn
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)] -
FreeInit: Bridging Initialization Gap in Video Diffusion Models
Training-free
Team: Nanyang Technological University
Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)] -
MTVG: Multi-text Video Generation with Text-to-Video Models
Training-free
Team: Korea University, NVIDIA
Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, et al., Sangpil Kim
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)] -
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Team: HUST, Alibaba Group, Zhejiang University, Ant Group
Xiang Wang, Shiwei Zhang, Hangjie Yuan, et al., Nong Sang
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] -
InstructVideo: Instructing Video Diffusion Models with Human Feedback
Team: Zhejiang University, Alibaba Group, Tsinghua University
Hangjie Yuan, Shiwei Zhang, Xiang Wang, et al., Dong Ni
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] -
VideoLCM: Video Latent Consistency Model
Team: HUST, Alibaba Group, SJTU
Xiang Wang, Shiwei Zhang, Han Zhang, et al., Nong Sang
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] -
Photorealistic Video Generation with Diffusion Models
Team: Stanford University (Fei-Fei Li), Google.
Agrim Gupta, Lijun Yu, Kihyuk Sohn, et al., José Lezama
arXiv, 2023.12 [Paper], [PDF], [Home Page] -
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
Team: HUST, Alibaba Group, Fudan University.
Zhiwu Qing, Shiwei Zhang, Jiayu Wang, et al., Nong Sang
arXiv, 2023.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] -
GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
Team: HKU, Meta.
Shoufa Chen, Mengmeng Xu, Jiawei Ren, et al., Juan-Manuel Perez-Rua
arXiv, 2023.12 [Paper], [PDF], [Home Page] -
StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
Team: Tsinghua University, Tencent AI Lab, CUHK.
Gongye Liu, Menghan Xia, Yong Zhang, et al., Ying Shan
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)] -
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Multimodal
Team: Tencent.
Zhanyu Wang, Longyue Wang, Zhen Zhao, et al., Zhaopeng Tu
arXiv, 2023.11 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] -
F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis
Training-free
Team: University of Electronic Science and Technology of China.
Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
arXiv, 2023.11 [Paper], [PDF] -
AdaDiff: Adaptive Step Selection for Fast Diffusion
Training-free
Team: Fudan University.
Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang
arXiv, 2023.11 [Paper], [PDF] -
FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax
Training-free
Team: University of Technology Sydney.
Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang
arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page] -
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Training-free
Team: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences.
Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, et al., Shifeng Chen
arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page] -
MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
Team: University of Science and Technology of China, MSRA, Xi'an Jiaotong University.
Yanhui Wang, Jianmin Bao, Wenming Weng, et al., Baining Guo
arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)] -
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation
Team: Peking University, Huawei Noah's Ark Lab.
Yuanxin Liu, Lei Li, Shuhuai Ren, et al., Lu Hou
arXiv, 2023.11 [Paper], [PDF], [Code], [Dataset] -
ART⋅V: Auto-Regressive Text-to-Video Generation with Diffusion Models
Team: University of Science and Technology of China, Microsoft.
Wenming Weng, Ruoyu Feng, Yanhui Wang, et al., Zhiwei Xiong
arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page], [Demo(video)] -
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Team: Stability AI.
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, et al., Robin Rombach
arXiv, 2023.11 [Paper], [PDF], [Code] -
FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
Team: Sber AI.
Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, et al., Denis Dimitrov
arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page], [Demo(live)] -
MoVideo: Motion-Aware Video Generation with Diffusion Models
Team: ETH, Meta.
Jingyun Liang, Yuchen Fan, Kai Zhang, et al., Rakesh Ranjan
arXiv, 2023.11 [Paper], [PDF], [Home Page] -
Optimal Noise pursuit for Augmenting Text-to-Video Generation
Team: Zhejiang Lab.
Shijie Ma, Huayi Xu, Mengjian Li, et al., Yaxiong Wang
arXiv, 2023.11 [Paper], [PDF] -
Make Pixels Dance: High-Dynamic Video Generation
Team: ByteDance.
Yan Zeng, Guoqiang Wei, Jiani Zheng, et al., Hang Li
arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)] -
Learning Universal Policies via Text-Guided Video Generation
Team: MIT, Google DeepMind, UC Berkeley.
Yilun Du, Mengjiao Yang, Bo Dai, et al., Pieter Abbeel
NeurIPS'23 (Spotlight), arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page] -
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Team: Meta.
Rohit Girdhar, Mannat Singh, Andrew Brown, et al., Ishan Misra
arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(live)] -
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
Team: Show Lab, National University of Singapore, Zhejiang University
Rui Zhao, Yuchao Gu, et al., Mike Zheng Shou
ECCV'24 (Oral), 2023.10 [Paper], [PDF], [Code], [Home Page] -
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling
Training-free
Team: Nanyang Technological University.
Haonan Qiu, Menghan Xia, Yong Zhang, et al., Ziwei Liu
ICLR'24, arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] -
ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation
Training-free
Team: Shanghai Artificial Intelligence Laboratory.
Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao
arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] -
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Team: Tencent AI Lab.
Haoxin Chen, Menghan Xia, Yingqing He, et al., Ying Shan
arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] -
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
Team: Shanghai Artificial Intelligence Laboratory.
Xinyuan Chen, Yaohui Wang, Lingjun Zhang, et al., Ziwei Liu
arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] -
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
Team: The Chinese University of Hong Kong.
Jinbo Xing, Menghan Xia, Yong Zhang, et al., Ying Shan
arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page], [Demo(live)], [Demo(video)] -
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
Team: Nankai University, MEGVII Technology.
Ruiqi Wu, Liangyu Chen, Tong Yang, et al., Xiangyu Zhang
arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] -
LLM-grounded Video Diffusion Models
Training-free
Team: UC Berkeley.
Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
arXiv, 2023.09 [Paper], [PDF], [Code(coming)], [Home Page] -
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
Team: UNC Chapel Hill.
Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
arXiv, 2023.09 [Paper], [PDF], [Code] -
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
Team: Baidu Inc.
Xin Li, Wenqing Chu, Ye Wu, et al., Jingdong Wang
arXiv, 2023.09 [Paper], [PDF], [Home Page] -
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
Team: Shanghai Artificial Intelligence Laboratory.
Yaohui Wang, Xinyuan Chen, Xin Ma, et al., Ziwei Liu
arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page] -
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
Team: Huawei.
Jiaxi Gu, Shicong Wang, Haoyu Zhao, et al., Hang Xu
arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page] -
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
Training-free
Team: School of Information Science and Technology, ShanghaiTech University.
Hanzhuo Huang, Yufan Feng, Cheng Shi, et al., Sibei Yang
NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Home Page] -
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
Team: Show Lab, National University of Singapore
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, et al., Mike Zheng Shou
arXiv, 2023.09 [Paper], [PDF], [Home Page], [Code], [Pretrained Model] -
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
Team: Institute of Automation, Chinese Academy of Sciences (CASIA).
Mingzhen Sun, Weining Wang, Zihan Qin, et al., Jing Liu
NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page], [Demo(video)] -
DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis
Training-free
Team: East China Normal University.
Zhongjie Duan, Lizhou You, Chengyu Wang, et al., Jun Huang
arXiv, 2023.08 [Paper], [PDF], [Home Page] -
SimDA: Simple Diffusion Adapter for Efficient Video Generation
Team: Fudan University, Microsoft.
Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
arXiv, 2023.08 [Paper], [PDF], [Code (Coming)], [Home Page] -
Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models
Team: National University of Singapore.
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua
arXiv, 2023.08 [Paper], [PDF], [Code] -
ModelScope Text-to-Video Technical Report
Team: Alibaba Group.
Jiuniu Wang, Hangjie Yuan, Dayou Chen, et al., Shiwei Zhang
arXiv, 2023.08 [Paper], [PDF], [Code], [Home Page], [Demo(live)] -
Dual-Stream Diffusion Net for Text-to-Video Generation
Team: Nanjing University of Science and Technology.
Binhui Liu, Xin Liu, Anbo Dai, et al., Jian Yang
arXiv, 2023.08 [Paper], [PDF] -
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Team: The Chinese University of Hong Kong.
Yuwei Guo, Ceyuan Yang, Anyi Rao, et al., Bo Dai
ICLR'24 (spotlight), arXiv, 2023.07 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] -
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
Team: HKUST.
Yingqing He, Menghan Xia, Haoxin Chen, et al., Qifeng Chen
arXiv, 2023.07 [Paper], [PDF], [Code], [Home Page], [Demo(video)] -
Probabilistic Adaptation of Text-to-Video Models
Team: Google, UC Berkeley.
Mengjiao Yang, Yilun Du, Bo Dai, et al., Pieter Abbeel
arXiv, 2023.06 [Paper], [PDF], [Home Page] -
ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation
Team: School of Artificial Intelligence, University of Chinese Academy of Sciences.
Jiawei Liu, Weining Wang, Wei Liu, Qian He, Jing Liu
IJCNN'23, 2023.06 [Paper], [PDF] -
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
Team: CUHK.
Jinbo Xing, Menghan Xia, Yuxin Liu, et al., Tien-Tsin Wong
arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] -
VideoComposer: Compositional Video Synthesis with Motion Controllability
Team: Alibaba Group.
Xiang Wang, Hangjie Yuan, Shiwei Zhang, et al., Jingren Zhou
NeurIPS'23, arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] -
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
Team: University of Chinese Academy of Sciences (UCAS), Alibaba Group.
Zhengxiong Luo, Dayou Chen, Yingya Zhang, et al., Tieniu Tan
CVPR'23, arXiv, 2023.06 [Paper], [PDF] -
DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
Training-free
Team: Korea University.
Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong Kim
arXiv, 2023.05 [Paper], [PDF] -
Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models
Team: Carnegie Mellon University.
Rohan Dhesikan, Vignesh Rajmohan
arXiv, 2023.05 [Paper], [PDF], [Code(coming)] -
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
Team: University of Maryland.
Songwei Ge, Seungjun Nah, Guilin Liu, et al., Yogesh Balaji
ICCV'23, arXiv, 2023.05 [Paper], [PDF], [Home Page] -
Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
Team: NUS, CUHK.
Zijiao Chen, Jiaxin Qing, Juan Helen Zhou
NeurIPS'23, arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page] -
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Team: Google Research
Dan Kondratyuk, Lijun Yu, Xiuye Gu, et al., Lu Jiang
arXiv, 2023.05 [Paper], [PDF], [Home Page], [Blog] -
VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning
Team: Tsinghua University, Beijing Film Academy
Hong Chen, Xin Wang, Guanning Zeng, et al., Wenwu Zhu
arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page] -
Text2Performer: Text-Driven Human Video Generation
Team: Nanyang Technological University
Yuming Jiang, Shuai Yang, Tong Liang Koh, et al., Ziwei Liu
arXiv, 2023.04 [Paper], [PDF], [Code], [Home Page], [Demo(video)] -
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
Team: University of Rochester, Meta.
Jie An, Songyang Zhang, Harry Yang, et al., Xi Yin
arXiv, 2023.04 [Paper], [PDF], [Home Page] -
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
Team: Tsinghua University, HKUST.
Yue Ma, Yingqing He, Xiaodong Cun, et al., Qifeng Chen
AAAI'24, arXiv, 2023.04 [Paper], [PDF], [Home Page], [Code] -
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
Team: NVIDIA.
Andreas Blattmann, Robin Rombach, Huan Ling, et al., Karsten Kreis
CVPR'23, arXiv, 2023.04 [Paper], [PDF], [Home Page] -
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
Team: University of Science and Technology of China, Microsoft.
Shengming Yin, Chenfei Wu, Huan Yang, et al., Nan Duan
arXiv, 2023.03 [Paper], [PDF], [Home Page] -
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
Team: Picsart AI Research (PAIR).
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, et al., Humphrey Shi
arXiv, 2023.03 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)] -
Structure and Content-Guided Video Synthesis with Diffusion Models
Team: Runway
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis
ICCV'23, arXiv, 2023.02 [Paper], [PDF], [Home Page] -
SceneScape: Text-Driven Consistent Scene Generation
Team: Weizmann Institute of Science, NVIDIA Research
Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel
NeurIPS'23, arXiv, 2023.02 [Paper], [PDF], [Code], [Home Page] -
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Team: Renmin University of China, Peking University, Microsoft Research
Ludan Ruan, Yiyang Ma, Huan Yang, et al., Baining Guo
CVPR'23, arXiv, 2022.12 [Paper], [PDF], [Code] -
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Team: Show Lab, National University of Singapore.
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model] -
MagicVideo: Efficient Video Generation With Latent Diffusion Models
Team: ByteDance Inc.
Daquan Zhou, Weimin Wang, Hanshu Yan, et al., Jiashi Feng
arXiv, 2022.11 [Paper], [PDF], [Home Page] -
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Long Video
Team: HKUST, Tencent AI Lab.
Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen
arXiv, 2022.10 [Paper], [PDF], [Code], [Home Page] -
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
Team: UC Santa Barbara, Meta.
Tsu-Jui Fu, Licheng Yu, Ning Zhang, et al., Sean Bell
CVPR'23, arXiv, 2022.11 [Paper], [PDF] -
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Team: Google.
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, et al., Dumitru Erhan
ICLR'23, arXiv, 2022.10 [Paper], [PDF], [Home Page] -
Imagen Video: High Definition Video Generation with Diffusion Models
Team: Google.
Jonathan Ho, William Chan, Chitwan Saharia, et al., Tim Salimans
arXiv, 2022.10 [Paper], [PDF], [Home Page] -
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
Story Visualization
Team: UNC Chapel Hill.
Adyasha Maharana, Darryl Hannan, Mohit Bansal
ECCV'22, arXiv, 2022.09 [Paper], [PDF], [Code], [Demo(live)] -
Make-A-Video: Text-to-Video Generation without Text-Video Data
Team: Meta AI.
Uriel Singer, Adam Polyak, Thomas Hayes, et al., Yaniv Taigman
ICLR'23, arXiv, 2022.09 [Paper], [PDF], [Code] -
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
Team: S-Lab, SenseTime.
Mingyuan Zhang, Zhongang Cai, Liang Pan, et al., Ziwei Liu
TPAMI'24, arXiv, 2022.08 [Paper], [PDF], [Code], [Home Page], [Demo] -
Word-Level Fine-Grained Story Visualization
Story Visualization
Team: University of Oxford.
Bowen Li, Thomas Lukasiewicz
ECCV'22, arXiv, 2022.08 [Paper], [PDF], [Code], [Pretrained Model] -
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Team: Tsinghua University.
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang
ICLR'23, arXiv, 2022.05 [Paper], [PDF], [Code], [Home Page], [Demo(video)] -
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
Team: Tsinghua University.
Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang
NeurIPS'22, arXiv, 2022.04 [Paper], [PDF], [Code], [Home Page] -
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
Team: Meta AI.
Songwei Ge, Thomas Hayes, Harry Yang, et al., Devi Parikh
ECCV'22, arXiv, 2022.04 [Paper], [PDF], [Home Page], [Code] -
Video Diffusion Models
text-conditioned
Team: Google.
Jonathan Ho, Tim Salimans, Alexey Gritsenko, et al., David J. Fleet
arXiv, 2022.04 [Paper], [PDF], [Home Page] -
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
Long Video
Team: Microsoft.
Chenfei Wu, Jian Liang, Xiaowei Hu, et al., Nan Duan
NeurIPS'22, arXiv, 2022.02 [Paper], [PDF], [Code], [Home Page] -
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
Team: Microsoft.
Chenfei Wu, Jian Liang, Lei Ji, et al., Nan Duan
ECCV'22, arXiv, 2021.11 [Paper], [PDF], [Code] -
GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
Team: Microsoft, Duke University.
Chenfei Wu, Lun Huang, Qianxi Zhang, et al., Nan Duan
arXiv, 2021.04 [Paper], [PDF] -
Cross-Modal Dual Learning for Sentence-to-Video Generation
Team: Tsinghua University.
Yue Liu, Xin Wang, Yitian Yuan, Wenwu Zhu
ACM MM'19 [Paper], [PDF] -
IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-Video Generation
Team: Peking University.
Kangle Deng, Tianyi Fei, Xin Huang, Yuxin Peng
IJCAI'19 [Paper], [PDF] -
Imagine This! Scripts to Compositions to Videos
Team: University of Illinois Urbana-Champaign, AI2, University of Washington.
Tanmay Gupta, Dustin Schwenk, Ali Farhadi, et al., Aniruddha Kembhavi
ECCV'18, arXiv, 2018.04 [Paper], [PDF] -
To Create What You Tell: Generating Videos from Captions
Team: USTC, Microsoft Research.
Yingwei Pan, Zhaofan Qiu, Ting Yao, et al., Tao Mei
ACM MM'17, arXiv, 2018.04 [Paper], [PDF] -
Neural Discrete Representation Learning
Team: DeepMind.
Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
NeurIPS'17, arXiv, 2017.11 [Paper], [PDF] -
Video Generation From Text
Team: Duke University, NEC Labs America.
Yitong Li, Martin Renqiang Min, Dinghan Shen, et al., Lawrence Carin
AAAI'18, arXiv, 2017.10 [Paper], [PDF] -
Attentive Semantic Video Generation Using Captions
Team: IIT Hyderabad.
Tanya Marwah, Gaurav Mittal, Vineeth N. Balasubramanian
ICCV'17, arXiv, 2017.08 [Paper], [PDF] -
Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures
VAE
Team: IIT Hyderabad.
Gaurav Mittal, Tanya Marwah, Vineeth N. Balasubramanian
ACM MM'17, arXiv, 2016.11 [Paper], [PDF]
Datasets are divided according to their collected domains: Face, Open, Movie, Action, Instruct, Cooking.
Metrics are divided into image-level and video-level.
Dataset | Domain | Annotated | #Clips | #Sent | Len_C(s) | Len_S | #Videos | Resolution | FPS | Dur(h) | Year | Source |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CV-Text | Face | Generated | 70K | 1400K | - | 67.2 | - | 480P | - | - | 2023 | Online |
MSR-VTT | Open | Manual | 10K | 200K | 15.0 | 9.3 | 7.2K | 240P | 30 | 40 | 2016 | YouTube |
DideMo | Open | Manual | 27K | 41K | 6.9 | 8.0 | 10.5K | - | - | 87 | 2017 | Flickr |
Y-T-180M | Open | ASR | 180M | - | - | - | 6M | - | - | - | 2021 | YouTube |
WVid2M | Open | Alt-text | 2.5M | 2.5M | 18.0 | 12.0 | 2.5M | 360P | - | 13K | 2021 | Web |
H-100M | Open | ASR | 103M | - | 13.4 | 32.5 | 3.3M | 720P | - | 371.5K | 2022 | YouTube |
InternVid | Open | Generated | 234M | - | 11.7 | 17.6 | 7.1M | *720P | - | 760.3K | 2023 | YouTube |
H-130M | Open | Generated | 130M | 130M | - | 10.0 | - | 720P | - | - | 2023 | YouTube |
Y-mP | Open | Manual | 10M | 10M | 54.2 | - | - | - | - | 150K | 2023 | Youku |
V-27M | Open | Generated | 27M | 135M | 12.5 | - | - | - | - | - | 2024 | YouTube |
P-70M | Open | Generated | - | 70.8M | 8.5 | 13.2 | 70.8M | 720P | - | 166.8K | 2024 | YouTube |
ChronoMagic-Pro | Open | Generated | - | - | 234.1 | - | 460K | 720P | - | 30.0K | 2024 | YouTube |
LSMDC | Movie | Manual | 118K | 118K | 4.8 | 7.0 | 200 | 1080P | - | 158 | 2017 | Movie |
MAD | Movie | Manual | - | 384K | - | 12.7 | 650 | - | - | 1.2K | 2022 | Movie |
UCF-101 | Action | Manual | 13K | - | 7.2 | - | - | 240P | 25 | 27 | 2012 | YouTube |
ANet-200 | Action | Manual | 100K | - | - | 13.5 | 2K | *720P | 30 | 849 | 2015 | YouTube |
Charades | Action | Manual | 10K | 16K | - | - | 10K | - | - | 82 | 2016 | Home |
Kinetics | Action | Manual | 306K | - | 10.0 | - | 306K | - | - | - | 2017 | YouTube |
ActNet | Action | Manual | 100K | 100K | 36.0 | 13.5 | 20K | - | - | 849 | 2017 | YouTube |
C-Ego | Action | Manual | - | - | - | - | 8K | 240P | - | 69 | 2018 | Home |
SS-V2 | Action | Manual | - | - | - | - | 220.1K | - | 12 | - | 2018 | Daily |
How2 | Instruct | Manual | 80K | 80K | 90.0 | 20.0 | 13.1K | - | - | 2K | 2018 | YouTube |
HT100M | Instruct | ASR | 136M | 136M | 3.6 | 4.0 | 1.2M | 240P | - | 134.5K | 2019 | YouTube |
YCook2 | Cooking | Manual | 14K | 14K | 19.6 | 8.8 | 2K | - | - | 176 | 2018 | YouTube |
E-Kit | Cooking | Manual | 40K | 40K | - | - | 432 | *1080P | 60 | 55 | 2018 | Home |
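
The abbreviated counts in the table (`10K`, `2.5M`, ...) can be turned into numbers for a quick sanity check. The sketch below is only an illustration: it assumes that `Len_C(s)` is the average clip length in seconds and `Dur(h)` the total duration in hours, which is our reading of the column names rather than something stated above, and the helper names `parse_count` and `estimated_hours` are hypothetical.

```python
from typing import Optional

# Multipliers for the abbreviations used in the dataset table.
SUFFIXES = {"K": 1e3, "M": 1e6}


def parse_count(cell: str) -> Optional[float]:
    """Convert a count or length cell such as '10K', '2.5M' or '15.0s' to a float.

    Returns None for a missing value ('-'). Only meant for numeric cells,
    not for resolution cells such as '720P'.
    """
    text = cell.strip()
    if text in ("", "-"):
        return None
    if text.endswith("s"):  # tolerate a trailing unit, e.g. '15.0s'
        text = text[:-1]
    scale = 1.0
    if text[-1] in SUFFIXES:
        scale = SUFFIXES[text[-1]]
        text = text[:-1]
    return float(text) * scale


def estimated_hours(n_clips: str, len_c: str) -> Optional[float]:
    """Estimate total hours as #Clips x Len_C(s) / 3600 (assumed column meaning)."""
    clips = parse_count(n_clips)
    seconds = parse_count(len_c)
    if clips is None or seconds is None:
        return None
    return clips * seconds / 3600.0


if __name__ == "__main__":
    # (name, #Clips, Len_C(s), Dur(h)) copied from the table above.
    rows = [
        ("MSR-VTT", "10K", "15.0", "40"),
        ("LSMDC", "118K", "4.8", "158"),
        ("UCF-101", "13K", "7.2", "27"),
    ]
    for name, clips, len_c, dur in rows:
        print(name, "estimated:", estimated_hours(clips, len_c), "listed:", parse_count(dur))
```

Under these assumptions the estimate and the listed duration should land in the same range. A large gap more likely points to rounding or to a different definition of a column than to an error in the dataset itself.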
-
(ShareGPT4Video) ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Dataset (adding)
Team: USTC, CUHK, PKU, Shanghai AI Lab.
Lin Chen, Xilin Wei, Jinsong Li, et al., Jiaqi Wang
arXiv, 2024.06 [Paper], [PDF], [Code], [Dataset], [Home Page] -
(ChronoMagic-Pro) ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
Team: Peking University, University of Rochester.
Shenghai Yuan, Jinfa Huang, et al., Jiebo Luo, Li Yuan
arXiv, 2024.04 [Paper], [PDF], [Code], [Home Page] -
(VideoPhy) VideoPhy: Evaluating Physical Commonsense for Video Generation
Dataset (adding)
Team: University of California Los Angeles, Google Research.
Hritik Bansal, Zongyu Lin, Tianyi Xie, et al., Aditya Grover
arXiv, 2024.06 [Paper], [PDF], [Code], [Hugging Face], [Home Page] -
(GenAI-Bench) Evaluating Text-to-Visual Generation with Image-to-Text Generation
Dataset (adding)
Team: CMU, Meta.
Zhiqiu Lin, Deepak Pathak, Baiqi Li, et al., Deva Ramanan
arXiv, 2024.04 [Paper], [PDF], [Code], [Home Page] -
(VidProM) VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
Dataset (adding)
Team: ReLER Lab.
Wenhao Wang, Yi Yang
arXiv, 2024.03 [Paper], [PDF], [Code], [Hugging Face] -
(ECTV) EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Dataset (adding)
Team: Tencent AI Lab, CUHK.
Yaofang Liu, Xiaodong Cun, Xuebo Liu, et al., Ying Shan
CVPR'24, arXiv, 2023.10 [Paper], [PDF], [Code], [Dataset], [Home Page] -
(CV-Text) CelebV-Text: A Large-Scale Facial Text-Video Dataset
Dataset (Domain:Face)
Team: University of Sydney, SenseTime Research.
Jianhui Yu, Hao Zhu, Liming Jiang, et al., Wayne Wu
CVPR'23, arXiv, 2023.03 [Paper], [PDF], [Code], [Demo], [Home Page] -
(MSR-VTT) MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
Dataset (Domain:Open)
Team: Microsoft Research.
Jun Xu, Tao Mei, Ting Yao, Yong Rui
CVPR'16 [Paper], [PDF] -
(DideMo) Localizing moments in video with natural language
Dataset (Domain:Open)
Team: UC Berkeley, Adobe
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, et al., Bryan Russell
ICCV'17, arXiv, 2017.08 [Paper], [PDF] -
(YT-Tem-180M) MERLOT: Multimodal Neural Script Knowledge Models
Dataset (Domain:Open)
Team: University of Washington
Rowan Zellers, Ximing Lu, Jack Hessel, et al., Yejin Choi
NeurIPS'21, arXiv, 2021.06 [Paper], [PDF], [Code], [Home Page] -
(WebVid2M) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Dataset (Domain:Open)
Team: University of Oxford, CNRS.
Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
ICCV'21, arXiv, 2021.04 [Paper], [PDF], [Dataset], [Code], [Demo], [Home Page] -
(HD-VILA-100M) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Dataset (Domain:Open)
Team: Microsoft Research Asia.
Hongwei Xue, Tiankai Hang, Yanhong Zeng, et al., Baining Guo
CVPR'22, arXiv, 2021.11 [Paper], [PDF], [Code] -
(InternVid) InternVid: A large-scale video-text dataset for multimodal understanding and generation
Dataset (Domain:Open)
Team: Shanghai AI Laboratory.
Yi Wang, Yinan He, Yizhuo Li, et al., Yu Qiao
arXiv, 2023.07 [Paper], [PDF], [Code] -
(HD-VG-130M) VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Dataset (Domain:Open)
Team: Peking University, Microsoft Research.
Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
arXiv, 2023.05 [Paper], [PDF] -
(Youku-mPLUG) Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks
Dataset (Domain:Open)
Team: DAMO Academy, Alibaba Group.
Haiyang Xu, Qinghao Ye, Xuan Wu, et al., Fei Huang
arXiv, 2023.06 [Paper], [PDF] -
(VAST-27M) VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Dataset (Domain:Open)
Team: UCAS, CAS
Sihan Chen, Handong Li, Qunbo Wang, et al., Jing Liu
NeurIPS'23, arXiv, 2023.05 [Paper], [PDF] -
(Panda-70M) Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Dataset (Domain:Open)
Team: Snap Inc., University of California, University of Trento.
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Sergey Tulyakov
arXiv, 2024.02 [Paper], [PDF], [Code], [Home Page] -
(LSMDC) Movie description
Dataset (Domain:Movie)
Team: Max Planck Institute for Informatics.
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, et al., Bernt Schiele
IJCV'17, arXiv, 2016.05 [Paper], [PDF], [Home Page] -
(MAD) MAD: A scalable dataset for language grounding in videos from movie audio descriptions
Dataset (Domain:Movie)
Team: KAUST, Adobe Research.
Mattia Soldan, Alejandro Pardo, Juan León Alcázar, et al., Bernard Ghanem
CVPR'22, arXiv, 2021.12 [Paper], [PDF], [Code] -
(UCF-101) UCF101: A dataset of 101 human actions classes from videos in the wild
Dataset (Domain:Action)
Team: University of Central Florida.
Khurram Soomro, Amir Roshan Zamir, Mubarak Shah
arXiv, 2012.12 [Paper], [PDF], [Data] -
(ActNet-200) ActivityNet: A large-scale video benchmark for human activity understanding
Dataset (Domain:Action)
Team: Universidad del Norte, KAUST
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, Juan Carlos Niebles
CVPR'15 [Paper], [PDF], [Home Page] -
(Charades) Hollywood in homes: Crowdsourcing data collection for activity understanding
Dataset (Domain:Action)
Team: Carnegie Mellon University
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, et al., Abhinav Gupta
ECCV'16, arXiv, 2016.04 [Paper], [PDF], [Home Page] -
(Kinetics) The Kinetics human action video dataset
Dataset (Domain:Action)
Team: Google
Will Kay, Joao Carreira, Karen Simonyan, et al., Andrew Zisserman
arXiv, 2017.05 [Paper], [PDF], [Home Page] -
(ActivityNet) Dense-captioning events in videos
Dataset (Domain:Action)
Team: Stanford University
Ranjay Krishna, Kenji Hata, Frederic Ren, et al., Juan Carlos Niebles
ICCV'17, arXiv, 2017.05 [Paper], [PDF], [Home Page] -
(Charades-Ego) Charades-Ego: A large-scale dataset of paired third and first person videos
Dataset (Domain:Action)
Team: Carnegie Mellon University
Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, et al., Karteek Alahari
arXiv, 2018.04 [Paper], [PDF], [Home Page] -
(SS-V2) The "something something" video database for learning and evaluating visual common sense
Dataset (Domain:Action)
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, et al., Roland Memisevic
ICCV'17, arXiv, 2017.06 [Paper], [PDF], [Home Page] -
(How2) How2: a large-scale dataset for multimodal language understanding
Dataset (Domain:Instruct)
Team: Carnegie Mellon University.
Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, et al., Florian Metze
arXiv, 2018.11 [Paper], [PDF] -
(HowTo100M) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Dataset (Domain:Instruct)
Team: Ecole Normale Superieure, Inria, CIIRC.
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, et al., Josef Sivic
arXiv, 2019.06 [Paper], [PDF], [Home Page] -
(YouCook2) Towards automatic learning of procedures from web instructional video
Dataset (Domain:Cooking)
Team: University of Michigan, University of Rochester
Luowei Zhou, Chenliang Xu, Jason J. Corso
AAAI'18, arXiv, 2017.03 [Paper], [PDF], [Home Page] -
(Epic-Kitchens) Scaling egocentric vision: The EPIC-KITCHENS dataset
Dataset (Domain:Cooking)
Team: Uni. of Bristol, Uni. of Catania, Uni. of Toronto.
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al., Michael Wray
ECCV'18, arXiv, 2018.04 [Paper], [PDF], [Home Page] -
(PSNR/SSIM) Image quality assessment: from error visibility to structural similarity
Metric (image-level)
Team: New York University.
Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, E.P. Simoncelli
IEEE TIP, 2004.04 [Paper], [PDF] -
(IS) Improved techniques for training GANs
Metric (image-level)
Team: OpenAI
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al., Xi Chen
NeurIPS'16, arXiv, 2016.06 [Paper], [PDF], [Code] -
(FID) GANs trained by a two time-scale update rule converge to a local Nash equilibrium
Metric (image-level)
Team: Johannes Kepler University Linz
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al., Sepp Hochreiter
NeurIPS'17, arXiv, 2017.06 [Paper], [PDF] -
(CLIP Score) Learning transferable visual models from natural language supervision
Metric (image-level)
Team: OpenAI.
Alec Radford, Jong Wook Kim, Chris Hallacy, et al., Ilya Sutskever
ICML'21, arXiv, 2021.02 [Paper], [PDF], [Code] -
(Video IS) Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN
Metric (video-level)
Masaki Saito, Shunta Saito, Masanori Koyama, Sosuke Kobayashi
IJCV'20, arXiv, 2018.11 [Paper], [PDF], [Code] -
(FVD/KVD) FVD: A new metric for video generation
Metric (video-level)
Team: Johannes Kepler University, Google
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, et al., Sylvain Gelly
ICLR'19, arXiv, 2018.12 [Paper], [PDF], [Code] -
(FCS) Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation
Metric (video-level)
Team: Show Lab, National University of Singapore.
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model]
If you find this repository useful, please consider citing our paper and this list:
@article{sun2024sora,
title={From Sora What We Can See: A Survey of Text-to-Video Generation},
author={Sun, Rui and Zhang, Yumin and Shah, Tejal and Sun, Jiahao and Zhang, Shuoying and Li, Wenqi and Duan, Haoran and Wei, Bo and Ranjan, Rajiv},
journal={arXiv preprint arXiv:2405.10674},
year={2024}
}
@misc{sun2024t2vgenerationlist,
title={Awesome-Text-to-Video-Generation},
author={Sun, Rui and Zhang, Yumin},
year={2024},
publisher={GitHub},
howpublished={\url{https://github.com/soraw-ai/Awesome-Text-to-Video-Generation}},
}