This section summarizes papers closely related to T2I generation, organized by topic: prerequisites of T2I generation, diffusion models combined with other techniques (e.g., Diffusion Transformer, LLMs, Mamba), and diffusion models applied to other tasks.
References for all summarized papers can be found in reference.bib.
- Prerequisites
- Diffusion Models Meet LLMs
- Diffusion Models Meet Mamba
- Diffusion Models Meet Federated Learning
- Diffusion Transformer-based Methods
- Diffusion Models for Text Generation
Note
This section summarizes the essential background knowledge for text-to-image diffusion models, e.g., DDPM, DDIM, classifier-free guidance, and latent diffusion models.
- [NeurIPS 2020] DDPM: Denoising Diffusion Probabilistic Models [Paper] [Code] [Project]
- [J. Mach. Learn. Res. 2020] T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [Paper] [Code]
- [ICLR 2021] DDIM: Denoising Diffusion Implicit Models [Paper] [Code]
- [NeurIPS 2021] Classifier Guidance: Diffusion Models Beat GANs on Image Synthesis [Paper] [Code]
- [ICML 2021] CLIP: Learning Transferable Visual Models From Natural Language Supervision [Paper] [Code]
- [arXiv 2022] Classifier-Free Diffusion Guidance [Paper] [Reproduced Code]
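Among the prerequisites above, classifier-free guidance is simple enough to sketch in a few lines. The snippet below is an illustrative toy (function name and scalar inputs are mine, not from any listed paper): the sampler runs the denoiser twice, with and without the text condition, and extrapolates between the two noise predictions.

```python
# Toy sketch of classifier-free guidance (Ho & Salimans, 2022).
# `eps_uncond` / `eps_cond` stand in for the denoiser's noise
# predictions without and with the prompt; scalars here for clarity,
# real implementations operate on full noise tensors.

def cfg_noise(eps_uncond: float, eps_cond: float, guidance_scale: float) -> float:
    """Combine unconditional and conditional noise predictions.

    guidance_scale = 1.0 recovers plain conditional sampling;
    larger values push samples harder toward the prompt.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With `guidance_scale = 1.0` the unconditional term cancels out, which is why guidance can be disabled without retraining.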
Note
Large Language Models (LLMs) have become a popular research direction thanks to their outstanding text processing ability. For more information, you may refer to my Zhihu blog.
This topic summarizes diffusion models that integrate LLMs.
- [arXiv 2023] ParaDiffusion: Paragraph-to-Image Generation with Information-Enriched Diffusion Model [Paper] [Code]
- [arXiv 2023] MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens [Paper]
- [ACM MM 2024] SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [Paper] [Code]
- [arXiv 2024] ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment [Paper] [Code] [Project]
Note
Mamba is a recent state space model architecture with strong performance on information-dense data, e.g., language modeling. This topic summarizes papers that integrate Mamba with diffusion models.
- [arXiv 2023] Mamba: Linear-Time Sequence Modeling with Selective State Spaces [Paper] [Code]
- [arXiv 2024] DiS: Scalable Diffusion Models with State Space Backbone [Paper] [Code]
- [arXiv 2024] ZigMa: Zigzag Mamba Diffusion Model [Paper] [Code] [Project]
- [arXiv 2024] DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis [Paper] [Code]
- [arXiv 2024] LinFusion: 1 GPU, 1 Minute, 16K Image [Paper] [Code] [Project] [Demo]
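To make the state-space idea behind these papers concrete, here is a toy scalar recurrence (my own illustrative sketch, not code from any listed work): an SSM layer maintains a hidden state that is updated linearly at each step, h_t = a·h_{t-1} + b·x_t, with output y_t = c·h_t. Mamba makes a, b, c input-dependent ("selective"), which this sketch omits.

```python
# Toy scalar state-space recurrence underlying Mamba-style layers:
#   h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
# Real models use learned matrices per channel and a parallel scan;
# this loop only shows the sequential semantics.

def ssm_scan(xs: list[float], a: float, b: float, c: float) -> list[float]:
    h = 0.0          # hidden state, initialized to zero
    ys = []
    for x in xs:
        h = a * h + b * x   # linear state update
        ys.append(c * h)    # linear readout
    return ys
```

Because the update is linear in the state, the whole sequence can also be computed with an associative parallel scan, which is what gives these models their linear-time scaling.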
Note
Federated learning focuses on settings where multiple clients collaboratively train a model while keeping their data decentralized. This topic summarizes papers that integrate federated learning with diffusion models.
- [AAAI 2024] Exploring One-Shot Semi-supervised Federated Learning with Pre-trained Diffusion Models [Paper]
- [ICLR 2024 Submission] Exploring the Effectiveness of Diffusion Models in One-Shot Federated Learning [Paper]
- [NeurIPS 2023] When Foundation Model Meets Federated Learning: Motivations, Challenges, and Future Directions [Paper]
- [ATC 2023] Federated Learning with Diffusion Models for Privacy-Sensitive Vision Tasks [Paper]
- [arXiv 2023] Phoenix: Federated Learning for Generative Diffusion Model [Paper]
- [arXiv 2023] FedDiff: Diffusion Model Driven Federated Learning for Multi-Modal and Multi-Clients [Paper]
- [arXiv 2023] One-Shot Federated Learning with Classifier-Guided Diffusion Models [Paper]
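The basic mechanism shared by most of these works is federated averaging: clients fine-tune local copies of the (diffusion) model and a server averages the parameters, so raw images never leave the clients. Below is a minimal sketch under simplifying assumptions (parameters as plain floats in dicts; function names are mine), not the protocol of any specific paper above.

```python
# Minimal FedAvg sketch: weighted average of client parameter dicts.
# Real systems average tensors and add secure aggregation / DP noise.

def fedavg(client_weights: list[dict[str, float]],
           client_sizes: list[int]) -> dict[str, float]:
    """Average client models, weighting each by its local dataset size."""
    total = sum(client_sizes)
    keys = client_weights[0].keys()
    return {
        k: sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in keys
    }
```

One-shot variants (several papers above) perform this aggregation a single time instead of over many communication rounds.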
Note
Diffusion Transformer (DiT) replaces the U-Net backbone of diffusion models with a Transformer-based architecture, which has since been adopted by a wide range of related works, e.g., Sora, Stable Diffusion 3, and the PixArt series. For more information, you may refer to my Zhihu blog.
This topic summarizes diffusion models that are based on DiT.
- [ICCV 2023] DiT: Scalable Diffusion Models with Transformers [Paper] [Code] [Project] [Demo]
- [ICML 2023] UniDiffusers: One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale [Paper] [Code]
- [ICCV 2023] MDTv1: Masked Diffusion Transformer is a Strong Image Synthesizer [Paper] [Code]
- [ICLR 2024] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [Paper] [Code] [Project] [Demo]
- [TMLR 2024] MaskDiT: Fast Training of Diffusion Models with Masked Transformer [Paper] [Code]
- [arXiv 2024] FiT: Flexible Vision Transformer for Diffusion Model [Paper] [Code]
- [arXiv 2024] PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models [Paper] [Code]
- [arXiv 2024] PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation [Paper] [Code] [Project]
- [arXiv 2024] MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer [Paper] [Code]
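The step that lets a Transformer replace the U-Net is "patchify": the latent image is cut into non-overlapping p×p patches, each flattened into a token. The sketch below is illustrative only (a 2-D list stands in for a latent tensor; real DiT patchifies C×H×W latents with a learned linear projection).

```python
# Illustrative DiT-style patchify: split an H×W grid into flattened
# non-overlapping p×p patches, row-major, one token per patch.

def patchify(img: list[list[float]], p: int) -> list[list[float]]:
    h, w = len(img), len(img[0])
    assert h % p == 0 and w % p == 0, "grid must divide evenly into patches"
    patches = []
    for i in range(0, h, p):          # patch rows
        for j in range(0, w, p):      # patch columns
            patch = [img[i + di][j + dj] for di in range(p) for dj in range(p)]
            patches.append(patch)
    return patches
```

The resulting (H/p)·(W/p) tokens then pass through standard Transformer blocks, with the timestep and class/text condition injected, e.g., via adaptive layer norm in DiT.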
Note
Diffusion models have been applied to text generation to address the limitations of autoregressive models, e.g., the Transformer. However, standard diffusion models are hard to apply directly to text because of the discrete nature of natural language. Works such as D3PM, Bit Diffusion, and Diffusion-LM are designed to tackle this problem.
This topic summarizes diffusion models that are designed for text generation.
- [NeurIPS 2021] D3PM: Structured Denoising Diffusion Models in Discrete State-Spaces [Paper]
- [NeurIPS 2022] Diffusion-LM Improves Controllable Text Generation [Paper] [Code]
- [arXiv 2022] DDCap: Exploring Discrete Diffusion Models for Image Captioning [Paper] [Code]
- [ACL 2023] DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models [Paper] [Code]
- [ICLR 2023] (Bit Diffusion) Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning [Paper] [Reproduced Code]
- [CVPR 2023] SCD-Net: Semantic-Conditional Diffusion Networks for Image Captioning [Paper] [Code]
- [ICLR 2023] DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models [Paper] [Code]
- [arXiv 2023] DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models [Paper] [Code]
- [arXiv 2023] DiffCap: Exploring Continuous Diffusion on Image Captioning [Paper]
- [arXiv 2023] GlyphDiffusion: Text Generation as Image Generation [Paper]
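One common way the works above handle discreteness is an absorbing-state forward process in the spirit of D3PM's masking variant: at each forward step, every token independently jumps to a special mask symbol with some probability. This is a hedged toy sketch (names and the per-step probability schedule are mine), not any paper's exact transition matrix.

```python
# Toy absorbing-state corruption for discrete text diffusion:
# each token is replaced by [MASK] with probability mask_prob.
import random

MASK = "[MASK]"

def corrupt(tokens: list[str], mask_prob: float, rng: random.Random) -> list[str]:
    """One forward diffusion step over a discrete token sequence."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]
```

The reverse model is then trained to predict the original tokens from partially masked sequences, which is why DiffusionBERT can reuse a masked language model as its denoiser.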