Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs - ICML 2024
This repository contains the official implementation of our RPG, accepted by ICML 2024.
Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui
Peking University, Stanford University, Pika Labs
Overview of our RPG
Abstract: RPG is a powerful training-free paradigm that can utilize proprietary MLLMs (e.g., GPT-4, Gemini-Pro) or open-source local MLLMs (e.g., miniGPT-4) as the prompt recaptioner and region planner, combined with our complementary regional diffusion, to achieve SOTA text-to-image generation and editing. Our framework is very flexible and can generalize to arbitrary MLLM architectures and diffusion backbones. RPG can also generate images at very high resolutions. Here is an example:
[2024.1] Our main code is released along with the demo, supporting different diffusion backbones (SDXL, SD v2.0/2.1, SD v1.4/1.5), and one can reproduce our results using GPT-4 and Gemini-Pro. RPG is also compatible with local MLLMs, and we will continue to improve the results in the future.
[2024.4] Our codebase has been updated to build on diffusers; it now supports diffusion models in both single-checkpoint (ckpt) and diffusers formats. For the diffusion backbone, use RegionalDiffusionPipeline for base models such as SD v2.0/2.1 and SD v1.4/1.5, and RegionalDiffusionXLPipeline for SDXL.
[2024.10] We enhance RPG by incorporating a more powerful composition-aware backbone, IterComp, significantly improving performance on compositional generation without additional computational cost. Simply update the model path as shown below to obtain the results:
pipe = RegionalDiffusionXLPipeline.from_pretrained("comin/IterComp", torch_dtype=torch.float16, use_safetensors=True)
1024*1024 Examples
2048*1024 Example
A green twintail girl in orange dress is sitting on the sofa while a messy desk under a big window on the left, a lively aquarium is on the top right of the sofa, realistic style
Open Pose Example
Depth Map Example
Canny Edge Example
1024*1024 Examples
Compared with RPG
1. Set Environment
git clone https://github.com/YangLing0818/RPG-DiffusionMaster
cd RPG-DiffusionMaster
conda create -n RPG python==3.9
conda activate RPG
pip install -r requirements.txt
git clone https://github.com/huggingface/diffusers
2. Download Diffusion Models and MLLMs
To attain SOTA generative capabilities, we mainly employ SDXL, SDXL-Turbo, and Playground v2 as our base diffusion models. To generate high-fidelity images across various styles, such as photorealism, cartoons, and anime, we incorporate models from CivitAI. For photorealistic images, we recommend AlbedoBase XL and DreamShaper XL. Moreover, we generalize our paradigm to SD v1.5 and SD v2.1. All checkpoints are accessible within our Hugging Face spaces, with detailed descriptions.
We recommend GPT-4 or Gemini-Pro as the Multimodal Large Language Model (MLLM), as they not only exhibit superior performance but also reduce local memory usage. According to our experiments, the minimum VRAM requirement is 10GB with GPT-4; using a local LLM requires more VRAM. For those interested in running MLLMs locally, we suggest deploying miniGPT-4 or directly using large local LLMs such as Llama2-13b-chat and Llama2-70b-chat.
For users with limited computational resources, we provide a simple notebook demonstration that partitions the image into two equal-sized subregions. With minor changes to a few functions in the diffusers library, one can achieve good results using base diffusion models such as SD v1.4, v1.5, v2.0, and v2.1, as mentioned in our paper. You can also apply your own configurations to experiment on a graphics card with 8GB of VRAM. For details, please refer to our Example_Notebook.
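To make the two-region setup concrete, here is a minimal sketch of how an image could be partitioned into side-by-side subregions from a list of relative widths. The helper `split_columns` is hypothetical and is not part of the RPG codebase or the notebook; it only illustrates the geometry of the split.

```python
def split_columns(width, height, ratios):
    """Return (left, top, right, bottom) pixel boxes for side-by-side regions.

    `ratios` are relative widths. They are normalized, so [1, 1] and
    [0.5, 0.5] both give two equal-sized subregions.
    """
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    total = sum(ratios)
    boxes = []
    cumulative = 0.0
    left = 0
    for i, r in enumerate(ratios):
        cumulative += r
        # Snap the last boundary to the image edge to avoid rounding gaps.
        if i == len(ratios) - 1:
            right = width
        else:
            right = round(width * cumulative / total)
        boxes.append((left, 0, right, height))
        left = right
    return boxes


# Two equal-sized subregions of a 1024x1024 image.
boxes = split_columns(1024, 1024, [1, 1])
print(boxes)
```

Unequal ratios work the same way, for example `split_columns(1024, 512, [1, 3])` gives a narrow left region and a wide right region.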
Our method can automatically generate output without pre-storing MLLM responses, leveraging Chain-of-Thought reasoning and high-quality in-context examples to obtain satisfactory results. Users only need to specify a few parameters. Please note that we have two pipelines that support different model architectures: for SD v1.4/1.5/2.0/2.1 models, use RegionalDiffusionPipeline; for SDXL models, use RegionalDiffusionXLPipeline. For example, to use GPT-4 as the region planner, refer to the code below, contained in RPG.py:
from RegionalDiffusion_base import RegionalDiffusionPipeline
from RegionalDiffusion_xl import RegionalDiffusionXLPipeline
from diffusers.schedulers import KarrasDiffusionSchedulers,DPMSolverMultistepScheduler
from mllm import local_llm,GPT4
import torch
# If you want to load ckpt, initialize with ".from_single_file".
pipe = RegionalDiffusionXLPipeline.from_single_file("path to your ckpt",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
# If you want to use diffusers, initialize with ".from_pretrained".
# pipe = RegionalDiffusionXLPipeline.from_pretrained("path to your diffusers",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config,use_karras_sigmas=True)
pipe.enable_xformers_memory_efficient_attention()
## User input
prompt= ' A handsome young man with blonde curly hair and black suit with a black twintail girl in red cheongsam in the bar.'
para_dict = GPT4(prompt,key='...Put your api-key here...')
## MLLM based split generation results
split_ratio = para_dict['Final split ratio']
regional_prompt = para_dict['Regional Prompt']
negative_prompt = ""  # negative prompt (empty by default)
images = pipe(
    prompt=regional_prompt,
    split_ratio=split_ratio,  # The ratio of the regional prompt; the number of prompts is the same as the number of regions
    batch_size=1,  # batch size
    base_ratio=0.5,  # The ratio of the base prompt
    base_prompt=prompt,
    num_inference_steps=20,  # sampling steps
    height=1024,
    negative_prompt=negative_prompt,  # negative prompt
    width=1024,
    seed=None,  # random seed
    guidance_scale=7.0,
).images[0]
images.save("test.png")
prompt is the original prompt that roughly summarizes the content of the image.
base_prompt sets the base prompt for generation, which is a summary of the image; here we set base_prompt to the original input prompt by default.
base_ratio is the weight of the base prompt.
There are also other common optional parameters:
guidance_scale is the classifier-free guidance scale.
num_inference_steps is the number of sampling steps used to generate an image.
seed controls the random seed to make the generation reproducible.
Note that we introduce two important parameters: base_prompt and base_ratio.
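One way to read "the weight of the base prompt" is as a convex combination of the base-prompt result and the regional result. The sketch below illustrates that reading on plain lists of numbers. It is an assumption about how the weighting behaves, not the pipeline's actual code, and `blend_with_base` is a hypothetical helper.

```python
def blend_with_base(base_values, regional_values, base_ratio):
    """Weighted sum: base_ratio * base + (1 - base_ratio) * regional.

    Assumption for illustration only: both inputs are flat lists of
    floats standing in for the base and regional results.
    """
    if not 0.0 <= base_ratio <= 1.0:
        raise ValueError("base_ratio must be in [0, 1]")
    if len(base_values) != len(regional_values):
        raise ValueError("inputs must have the same length")
    return [
        base_ratio * b + (1.0 - base_ratio) * r
        for b, r in zip(base_values, regional_values)
    ]


# With base_ratio = 0.5, both sources contribute equally.
blended = blend_with_base([1.0, 1.0], [0.0, 2.0], base_ratio=0.5)
print(blended)
```

Under this reading, a larger base_ratio pulls the result toward the global summary, while a smaller one gives the regional prompts more influence.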
After adding your prompt and API key and setting the path to your downloaded diffusion model, run the following command to get the results:
python RPG.py
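Because the pipeline class depends on the backbone and the loader method depends on the model format, the choice can be sketched as a small lookup. `choose_loader` and the example path are hypothetical. The file-suffix rule is an assumption made for illustration, so check your own model files before relying on it.

```python
from pathlib import Path

SDXL_PIPELINE = "RegionalDiffusionXLPipeline"
BASE_PIPELINE = "RegionalDiffusionPipeline"


def choose_loader(model_path, is_sdxl):
    """Return (pipeline class name, loader method name) for a model path."""
    pipeline_name = SDXL_PIPELINE if is_sdxl else BASE_PIPELINE
    # Assumption: single-file checkpoints end in .safetensors or .ckpt;
    # anything else is treated as a diffusers-format directory or model id.
    suffix = Path(model_path).suffix.lower()
    if suffix in {".safetensors", ".ckpt"}:
        method = "from_single_file"
    else:
        method = "from_pretrained"
    return pipeline_name, method


# Hypothetical local SDXL checkpoint path.
choice = choose_loader("models/my_sdxl.safetensors", is_sdxl=True)
print(choice)
```

The returned names correspond to the two pipelines and the two initialization methods shown in RPG.py.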
FAQ: How to set --base_prompt & --base_ratio properly?
If you want to generate an image with multiple entities of the same class (e.g., two girls, three cats, a man and a girl), you should set base_prompt to a prompt that includes the number of entities of each class in the image. Another relevant parameter is base_ratio, the weight of the base prompt. According to our experiments, results are better when base_ratio is in [0.35, 0.55]. Here is the generated image for the code above:
You will get an image similar to ours as long as you use the same random seed:
Text prompt: A handsome young man with blonde curly hair and black suit with a black twintail girl in red cheongsam in the bar.
On the other hand, for an image that includes multiple entities of different classes, there is no need to use a base prompt. Here is an example:
from RegionalDiffusion_base import RegionalDiffusionPipeline
from RegionalDiffusion_xl import RegionalDiffusionXLPipeline
from diffusers.schedulers import KarrasDiffusionSchedulers,DPMSolverMultistepScheduler
from mllm import local_llm,GPT4
import torch
# If you want to load ckpt, initialize with ".from_single_file".
pipe = RegionalDiffusionXLPipeline.from_single_file("path to your ckpt",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
# If you want to use diffusers, initialize with ".from_pretrained".
# pipe = RegionalDiffusionXLPipeline.from_pretrained("path to your diffusers",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config,use_karras_sigmas=True)
pipe.enable_xformers_memory_efficient_attention()
prompt= 'From left to right, bathed in soft morning light,a cozy nook features a steaming Starbucks latte on a rustic table beside an elegant vase of blooming roses,while a plush ragdoll cat purrs contentedly nearby,its eyes half-closed in blissful serenity.'
para_dict = GPT4(prompt,key='your key')
split_ratio = para_dict['Final split ratio']
regional_prompt = para_dict['Regional Prompt']
negative_prompt = ""
images = pipe(
    prompt=regional_prompt,
    split_ratio=split_ratio,  # The ratio of the regional prompt; the number of prompts is the same as the number of regions
    batch_size=1,  # batch size
    base_ratio=0.5,  # The ratio of the base prompt
    base_prompt=None,  # If base_prompt is None, base_ratio has no effect
    num_inference_steps=20,  # sampling steps
    height=1024,
    negative_prompt=negative_prompt,  # negative prompt
    width=1024,
    seed=None,  # random seed
    guidance_scale=7.0,
).images[0]
images.save("test.png")
And you will get an image similar to our results:
It is important to know when to use base_prompt: if these parameters are not set properly, the results will not be satisfactory. We conducted an ablation study on the base prompt in our paper; please refer to it for more information.
We recommend using base models with over 13 billion parameters for high-quality results, although this increases load times and GPU memory usage. We have conducted experiments with models of three different sizes. Here we take llama2-13b-chat as an example:
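The guidance above can be condensed into a small helper. This is a minimal sketch: `base_settings` is a hypothetical function, and clamping to the recommended range is our own illustrative choice rather than something the pipeline does.

```python
# Range reported above as giving better results.
RECOMMENDED_BASE_RATIO = (0.35, 0.55)


def base_settings(prompt, has_same_class_entities, base_ratio=0.5):
    """Pick base_prompt and base_ratio following the FAQ guidance."""
    if not has_same_class_entities:
        # Entities of different classes: no base prompt is needed, and
        # base_ratio has no effect when base_prompt is None.
        return {"base_prompt": None, "base_ratio": base_ratio}
    low, high = RECOMMENDED_BASE_RATIO
    # Illustrative choice: clamp the ratio into the recommended range.
    clamped = min(max(base_ratio, low), high)
    return {"base_prompt": prompt, "base_ratio": clamped}


settings = base_settings("Two girls are chatting in the cafe.", True, base_ratio=0.8)
print(settings)
```

The returned dictionary can be unpacked into the pipeline call alongside the other arguments, for example `pipe(..., **settings)`.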
from RegionalDiffusion_base import RegionalDiffusionPipeline
from RegionalDiffusion_xl import RegionalDiffusionXLPipeline
from diffusers.schedulers import KarrasDiffusionSchedulers,DPMSolverMultistepScheduler
from mllm import local_llm,GPT4
import torch
# If you want to use single ckpt, use this pipeline
pipe = RegionalDiffusionXLPipeline.from_single_file("path to your ckpt",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
# If you want to use diffusers, use this pipeline
# pipe = RegionalDiffusionXLPipeline.from_pretrained("path to your diffusers",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config,use_karras_sigmas=True)
pipe.enable_xformers_memory_efficient_attention()
prompt= 'Two girls are chatting in the cafe.'
para_dict = local_llm(prompt,model_path='path to your model')
split_ratio = para_dict['Final split ratio']
regional_prompt = para_dict['Regional Prompt']
negative_prompt = ""
images = pipe(
    prompt=regional_prompt,
    split_ratio=split_ratio,  # The ratio of the regional prompt; the number of prompts is the same as the number of regions
    batch_size=1,  # batch size
    base_ratio=0.5,  # The ratio of the base prompt
    base_prompt=prompt,
    num_inference_steps=20,  # sampling steps
    height=1024,
    negative_prompt=negative_prompt,  # negative prompt
    width=1024,
    seed=1234,  # random seed
    guidance_scale=7.0,
).images[0]
images.save("test.png")
For the local version, after adding your prompt and setting the paths to your diffusion model and your local MLLM/LLM, just run the command below to get the results:
python RPG.py
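The examples above read two keys from the planner output, 'Final split ratio' and 'Regional Prompt'. In case a planner response does not contain both, a small check before calling the pipeline can fail early with a clear message. This is a sketch only: `check_plan` is a hypothetical helper, and the placeholder values below do not reflect the real output format.

```python
REQUIRED_KEYS = ("Final split ratio", "Regional Prompt")


def check_plan(para_dict):
    """Return (split_ratio, regional_prompt) or raise if a key is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not para_dict.get(key)]
    if missing:
        raise ValueError("planner output is missing: " + ", ".join(missing))
    return para_dict["Final split ratio"], para_dict["Regional Prompt"]


# Placeholder values; a real plan comes from GPT4(...) or local_llm(...).
plan = {
    "Final split ratio": "<split ratio from the planner>",
    "Regional Prompt": "<regional prompt from the planner>",
}
split_ratio, regional_prompt = check_plan(plan)
print(split_ratio, regional_prompt)
```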
@inproceedings{yang2024mastering,
title={Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs},
author={Yang, Ling and Yu, Zhaochen and Meng, Chenlin and Xu, Minkai and Ermon, Stefano and Cui, Bin},
booktitle={International Conference on Machine Learning},
year={2024}
}
Our RPG is a general MLLM-controlled text-to-image generation/editing framework, which is built upon several solid works. Thanks to AUTOMATIC1111, regional-prompter, SAM, diffusers and IA for their wonderful work and codebase! We also thank Hugging Face for sharing our paper.