Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes like material, style, layout, etc. remains a challenge, leading to a lack of disentanglement and editability. To address this, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information, providing a new perspective on representing, generating, and editing images. We develop Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer stronger disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image/text-guided material/style/layout transfer/editing, achieving previously unattainable results with a single image input without fine-tuning the diffusion models.
For details, see the paper.
For packages, see environment.yaml.
conda env create -f environment.yaml
conda activate ldm
Clone the repo
git clone https://github.com/zyxElsa/ProSpect.git
Train ProSpect:
python main.py --base configs/stable-diffusion/v1-finetune.yaml \
    -t \
    --actual_resume ./models/sd/sd-v1-4.ckpt \
    -n <run_name> \
    --gpus 0, \
    --data_root /path/to/directory/with/images
See configs/stable-diffusion/v1-finetune.yaml for more options.
Download the pretrained Stable Diffusion model and save it as ./models/sd/sd-v1-4.ckpt, the path that --actual_resume points to in the training command.
To generate new images, run ProSpect.ipynb
main(
    prompt='*',
    ddim_steps=50,
    strength=0.6,
    seed=42,
    height=512,
    width=768,
    prospect_words=[
        'a teddy * walking in times square',  # 10 generation ends
        'a teddy * walking in times square',  # 9
        'a teddy * walking in times square',  # 8
        'a teddy * walking in times square',  # 7
        'a teddy * walking in times square',  # 6
        'a teddy * walking in times square',  # 5
        'a teddy * walking in times square',  # 4
        'a teddy * walking in times square',  # 3
        'a teddy walking in times square',    # 2
        'a teddy walking in times square',    # 1 generation starts
    ],
    model=model,
)
prompt: the text prompt that is injected into all stages. If prospect_words is not None, a '*' in prompt is replaced by prospect_words; otherwise '*' is replaced by the learned token embedding.
prospect_words: the prompts injected into the individual stages. Edit this list to change them. As the comments in the example indicate, the list runs from the last stage (generation ends) to the first (generation starts). A '*' in prospect_words is replaced by the learned token embedding.
For img2img, a content_dir pointing to the image and a strength for diffusion are also needed.
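The stage-wise prompt injection described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the helper name `prompt_for_step` is hypothetical, and it assumes that stages are equal-sized groups of consecutive sampling steps and that `prospect_words` runs from the last stage to the first, as the comments in the example suggest.

```python
from typing import List


def prompt_for_step(step: int, ddim_steps: int, prospect_words: List[str]) -> str:
    """Pick the stage prompt for a 0-based sampling step (0 = generation starts).

    Hypothetical helper. Assumes equal-sized stages and a list ordered from the
    last stage (index 0) down to the first stage (last index).
    """
    if not 0 <= step < ddim_steps:
        raise ValueError("step must be in [0, ddim_steps)")
    num_stages = len(prospect_words)
    # 0-based stage index, counted from the start of generation.
    stage = step * num_stages // ddim_steps
    # Reverse the index because the list is written last-stage-first.
    return prospect_words[num_stages - 1 - stage]


# Placeholder prompts: 'stage 10', 'stage 9', ..., 'stage 1'.
words = [f"stage {i}" for i in range(10, 0, -1)]
first = prompt_for_step(0, 50, words)   # prompt used at the first sampling step
last = prompt_for_step(49, 50, words)   # prompt used at the final sampling step
```

Under these assumptions, ddim_steps = 50 with 10 prompts gives 5 consecutive steps per stage, so `first` is the last list entry and `last` is the first list entry.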
Reference Image:
Content-aware T2I generation
main(
    prompt='*',
    ddim_steps=50,
    strength=0.6,
    seed=42,
    height=512,
    width=768,
    prospect_words=[
        'a teddy * walking in times square',  # 10 generation ends
        'a teddy * walking in times square',  # 9
        'a teddy * walking in times square',  # 8
        'a teddy * walking in times square',  # 7
        'a teddy * walking in times square',  # 6
        'a teddy * walking in times square',  # 5
        'a teddy * walking in times square',  # 4
        'a teddy * walking in times square',  # 3
        'a teddy walking in times square',    # 2
        'a teddy walking in times square',    # 1 generation starts
    ],
    model=model,
)
with ProSpect:
Layout-aware T2I generation
main(
    prompt='*',
    ddim_steps=50,
    strength=0.6,
    seed=41,
    height=512,
    width=512,
    prospect_words=[
        'a corgi sits on the table',    # 10 generation ends
        'a corgi sits on the table',    # 9
        'a corgi sits on the table',    # 8
        'a corgi sits on the table',    # 7
        'a corgi sits on the table',    # 6
        'a corgi sits on the table',    # 5
        'a corgi sits on the table',    # 4
        'a corgi sits on the table',    # 3
        'a corgi sits on the table',    # 2
        'a corgi sits on the table *',  # 1 generation starts
    ],
    model=model,
)
with ProSpect:
without ProSpect:
Material-aware T2I generation
main(
    prompt='*',
    ddim_steps=50,
    strength=0.6,
    seed=42,
    height=512,
    width=768,
    prospect_words=[
        'a * dog on the table',  # 10 generation ends
        'a * dog on the table',  # 9
        'a * dog on the table',  # 8
        'a * dog on the table',  # 7
        'a * dog on the table',  # 6
        'a dog on the table',    # 5
        'a dog on the table',    # 4
        'a dog on the table',    # 3
        'a dog on the table',    # 2
        'a dog on the table',    # 1 generation starts
    ],
    model=model,
)
with ProSpect:
without ProSpect:
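The content-, layout-, and material-aware examples differ only in which stages carry the '*' token. A small helper can build such lists instead of writing out ten near-identical strings. `build_prospect_words` is a hypothetical name, not part of the repository, and the stage numbering (1 = generation starts) is taken from the comments in the examples.

```python
from typing import Iterable, List


def build_prospect_words(plain: str, with_token: str,
                         token_stages: Iterable[int],
                         num_stages: int = 10) -> List[str]:
    """Build a prospect_words list (hypothetical helper).

    token_stages holds 1-based stage numbers (1 = generation starts) whose
    prompt should contain the '*' token. The returned list is ordered from
    the last stage down to stage 1, matching the examples.
    """
    stages = set(token_stages)
    return [with_token if stage in stages else plain
            for stage in range(num_stages, 0, -1)]


# Material-aware: '*' only in the later stages 6-10.
material = build_prospect_words("a dog on the table",
                                "a * dog on the table", range(6, 11))
# Layout-aware: '*' only in stage 1, where generation starts.
layout = build_prospect_words("a corgi sits on the table",
                              "a corgi sits on the table *", [1])
```

The resulting lists can be passed as prospect_words to main in place of the hand-written lists.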
There are 4 ways to use ProSpect:
1. Image editing: in img2img mode, modify the content, material, and style of the original image.
2. Prompt-to-prompt editing: no learned token embedding * is needed. In txt2img mode, use text to guide image generation, and modify the prompts of different stages to change different attributes of the generated image.
3. Attribute guidance: in txt2img mode, first learn the token embedding *. Use * to represent the learned image concept and control the image attributes corresponding to different stages.
4. Flexible attribute guidance: in txt2img mode, first learn the token embedding *. When an item of the input list prospect_words is an int in the range 0-9, the token embedding at that position is replaced with the learned token embedding * at the corresponding position.
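The integer entries of the fourth mode can be illustrated as follows. This is only a sketch under one reading of the description: it assumes that an integer item selects the learned per-stage embedding with that index. The function name `resolve_prospect_words` is hypothetical, and plain strings stand in for the real embeddings.

```python
from typing import List, Sequence, Union


def resolve_prospect_words(prospect_words: Sequence[Union[str, int]],
                           learned: Sequence[str]) -> List[str]:
    """Replace integer items with the learned entry of that index.

    Hypothetical helper; string items are passed through unchanged.
    """
    resolved = []
    for item in prospect_words:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(item, int) and not isinstance(item, bool):
            if not 0 <= item < len(learned):
                raise ValueError(f"stage index {item} is out of range")
            resolved.append(learned[item])
        else:
            resolved.append(item)
    return resolved


# Stand-ins for the ten learned per-stage token embeddings.
learned = [f"<learned-{i}>" for i in range(10)]
mixed = resolve_prospect_words(["a dog on the table", 3, 0], learned)
```

Mixing strings and integers in one list lets text prompts drive some stages while learned embeddings drive others.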
Enjoy!
Download the training images here (8.1M).
Differences between (a) standard textual conditioning P and (c) the proposed prompt spectrum conditioning P*. Instead of learning a single global textual conditioning for the whole diffusion process, ProSpect obtains a set of different token embeddings derived from different denoising stages. Textual Inversion loses most of the fidelity. Compared with DreamBooth, which generates cat-like objects in the images, ProSpect can separate content from material and is better suited to attribute-aware T2I generation.
Experimental results showing that different attributes exist at different steps. (a) Results of removing the prompt 'a profile of a furry parrot' at different steps. (b) Results of adding the material attribute 'yarn' and the color attribute 'blue'. (c) Results of removing the style attributes 'Monet' and 'Picasso'.
The visualization results of token embeddings 𝑝𝑖 obtained by ProSpect. The results show that the initial generation step of the diffusion model is sensitive to structural information (e.g., bird’s pose, pot’s shape). As the number of steps increases, the obtained 𝑝𝑖 gradually captures detailed information (e.g., the sideways head of the bird → bird’s wing → the texture of the bird’s feathers).
Comparisons with state-of-the-art personalization methods including Textual Inversion (TI), DreamBooth, XTI, and Perfusion. The bold words correspond to the additional concepts added to each image (e.g., the 3rd column in (a) shows the result of 'A standing cat in a chef outfit', and the 6th column in (b) shows the result of 'A tilting cat wearing sunglasses'). The result images for XTI and Perfusion are taken from their papers, so results with added concepts are not shown for them. Our method faithfully conveys the appearance and material of the reference image while offering better controllability and diversity.
@article{zhang2023prospect,
title={ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models},
author={Zhang, Yuxin and Dong, Weiming and Tang, Fan and Huang, Nisha and Huang, Haibin and Ma, Chongyang and Lee, Tong-Yee and Deussen, Oliver and Xu, Changsheng},
journal={ACM Transactions on Graphics (TOG)},
volume={42},
number={6},
pages={244:1--244:14},
year={2023},
publisher={ACM New York, NY, USA}
}
Please feel free to open an issue or contact us personally if you have questions, need help, or need explanations. Write to one of the following email addresses, and maybe put one other in the cc: