Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
In this paper, we introduce Open-Vocabulary Attention Maps (OVAM), a training-free extension for text-to-image diffusion models that generates text-attribution maps from open-vocabulary descriptions. Additionally, we introduce a token optimization process that produces more accurate attention maps, improving the performance of existing semantic segmentation methods based on diffusion cross-attention maps.
Create a new virtual or conda environment (if applicable) and activate it. For example, using venv:
# Create a virtual environment (requires Python 3.8 or higher)
python -m venv venv
source venv/bin/activate
pip install --upgrade pip wheel
Install PyTorch with a compatible backend (CUDA, MPS, ...) and Diffusers 0.20. In our experiments, we tested the code on Ubuntu with CUDA 11.8 and on macOS with the MPS backend.
# Install PyTorch with CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
# Or PyTorch with the MPS backend for macOS
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
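The commands above install PyTorch only; the project install in the next step should pull in Diffusers as a dependency, but you can also install it explicitly. The exact pin below is an assumption based on the Diffusers 0.20 version mentioned above:
# Optional: install Diffusers 0.20 and transformers explicitly (assumed pins)
pip install "diffusers==0.20.*" transformers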
Clone the project's code and install its dependencies:
git clone [email protected]:vpulab/ovam.git
cd ovam
pip install .  # or `pip install -e .` for an editable installation
Or install directly from GitHub:
pip install git+https://github.com/vpulab/ovam.git
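As an optional sanity check (the exact command below is just a suggestion), verify that the packages import correctly:
# Check that PyTorch, Diffusers and OVAM are importable
python -c "import torch, diffusers, ovam; print(torch.__version__, diffusers.__version__)"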
The Jupyter notebook examples/getting_started.ipynb contains a full example of how to use OVAM with Stable Diffusion; it can also be run on Colab. This section shows a simplified version of that notebook.
Import the required libraries and load Stable Diffusion:
import torch
import matplotlib.pyplot as plt
from diffusers import StableDiffusionPipeline
from ovam.stable_diffusion import StableDiffusionHooker
from ovam.utils import set_seed
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("mps") #mps, cuda, ...
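If you want the example to run on whichever backend is available, a simple device-selection snippet like the following can replace the hard-coded "mps" (this heuristic is ours, not part of OVAM):
# Pick an available backend automatically: CUDA > MPS > CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
pipe = pipe.to(device)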
Generate an image with Stable Diffusion and store the attention maps using OVAM hooker:
with StableDiffusionHooker(pipe) as hooker:
    set_seed(123456)
    out = pipe("monkey with hat walking")
    image = out.images[0]
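The pipeline output follows the standard Diffusers interface, so the generated image can be saved to disk for inspection (the file name is arbitrary):
image.save("monkey_with_hat.png")  # out.images contains PIL images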
Extract attention maps for the attribution prompt monkey with hat walking and mouth:
ovam_evaluator = hooker.get_ovam_callable(
    expand_size=(512, 512)
)  # You can configure OVAM here (aggregation, activations, size, ...)
with torch.no_grad():
    attention_maps = ovam_evaluator("monkey with hat walking and mouth")
    attention_maps = attention_maps[0].cpu().numpy()  # (8, 512, 512)
Eight attention maps have been generated, one per token: 0:<SoT>, 1:monkey, 2:with, 3:hat, 4:walking, 5:and, 6:mouth, 7:<EoT>. Plot the attention maps for the words monkey, hat and mouth:
# Get maps for monkey, hat and mouth
monkey = attention_maps[1]
hat = attention_maps[3]
mouth = attention_maps[6]
# Plot using matplotlib
fig, (ax0, ax1, ax2, ax3) = plt.subplots(1, 4, figsize=(20, 5))
ax0.imshow(image)
ax1.imshow(monkey, alpha=monkey / monkey.max())
ax2.imshow(hat, alpha=hat / hat.max())
ax3.imshow(mouth, alpha=mouth / mouth.max())
plt.show()
Result (matplotlib code simplified; the full version is in examples/getting_started.ipynb):
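Beyond visualization, the per-token maps can be binarized into segmentation masks. The snippet below is a minimal sketch using a simple relative threshold; the 0.5 value is an arbitrary choice for illustration and not the procedure used in the paper:
import numpy as np

# Binarize the attention map of the word "monkey" into a boolean mask
monkey_map = attention_maps[1]
mask = monkey_map >= 0.5 * monkey_map.max()  # (512, 512) boolean mask
np.save("monkey_mask.npy", mask)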
The OVAM library includes code to optimize tokens in order to improve the attention maps. Given an image generated with Stable Diffusion using the text a photograph of a cat in a park, we optimize a cat token to obtain an accurate mask of the cat in the image (full example in the notebook). This optimized token can later be used to generate a mask of the cat in other test images, for example in an image generated with the text cat perched on the sofa looking out of the window.
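The exact optimization utilities are shown in the notebook. As a rough illustration of the underlying idea (the function names, loss choice, and hyperparameters below are our assumptions, not the library's interface), token optimization amounts to gradient descent on a token embedding so that the attention map it induces matches an annotated mask:
import torch

def optimize_token(map_fn, init_embedding, target_mask, steps=300, lr=1e-2):
    """Conceptual sketch only: map_fn is a placeholder for a differentiable
    function that turns a token embedding into an attention map in [0, 1]."""
    embedding = init_embedding.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([embedding], lr=lr)
    bce = torch.nn.BCELoss()
    for _ in range(steps):
        optimizer.zero_grad()
        attention = map_fn(embedding)       # (H, W), values in [0, 1]
        loss = bce(attention, target_mask)  # match the annotated mask
        loss.backward()
        optimizer.step()
    return embedding.detach()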
The current code has been tested with Stable Diffusion 1.5, 2.0 base, and 2.1 with Diffusers 0.20. We provide the module ovam/base with utility classes to adapt OVAM to other diffusion models.
The datasets generated in the experiments can be found at this URL.
We want to thank the authors of DAAM, HuggingFace, PyTorch, RunwayML (Stable Diffusion 1.5), DatasetDM, DiffuMask and Grounded Diffusion.
Marcos-Manchón, P., Alcover-Couso, R., SanMiguel, J. C., & Martínez, J. M. (2024, June). Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9242–9252.
@InProceedings{Marcos-Manchon_2024_CVPR,
author = {Marcos-Manch\'on, Pablo and Alcover-Couso, Roberto and SanMiguel, Juan C. and Mart{\'\i}nez, Jos\'e M.},
title = {Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {9242-9252}
}