Linear probes found controllable representations of scene attributes in a text-to-image diffusion model
Project page of "Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model"
Paper arXiv link: https://arxiv.org/abs/2306.05720
[NeurIPS link] [Poster link]
How to generate a short video of a moving foreground object using a pretrained text-to-image generative model?
See application_of_intervention.ipynb for how to use our intervention technique to generate a short video of moving objects.
The GIFs are sampled from the original text-to-image diffusion model without any fine-tuning. All frames are generated with the same prompt, random seed (initial latent vectors), and model. We edited the intermediate activations of the latent diffusion model while it generated the images so that its internal representation of the foreground matched our reference mask. See the notebook for implementation details.
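The sketch below is a rough illustration of the idea, not the paper's exact procedure: it hooks one intermediate UNet block of a Hugging Face `diffusers` StableDiffusionPipeline and nudges its activations along a linear-probe direction toward a reference foreground mask. The probe weights, layer choice, and update rule are placeholders; see application_of_intervention.ipynb for the actual intervention.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image latent diffusion model (no fine-tuning).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

# Placeholder linear probe: maps a C-channel intermediate activation to a
# per-pixel foreground logit. In practice, load the released probe weights.
probe_weight = torch.randn(1, 1280)          # NOT the paper's checkpoint
reference_mask = torch.zeros(1, 64, 64)      # target foreground mask for this frame
reference_mask[:, 20:40, 20:40] = 1.0

def edit_activations(module, inputs, output):
    """Push intermediate activations along the probe direction wherever the
    probe's foreground prediction disagrees with the reference mask."""
    h = output                                           # (B, C, H, W)
    B, C, H, W = h.shape
    w = probe_weight.to(h.device, h.dtype)               # (1, C)
    mask = torch.nn.functional.interpolate(
        reference_mask[None].to(h.device, h.dtype), size=(H, W)
    )[0]                                                 # (1, H, W)
    logits = torch.einsum("oc,bchw->bohw", w, h)         # probe prediction
    step = 0.1 * (mask - logits.sigmoid())               # simple, gradient-free update
    return h + step * w.view(1, C, 1, 1)

# Attach the hook to the UNet mid block (the layer choice is illustrative).
hook = pipe.unet.mid_block.register_forward_hook(edit_activations)
image = pipe(
    "a photo of a red ball on a table",
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
hook.remove()
```

Generating one frame per reference mask position with the same prompt and seed, then stitching the frames together, yields the kind of moving-object GIF shown above.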
Unzip probe_checkpoints.zip to obtain all the probe weights we trained. The weights in the unzipped folder are sufficient to run all experiments shown in the paper.
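A minimal sketch of unzipping the archive and loading one probe checkpoint with PyTorch; the file name and state-dict layout shown here are hypothetical, so check the unzipped folder and the notebooks for the actual names.

```python
import zipfile
import torch

# Extract the released probe checkpoints (paths are illustrative).
with zipfile.ZipFile("probe_checkpoints.zip") as zf:
    zf.extractall("probe_checkpoints")

# Load one probe's weights; replace the file name with an actual checkpoint.
state_dict = torch.load("probe_checkpoints/foreground_probe.pt", map_location="cpu")
probe = torch.nn.Linear(
    in_features=state_dict["weight"].shape[1],
    out_features=state_dict["weight"].shape[0],
)
probe.load_state_dict(state_dict)
probe.eval()
```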
If you find the source code in this repo helpful, please cite:
@article{chen2023beyond,
title={Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model},
author={Chen, Yida and Vi{\'e}gas, Fernanda and Wattenberg, Martin},
journal={arXiv preprint arXiv:2306.05720},
year={2023}
}