Combining transformer with diffusion models #375
Replies: 5 comments 6 replies
-
This is a very interesting line of work! I haven't yet grokked the architecture completely, but I'll dig into this when I have some more time. Thanks for spending the time; training guided diffusion (in the general case) is indeed an extremely slow endeavor. A fine usage for your 4x3090s!
-
This is really cool! I actually wanted to try this out and asked about it recently (you should join the DALLE-Pytorch Discord). I had only thought of the simplest approach: "deblurring" the VAE image output with the diffusion model, with the VAE frozen. The idea of using the diffusion model early is very interesting, but shouldn't you put it a bit later, to separate encoding from decoding (so the transformer can still work on the encoded tokens)? Also, I'm not very familiar with diffusion models, but do you have to add noise? For the simple model it would just "deblur" the VAE output. I was thinking the input would be the encoded image and the output would be the original, but I didn't think we would also need to add noise (the noise is already known and is just the difference between the two). If that's the case, maybe pix2pix, StyleGAN, etc. models are more appropriate? For the model where you add it at the start of the decoder, the noise would basically be the quantization error (codebook output minus encoder output).
-
Is this applicable to speech compression?
-
This is excellent
-
Hi guys, just wanted to show you something I've been working on. Here's a link to the code and models: https://github.com/Jack000/guided-diffusion
For a while now I've been thinking about how to combine transformer models with DDPMs. The intuition is that diffusion models are great at generating low-level texture but poor at global composition, while transformers have the opposite problem. It would be great if they could be combined somehow.
Anyway, after a lot of trial and error I've found an approach that works pretty well. The idea is really simple: just feed the image latents from the VQGAN encoder into a conditional DDPM. So after training a DALLE-pytorch model with the VQGAN VAE, you can feed the image latent codes into this diffusion model and get potentially much better image quality.
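To make the intended usage concrete, here's a rough sketch of that pipeline. `vqgan.encode` and `diffusion.p_sample_loop` follow the taming-transformers / guided-diffusion conventions, but the `latent` conditioning hook and the helper name are just illustrations of the idea, not the exact interface in the linked repo:

```python
import torch

@torch.no_grad()
def refine_with_diffusion(image, vqgan, diffusion, unet, image_size=256):
    # 1. get quantized latents from the frozen VQGAN encoder
    #    (at generation time these would come from the DALLE-pytorch transformer instead)
    z_q, _, _ = vqgan.encode(image)          # e.g. a [B, C, 16, 16] latent grid

    # 2. sample from the conditional DDPM, conditioning on those latents
    sample = diffusion.p_sample_loop(
        unet,
        (image.shape[0], 3, image_size, image_size),
        model_kwargs={"latent": z_q},        # hypothetical conditioning hook
    )
    return sample
```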
The diffusion model architecture is also really simple: it's exactly the same as the super-resolution model in guided-diffusion, except that where they concatenate the low-res image channel-wise with the noised input, we skip the encoder layers entirely and inject the latents into the middle block. This has the added benefit of letting us reuse the encoder and decoder weights from the pretrained model, so only the middle block needs to be retrained.
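Roughly, the injection might look like the sketch below. This is a simplified illustration of the idea (module and argument names are placeholders, not the actual guided-diffusion class names), assuming the VQGAN latents are projected to the middle block's channel count and added there:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentConditionedUNet(nn.Module):
    def __init__(self, encoder, middle_block, decoder, latent_channels, model_channels):
        super().__init__()
        self.encoder = encoder            # reused pretrained weights
        self.middle_block = middle_block  # the only part that gets retrained
        self.decoder = decoder            # reused pretrained weights
        # project VQGAN latents to the middle block's channel count
        self.latent_proj = nn.Conv2d(latent_channels, model_channels, 1)

    def forward(self, x_noisy, t_emb, latent):
        hs = []
        h = x_noisy
        for block in self.encoder:
            h = block(h, t_emb)
            hs.append(h)                  # skip connections as usual
        # inject the latents here instead of concatenating them with the noised input
        latent = F.interpolate(latent, size=h.shape[-2:])
        h = h + self.latent_proj(latent)
        h = self.middle_block(h, t_emb)
        for block in self.decoder:
            h = block(torch.cat([h, hs.pop()], dim=1), t_emb)
        return h
```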
I also tried some other approaches with varying degrees of success:
This was the first and most obvious approach, but the images always came out distorted, even with blurring + noise augmentation.
Errors made by the low-resolution model seem to be amplified by the upscaling process. I think you really need the conditioning augmentation from Google's SR3 to make this work, although the 128x128 -> 64x64 -> 256x256 upscaling pipeline seems to work OK.
When I tried this, the diffusion model quickly converged to producing nearly exactly the same image as the VAE output, errors and all. I think it gets too much information from the skip connections.
I thought that since most of the original OpenAI models had a class embedding, it would help to give the model class information (just replace the nn.embed with a nn.linear), but after testing I found that it makes almost zero difference (a minimal sketch of the swap is below).
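For reference, here's what that swap might look like. The dimensions and variable names are made up for illustration; in guided-diffusion the class embedding is added to the timestep embedding, and the same would apply to the linear projection:

```python
import torch
import torch.nn as nn

num_classes, time_embed_dim, cond_dim = 1000, 512, 512

# original class-conditional path: discrete label id -> embedding vector
label_emb = nn.Embedding(num_classes, time_embed_dim)
y = torch.randint(0, num_classes, (4,))
cond = label_emb(y)                       # [4, 512], added to the timestep embedding

# swapped-in path: continuous conditioning vector -> same embedding space
label_emb = nn.Linear(cond_dim, time_embed_dim)
z = torch.randn(4, cond_dim)              # e.g. pooled VQGAN/CLIP features
cond = label_emb(z)                       # [4, 512], used the same way as before
```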
So after settling on the current approach, I tried different image embeddings to see how it would affect the diffusion model:
I was curious to see what would happen if the latents contained not reconstruction information but the semantic content of the image.
Here are some results from my early experiments:
A person with glasses is a particularly difficult case, so I've been using this image from Unsplash for testing.
I tested on a lot more images, but overall I think the Gumbel f8 embeddings work better for reconstruction. The classifier approach is more like a complete re-interpretation and looks very GAN-like. More training might help but I imagine it would end up similar to regular clip-guided-diffusion.
I also tried training the embeddings together with the diffusion model, from scratch. The codebook collapsed and didn't move much - I think there's a conceptual issue with this, since the diffusion model is training for denoising and we want the embeddings to train for reconstruction, which are not totally aligned goals.
Some generations from the current model:
So I find it interesting that the initial noise matters a lot more than CLIP guidance. I haven't been tracking FID scores or anything, so it's very possible that the model is still under-trained.
It's a bit of a wash, but I think the 256x256 model does a bit better in the mouth and nose areas.
There's one experiment that I really wanted to try but couldn't get working: the idea is to train a VQGAN model on an edge-detector version of the image (a Sobel filter would work, I think), then feed these edge embeddings to the DDPM. This way the lighting and colors would be entirely generated by the diffusion model, and the transformer would only be responsible for the most salient features of the image. With only edges, the codebook could be a few hundred entries instead of 8192, possibly enabling 256x24x24 latents, which would help a lot with the global structure.
I did try this but couldn't get the VAE to converge (either the DALLE-pytorch DVAE or the VQGAN). There's some kind of issue when the image is predominantly black.
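For anyone who wants to poke at this, here's a minimal sketch of the Sobel preprocessing described above; the VQGAN would then be trained on these edge maps instead of RGB images. The kernels are the standard 3x3 Sobel kernels, everything else is illustrative:

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """img: [B, 3, H, W] in [0, 1]; returns a single-channel edge-magnitude map."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]], device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                  # transpose of the x kernel
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
```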
Anyway, try it out and let me know what you think. I'd be interested in any suggestions for other vector-quantized image embeddings we could use. I'm still training the 256x256 model, but it's going pretty slowly on 4x3090s.