
arXiv '23 | Adding Conditional Control to Text-to-Image Diffusion Models. #72

NorbertZheng opened this issue Mar 1, 2023 · 8 comments

Ablation Study: Why do ControlNets use a deep encoder? What if it were lighter? Or even an MLP?

Overview

In 2023, if we want to train an encoder to perform some task, we have four basic options:

  1. Train a lightweight encoder from scratch.
  2. Train a lightweight encoder by fine-tuning existing encoders.
  3. Train a deep encoder from scratch.
  4. Train a deep encoder by fine-tuning existing encoders.

In our problem, we want to control Stable Diffusion, and the encoder will be trained jointly with a big Stable Diffusion (SD). Because of this, option (3) requires enormous computational power and is not practical unless you have as many A100s as EMostaque does. But we do not have that, so we may just forget about (3).

Options (1) and (2) are similar and can be merged; they usually have similar performance.

Note that

  • fine-tuning existing deep encoders,
  • and training lightweight encoders from scratch

are both relatively preferred methods. Which one is "harder" or "easier" to train is a complicated question that even depends on the training environment. We should not presume the learning behavior simply by looking at the number of parameters.

But in this post, let's pay more attention to the qualitative differences between these methods once they have been trained successfully.

Candidates

Let us consider these architectures:

ControlNet-Self (Finetuned)

Below is the model architecture that we released a few days ago as the final solution. It directly uses the encoder of Stable Diffusion (SD). Because it copies itself, let us call it ControlNet-Self.

[Image: ControlNet-Self architecture]
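
To make the structure concrete, here is a minimal PyTorch sketch of the idea (the class name, argument names, and the way features are returned are illustrative assumptions, not the repo's actual code): the encoder is a trainable copy of the SD U-Net encoder, and its features are injected back through zero-initialized 1×1 convolutions.

```python
import copy
from typing import List

import torch
import torch.nn as nn

class ControlNetSelf(nn.Module):
    """Sketch: trainable copy of the SD encoder + zero-initialized injection convs."""

    def __init__(self, sd_encoder: nn.Module, feature_channels: List[int]):
        super().__init__()
        # Trainable copy of the (frozen) Stable Diffusion encoder: same depth, same weights.
        self.control_encoder = copy.deepcopy(sd_encoder)
        # One zero-initialized 1x1 conv per injected feature map.
        self.zero_convs = nn.ModuleList()
        for ch in feature_channels:
            conv = nn.Conv2d(ch, ch, kernel_size=1)
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)
            self.zero_convs.append(conv)

    def forward(self, noisy_latent: torch.Tensor, control_hint: torch.Tensor,
                timestep: torch.Tensor, text_emb: torch.Tensor) -> List[torch.Tensor]:
        # The copy sees the same inputs as SD (latent, timestep, prompt embedding) plus the
        # control hint; the real model first maps the hint through a few small conv layers.
        feats = self.control_encoder(noisy_latent + control_hint, timestep, text_emb)
        # Each feature passes through a zero conv before being added to the corresponding
        # skip connection of the frozen SD U-Net decoder (the addition itself is not shown).
        return [zc(f) for zc, f in zip(self.zero_convs, feats)]
```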

ControlNet-Lite (From Scratch)

Below is a typical architecture to train lightweight encoders from scratch. We just use some simple convolution layers to

  • get some embedding,
  • and inject it into the Stable Diffusion (SD) U-Net.

Because it has relatively fewer parameters, let's call it ControlNet-Lite. The channel counts of the layers are computed by instantiating the ldm Python object.
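
For comparison, here is a rough sketch of what such a lightweight encoder could look like (the channel counts are placeholders, since the real ones are read off the instantiated ldm object):

```python
import torch
import torch.nn as nn

class ControlNetLite(nn.Module):
    """Sketch: a few strided convolutions turning the control map into an embedding."""

    def __init__(self, hint_channels: int = 3):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv2d(hint_channels, 32, kernel_size=3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.SiLU(),
        )

    def forward(self, hint: torch.Tensor) -> torch.Tensor:
        # The resulting embedding is added to the SD U-Net features at the matching resolution.
        return self.blocks(hint)
```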

ControlNet-MLP (From Scratch)

Below is a more extreme case that just uses a pixel-wise Multilayer Perceptron (MLP).

In recent years, MLPs have suddenly become popular again, and they are actually just $1\times 1$ convolutions. We use average (AVG) pooling for downsampling; let us call it "ControlNet-MLP". The channel counts of the layers are computed by instantiating the ldm Python object.

[Image: ControlNet-MLP architecture]
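
A rough sketch of this variant (channel counts again placeholders): every learnable layer is a $1\times 1$ convolution, i.e. an MLP applied independently at each pixel, and the only downsampling is average pooling.

```python
import torch
import torch.nn as nn

class ControlNetMLP(nn.Module):
    """Sketch: a pixel-wise MLP, i.e. 1x1 convolutions with average-pool downsampling."""

    def __init__(self, hint_channels: int = 3):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv2d(hint_channels, 64, kernel_size=1), nn.SiLU(),
            nn.AvgPool2d(kernel_size=2),
            nn.Conv2d(64, 128, kernel_size=1), nn.SiLU(),
            nn.AvgPool2d(kernel_size=2),
            nn.Conv2d(128, 256, kernel_size=1), nn.SiLU(),
        )

    def forward(self, hint: torch.Tensor) -> torch.Tensor:
        # No layer can see neighbouring pixels, so all spatial mixing comes from the pooling.
        return self.blocks(hint)
```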

Here We Go!

This house image is just the first search result when I search "house" on Pinterest. Let us use it as an example:
[Image: house photo]

And this is the synthesized scribble map after the preprocessor (you can use our scribble code to get this).
[Image: synthesized scribble map]

Then let me show off my prompt engineering skills a bit. I want a house under the winter snow. I will use this prompt:

  • Prompt:
Professional high-quality wide-angle digital art of a house designed by frank lloyd wright.
A delightful winter scene. photorealistic, epic fantasy, dramatic lighting, cinematic,
extremely high detail, cinematic lighting, trending on artstation, cgsociety, realistic
rendering of Unreal Engine 5, 8k, 4k, HQ, wallpaper
  • Negative Prompt:
longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits,
cropped, worst quality, low quality

The ControlNet-Self is just our final released ControlNet, and you can actually reproduce the results with the parameters below. Note that we use the same random seed 123456 for all experiments and generate 16 images without cherry-picking.

[Image: generation parameters]
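
If you prefer to script the reproduction rather than use the gradio UI, a hedged sketch with the diffusers library (not the script used in this post; the model IDs and the scribble file name are assumptions) would look roughly like this:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a scribble ControlNet and a SD 1.5 base model (IDs are assumptions).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

prompt = ("Professional high-quality wide-angle digital art of a house designed by "
          "frank lloyd wright. A delightful winter scene. photorealistic, epic fantasy, "
          "dramatic lighting, cinematic, extremely high detail, cinematic lighting, "
          "trending on artstation, cgsociety, realistic rendering of Unreal Engine 5, "
          "8k, 4k, HQ, wallpaper")
negative = ("longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, "
            "fewer digits, cropped, worst quality, low quality")

scribble = load_image("house_scribble.png")              # the scribble map shown above
generator = torch.Generator("cuda").manual_seed(123456)  # same seed as in the post

images = pipe(prompt, image=scribble, negative_prompt=negative,
              num_images_per_prompt=16, generator=generator).images
```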

ControlNet-Self Results

[Image: 16 generated samples]

ControlNet-Lite Results

[Image: 16 generated samples]

ControlNet-MLP Results

[Image: 16 generated samples]

Surprise, Surprise

It seems that they all give good results! The only difference is in some aesthetic concepts.

But why? Is the problem of controlling Stable Diffusion so trivial that everything works very well?

Why not turn off the ControlNet and see what happens:

[Image: results with ControlNet turned off]

Ah, then the secret trick is clear!

Because my prompts are carefully prepared, even without any control, the standard Stable Diffusion can already generate similar images that have many "overlapping concepts/semantics/shapes" with the input scribble maps.

In this case, it is true that every method can work very well.

In fact, in such an "easy" experimental setting, I believe Sketch-Guided Diffusion or even anisotropic filtering would also work very well to change the shape of objects and fit them to some user-specified structure.

But what about some other cases?

The Non-Prompt Test

Here we must introduce the Non-Prompt Test (NPT), a test that can avoid the influence of the prompts and test the "pure" capability of the ControlNet encoder.

NPT is simple: just remove all prompts (and put the image condition on the "c" side of the CFG formulation `prd = uc + (c - uc) * cfg_scale` so that the cfg scale can still work). In our user interface, we call this "Guess Mode" because the model seems to guess contents from the input control maps.
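
In code, this is just the usual classifier-free guidance combination, with the control map attached only to the conditional branch (a minimal sketch with hypothetical variable names):

```python
import torch

def guess_mode_cfg(uc: torch.Tensor, c: torch.Tensor, cfg_scale: float) -> torch.Tensor:
    """Combine the two denoiser predictions as prd = uc + (c - uc) * cfg_scale.

    For the Non-Prompt Test, the text prompt is empty for both branches, but the
    ControlNet condition is applied only when computing `c`, so increasing
    cfg_scale still amplifies the effect of the control map.
    """
    return uc + (c - uc) * cfg_scale
```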

Because no prompt is available, the ControlNet encoder must recognize everything on its own. This is really challenging, and note that all our production-ready ControlNets have passed extensive NPT tests before we made them publicly available.

The "ControlNet-Self" is just our final released ControlNet and you can actually reproduce the results with below parameters. Note that we do not input any prompts.

[Image: generation parameters (no prompt)]

ControlNet-Self Results

[Image: 16 generated samples]

ControlNet-Lite Results

[Image: 16 generated samples]

ControlNet-MLP Results

[Image: 16 generated samples]

Observations

Now things are much clearer.

  • The difference between the encoders lies in their capability to recognize the contents of the input control maps.
    • ControlNet-Self has strong recognition capability so that it works well even without prompts.
    • ControlNet-Lite and ControlNet-MLP are weak in this capability, and they cannot control Stable Diffusion (SD) to generate meaningful images without the help of user prompts.

But, is this really important?

The answer depends on your goal.

  • If your goal is to build a method as robust as the production-ready ControlNets, this capability is important. In a production environment, we never know how strange user prompts will be, and user prompts are not likely to always cover everything in the control maps. We always want the encoder to have some recognition capability.
  • If your goal is to solve some specific problems in a research project, or if you have very aligned or fixed inputs, then perhaps you may consider some lightweight solutions (although I personally think the design of ControlNet-Self can also work well in this case).

But if you want to achieve a system with quality similar to Style2Paints V5, then to the best of my knowledge, ControlNet-Self is the only solution.

Before We End

Why do we need these zero convolutions?

Now we also know why we need these zero convolutions:
[Image: zero convolutions in the ControlNet architecture]

Just imagine that these layers are initialized with noise: then a few training steps would immediately destroy the trainable copy.

The risk is very high that you would just be training the already destroyed trainable copy from scratch again. Obtaining the aforementioned object recognition capability would then require extensive retraining, similar to the amount of training required to produce the Stable Diffusion model itself.
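
For reference, a zero convolution is nothing exotic; a minimal PyTorch sketch:

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """A 1x1 convolution whose weights and bias start at exactly zero.

    At initialization it outputs zeros, so the frozen SD model's behaviour is
    untouched and the trainable copy is not disturbed. The gradient with respect
    to the conv's own weights is still non-zero (it depends on the input), so the
    layer moves away from zero after the first optimization steps.
    """
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```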

Why is it important that the ControlNet encoder should receive prompts?

We also know why it is important that the ControlNet encoder should also receive prompts:
[Image: the ControlNet encoder also receives the text prompt]

With this part, the ControlNet encoder's object recognition can be guided by the prompts, so that even when the prompt and the recognized control map semantics conflict, the user's prompt remains dominant.

For example, we already know that without prompts the model can recognize the house in the house scribble map, but we can still turn it into cakes with the prompt "delicious cakes" using that house scribble map:
[Image: "delicious cakes" results generated from the house scribble map]

Finally, note that this field is moving very fast, and we won't be surprised if some method suddenly comes out that uses just a few parameters and can also recognize objects equally well.
