arXiv '23 | Adding Conditional Control to Text-to-Image Diffusion Models. #72
Ablation Study: Why do ControlNets use a deep encoder? What if it was lighter? Or even an MLP?
## Overview

In 2023, if we want to train an encoder to perform some tasks, we have four basic options as follows:
In our problem, we want to control Stable Diffusion, and the encoder will be trained jointly with the big Stable Diffusion (SD) model. Because of this, options (1) and (2) are essentially similar and can be merged: they usually have similar performance. Note that both are relatively preferred methods. Which one is "harder" or "easier" to train is a complicated question that also depends on the training environment, so we will not make assertions about that here. In this post, let's pay more attention to the qualitative differences between these methods once they are already trained successfully.
## Candidates

Let us consider these architectures:

### ControlNet-Self (Finetuned)

Below is the model architecture that we released many days ago as the ControlNet.
It directly uses the encoder of Stable Diffusion (SD). Because it copies itself, let us call it ControlNet-Self.
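As a rough illustration only (not the released implementation), the idea behind ControlNet-Self can be sketched as a trainable copy of the frozen SD encoder whose outputs are injected back through zero-initialized 1×1 convolutions. The block interface `block(h, t_emb)`, the `hint_features` input, and the channel list are assumptions for the sake of the sketch:

```python
import copy
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution initialized to zero, so the control branch starts
    # as a no-op and only gradually influences the frozen SD U-Net.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlNetSelf(nn.Module):
    """Trainable copy of the (frozen) SD encoder blocks + zero convolutions."""

    def __init__(self, sd_encoder_blocks: nn.ModuleList, block_channels: list[int]):
        super().__init__()
        # Copy the pretrained SD encoder blocks; this copy is trained,
        # while the original SD encoder stays frozen.
        self.control_blocks = copy.deepcopy(sd_encoder_blocks)
        self.zero_convs = nn.ModuleList(zero_conv(c) for c in block_channels)

    def forward(self, x_noisy, hint_features, t_emb):
        # hint_features: the condition image, already embedded to the latent size
        # (assumed to be produced by a small hint encoder, omitted here).
        h = x_noisy + hint_features
        residuals = []
        for block, zconv in zip(self.control_blocks, self.zero_convs):
            h = block(h, t_emb)            # assumed to share the SD block interface
            residuals.append(zconv(h))     # added to the frozen U-Net's skip features
        return residuals
```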
### ControlNet-Lite (From Scratch)

Below is a typical architecture for training lightweight encoders from scratch. We just use some simple convolution layers to encode the condition image. Because it has relatively fewer parameters, let's call it ControlNet-Lite. The channels of its layers are set to match the SD feature maps they are injected into.
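A minimal sketch of what such a from-scratch lightweight encoder could look like; the channel counts, strides, and activations here are illustrative placeholders, not the exact configuration used in the experiments:

```python
import torch.nn as nn


class ControlNetLite(nn.Module):
    """A small stack of plain convolutions that downsamples the condition
    image and emits one feature map per SD encoder resolution."""

    def __init__(self, out_channels=(320, 640, 1280, 1280)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for ch in out_channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, kernel_size=3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(ch, ch, kernel_size=3, padding=1),
                nn.SiLU(),
            ))
            in_ch = ch

    def forward(self, hint):
        feats = []
        h = hint
        for stage in self.stages:
            h = stage(h)
            feats.append(h)  # injected into the matching SD feature map
        return feats
```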
### ControlNet-MLP (From Scratch)

Below is a more extreme case that just uses MLPs. In recent years, MLPs have suddenly become popular again, and in this per-pixel setting they are essentially just stacks of 1×1 convolutions.
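Correspondingly, the MLP variant can be sketched as per-pixel linear layers, i.e. 1×1 convolutions. Again, the layer widths below are placeholders, not the exact setup used in the ablation:

```python
import torch.nn as nn


class ControlNetMLP(nn.Module):
    """Per-pixel MLP: every layer is a 1x1 convolution, so there is no
    spatial receptive field beyond a single pixel."""

    def __init__(self, channels=(3, 64, 128, 320)):
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=1), nn.SiLU()]
        self.mlp = nn.Sequential(*layers)

    def forward(self, hint):
        # The output is a single feature map; its spatial size must be matched
        # to the SD latent beforehand (e.g. by downsampling the hint image).
        return self.mlp(hint)
```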
## Here We Go!

This house image is just the first search result when I search for "house" on Pinterest. Let us use it as an example. And this is the synthesized scribble map after the preprocessor (you can use our scribble code to get this; a rough sketch of that step appears after this paragraph). Then let me show off my prompt engineering skills a bit. I want a house under the winter snow. I will use this prompt:
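As an aside on the preprocessing step mentioned above: a generic, edge-based way to synthesize a scribble-like map from a photo could look like the sketch below. This is only an approximation for illustration, not necessarily the exact scribble preprocessor shipped with this repo, and the file names are placeholders:

```python
import cv2
import numpy as np


def fake_scribble(image_path: str, threshold1: int = 100, threshold2: int = 200) -> np.ndarray:
    """Turn a photo into a rough scribble map via Canny edges."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, threshold1, threshold2)
    # Thicken the edges slightly so they look more like hand-drawn strokes.
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=1)
    # Scribble ControlNets expect line art on a flat background; whether the
    # strokes are white-on-black or black-on-white depends on the UI, so
    # invert if needed.
    return edges


scribble = fake_scribble("house.png")
cv2.imwrite("house_scribble.png", scribble)
```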
The ControlNet-Self here is just our final released ControlNet, and you can actually reproduce the results with the parameters below. Note that we will just use the same random seed for all models.
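The original parameter screenshots are not reproduced in this copy. Purely as an illustration of the kind of setup involved (a scribble condition, a text prompt, and one fixed seed reused across models), a hedged sketch using the diffusers API might look like this; the original results were produced with this repo's own scripts, and the seed, steps, and guidance scale below are placeholders:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a scribble ControlNet and attach it to Stable Diffusion 1.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

scribble = load_image("house_scribble.png")
# Reuse the same generator seed for every model being compared.
generator = torch.Generator(device="cuda").manual_seed(12345)

image = pipe(
    prompt="a house under the winter snow",  # placeholder prompt
    image=scribble,
    num_inference_steps=20,
    guidance_scale=9.0,
    generator=generator,
).images[0]
image.save("controlnet_self_result.png")
```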
### ControlNet-Self Results

### ControlNet-Lite Results

### ControlNet-MLP Results

## Surprise, Surprise

It seems that they all give good results! The only differences are in some aesthetic details. But why? Is the problem of controlling Stable Diffusion so trivial that everything works well? Why not turn off the ControlNet and see what happens? Ah, then the secret trick is clear! Because my prompts are carefully prepared, even without any control the standard Stable Diffusion can already generate similar images that share many "overlapping concepts/semantics/shapes" with the input scribble map. In this case, it is true that every method can work very well. In fact, in such an "easy" experimental setting, I believe Sketch-Guided Diffusion or even anisotropic filtering would also work very well to change the shape of objects and fit them to some user-specified structure. But what about other cases?
## The Non-Prompt Test

Here we must introduce the Non-Prompt Test (NPT), a test that avoids the influence of the prompts and tests the "pure" capability of the ControlNet encoder. NPT is simple: just remove all prompts (and put the image condition on the "c" side of the CFG formulation "prd = uc + (c - uc) * cfg_scale" so that the cfg scale still works). In our user interface, we call this "Guess Mode" because the model seems to guess the contents from the input control maps. Because no prompt is available, the ControlNet encoder must recognize everything on its own. This is really challenging, and note that all our production-ready ControlNets passed extensive NPT tests before we made them publicly available. The "ControlNet-Self" here is just our final released ControlNet, and you can actually reproduce the results with the parameters below. Note that we do not input any prompts.
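For clarity, here is a minimal sketch of the sampler-side logic of this "Guess Mode" / NPT setup, assuming a hypothetical denoiser call `eps_model(x, t, text, control)`: both CFG branches use an empty prompt, and the control map is attached only to the conditional branch, so the cfg scale still has something to amplify.

```python
def cfg_step(eps_model, x_t, t, control_map, cfg_scale: float):
    """Classifier-free guidance for the Non-Prompt Test ("Guess Mode").

    prd = uc + (c - uc) * cfg_scale, where the control map only enters
    the conditional branch "c" and no text prompt is used at all.
    """
    empty_prompt = ""  # NPT: no prompt on either branch
    uc = eps_model(x_t, t, text=empty_prompt, control=None)
    c = eps_model(x_t, t, text=empty_prompt, control=control_map)
    return uc + (c - uc) * cfg_scale
```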
### ControlNet-Self Results

### ControlNet-Lite Results

### ControlNet-MLP Results

## Observations

Now things are much clearer.
## But, is this really important?

The answer depends on your goal.
But if you want to achieve a system with quality similar to Style2Paints V5, then to the best of my knowledge, ControlNet-Self is the only solution.