Improving initialization #2
Yes, that would be a huge improvement. But isn't ResNet initialization included with pbaylies' fork of the encoder? That one is well-maintained, and I'm not trying to duplicate it. It's also not too hard to port to StyleGAN2.
Other than that, if you pull the latest changes, you can use the projector, which may be a better choice anyway.
@SimJeg was the inspiration and provided the initial code for doing the ResNets in my repo in the first place; nice work!
@SimJeg: Ok, I've looked at this more closely ;) Three ResNet initializations below, from left to right: @pbaylies, StyleGAN V1; @Quasimondo, via twitter, using your (18, 512) ResNet above; myself, ditto, after 5 minutes of training.

[image: Mona Lisa ResNet]

The benefit of your approach is clearly visible.

I'm still wondering though: is ResNet initialization just a useful encoder optimization for faster convergence, or can it be demonstrated that it actually leads to better convergence than initializing with w_avg and running puzer's encoder or tkarras' projector? And if so, does that happen with specific classes of portraits, and is there a sweet spot at 2K or 3K iterations after which initialization doesn't matter?

My understanding, after reading the Image2StyleGAN paper, is that ~5K iterations are sufficient to encode anything into W-space with pretty high fidelity, with the possible exception of subtly translated faces (for example: a misaligned portrait looks slightly worse than a banana after 5K iterations). I'd be curious to see a failure case that can be fixed with better initialization.

But even if it's just about speed, it may be a good idea to save everyone a few cycles by adding an option to download a pretrained ResNet.
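For concreteness, a minimal sketch of the two initializations being compared, using the standard StyleGAN2 TF calls; the network pickle URL and the `resnet` regressor are assumptions for illustration, not code from this repo.

```python
# Sketch only: w_avg initialization vs. ResNet-based initialization of the dlatent.
import numpy as np
import dnnlib.tflib as tflib
import pretrained_networks

tflib.init_tf()
_G, _D, Gs = pretrained_networks.load_networks('gdrive:networks/stylegan2-ffhq-config-f.pkl')

# Option 1: start from the average dlatent, broadcast to all 18 layers.
w_avg = Gs.get_var('dlatent_avg')               # shape (512,)
dlatents_init = np.tile(w_avg, (1, 18, 1))      # shape (1, 18, 512)

# Option 2: start from a ResNet prediction for the target image
# (hypothetical Keras model mapping a (1, 256, 256, 3) image to (1, 18, 512)).
# dlatents_init = resnet.predict(target_image)
```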
@Quasimondo spotted a mistake in my code: in the `finetune_18` function, the `w_mix` argument is missing from the `get_batch` call of the training phase. So the function does nothing more than the first one! I hope someone can train it and see if it improves the initialization.

Simon
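Since the original gist isn't reproduced in this thread, the following is only a hypothetical sketch of the bug and its fix; `get_batch`, its signature, and the `w_mix` default are assumptions.

```python
# Hypothetical illustration of the bug described above; not the original gist.
def finetune_18(model, get_batch, n_batches, batch_size, w_mix=0.7):
    """Train the second (18, 512) ResNet, forwarding w_mix to get_batch."""
    for _ in range(n_batches):
        # Before the fix, the training phase called get_batch(batch_size) without
        # w_mix, so no style mixing was applied and this function trained on the
        # same data distribution as the first (W[:, 0]) ResNet.
        X, W = get_batch(batch_size, w_mix=w_mix)
        model.train_on_batch(X, W)
```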
@SimJeg: I have trained a ResNet, and will post some results shortly.

resnet_18_20191231.h5 (best test loss: 0.04438), available at https://rolux.org/media/stylegan2/resnet_18_20191231.h5

If you get a TypeError: Unexpected keyword argument passed to optimizer: learning_rate, you'll need to upgrade Keras from 2.2.* to 2.3.* - lr was renamed to learning_rate :(

[image: predictions_paintings] [image: predictions_covers]
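A small compatibility sketch for that error, assuming the training code passes the new-style keyword to a Keras optimizer:

```python
# Keras 2.3 renamed the optimizer argument 'lr' to 'learning_rate';
# Keras 2.2.x raises "Unexpected keyword argument passed to optimizer: learning_rate".
import keras
from keras.optimizers import Adam

major, minor = (int(v) for v in keras.__version__.split('.')[:2])
if (major, minor) >= (2, 3):
    optimizer = Adam(learning_rate=0.001)   # Keras >= 2.3
else:
    optimizer = Adam(lr=0.001)              # Keras 2.2.x
```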
Great job @rolux!!

@pbaylies, since you have studied this encoder question in depth: what are your main takeaways? Does EfficientNet bring additional precision for initialization? What is the best loss for fast and accurate image encoding?
Very nice work, @rolux! @SimJeg in my experience, the loss decreases gradually but the results keep improving over time. My network architecture might have been a bit different: I had some layers added on after the main ResNet, mainly to avoid having a huge dense layer. I got my best performance from using a ResNet and simply training it for longer, but I liked having support for EfficientNet, just to have more potential options. I think there's still a lot that could be explored here, as far as different possible architectures and configurations.
It took me a while to appreciate the fact (thanks to @pbaylies for the insight) that encoder output can have high visual quality, but bad semantics. The W(18, 512) projector, for example, explores W-space so well that its output, beyond a certain number of iterations, becomes meaningless. It is so far from w_avg that the usual applications -- interpolate with Z -> W(1, 512) dlatents, apply direction vectors obtained from Z -> W(1, 512) samples, etc. -- won't work as expected.

To illustrate this, I have run the projector on the 8 samples I had posted above, once for 1000 and once for 5000 iterations, and plotted the visual quality (the projector's distance value) and the semantic quality (the distance from w_avg) of the results. For comparison: the mean semantic quality of Z -> W(1, 512) dlatents is 0.44. 2 is okay, 4 is bad.

To keep the dlatent closer to w_avg, one can either clip it or introduce a penalty, at the expense of some visual quality. Both options are present in pbaylies' encoder, but I haven't instrumented it yet.

Now what about ResNet initialization? I have added it to the projector, and tested it on 100 samples. The results suggest that, in this particular setup, it doesn't make a considerable difference.

Of course, this is by no means the last word on the matter. I could train the ResNet for longer, and/or play with w_mix. I could instrument puzer's encoder, add w_avg and ResNet initialization, and see how it compares. Also, my choice of metrics for visual and semantic quality may be misguided, just as my choice of samples (100 faces from FFHQ).
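To make the two knobs mentioned above concrete, here is a small sketch. The exact definition of the semantic score is my reading of this thread (a mean distance from w_avg), not something taken from the repo, and `max_dev` / `weight` are arbitrary illustrative values.

```python
import numpy as np

def semantic_quality(dlatents, w_avg):
    # dlatents: (18, 512) projection result; w_avg: (512,) average dlatent.
    # Larger = further from w_avg = worse semantics; per the numbers above,
    # ~0.44 is typical for Z -> W(1, 512) mappings, 2 is okay, 4 is bad.
    return np.mean(np.abs(dlatents - w_avg))

def clip_dlatents(dlatents, w_avg, max_dev=2.0):
    # Option 1: clip the dlatent to a box around w_avg,
    # trading some visual quality for better semantics.
    return np.clip(dlatents, w_avg - max_dev, w_avg + max_dev)

def dlatent_penalty(dlatents, w_avg, weight=1.0):
    # Option 2: a penalty term added to the projector/encoder loss
    # that pulls the dlatent back towards w_avg.
    return weight * np.mean(np.square(dlatents - w_avg))
```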
I am not sure how meaningful that semantic quality measure is - I guess it depends on what one is looking for in that latent space. If I am not mistaken, it just measures how "normal" (or, as I prefer to say, how "boring") the resulting image is. To me the really interesting images are those that are as far away from the mean as possible without showing artifacts or breaking up.

So the way I would interpret your semantic quality graph is that with ResNet initialization you are able to reach hard-to-find images that are outside the normal distribution more quickly, and at least with the ResNet encoder I trained for myself, I subjectively feel that this is the case - right now I only run 200 iterations, and often with "simple" faces it gets to a similar-enough state after just 50-70 iterations. Of course, with 200 iterations you will not get all the details, like occluding hands, hair strands or certain glasses, but at least for my purposes it's more important to get a lot of okay faces than just a few perfect ones.
@rolux thank you for investigating this! I find it surprising that the ResNet doesn't seem to give you any initial advantage in the visual quality metric. Is this ResNet outputting into W(1, 512) or W(18, 512) space? Also note that in my code for training a ResNet, I added a parameter for using the truncation trick in the generated training data, so you can control your initial semantic quality output, as it were.

@Quasimondo you're right about the normal-ness or boringness, as it were; I think this metric captures how well the model can represent or understand an image based on what it learned from the training data distribution. This is more useful if you're working with interpolations; if the goal is the image itself, then visual quality would be the important metric.
@pbaylies: I'm outputting into W(18, 512). It's SimJeg's code, with the fix for the `w_mix` bug described above.

@Quasimondo: From what I've seen, the visual distance drop you are getting from your ResNet seems to be more significant. Maybe you've made more changes than just that one fix? Trained it for longer? Or you're using the encoder and not the projector? If your specific use case involves encoding a large number of images, then getting acceptable results after 200 iterations would be a huge improvement. On the other hand, if you just want a few faces with high accuracy, then initialization doesn't seem to matter.

With regards to semantics: I totally agree that among Z -> W(1, 512) mappings, the interesting faces are usually the ones further away from w_avg. It's just that you can push projections much further than that. For example, take a look at this video, or the still below: on the left is a Z -> W(1, 512) face, ψ=0.75, with a semantics score of 0.28. On the right is the same face projected into W(18, 512), it=5000, with a score of 3.36. They both transition along the same "surprise" vector. On the left, this looks gimmicky, but visually okay. On the right, you have to multiply the vector by 10 to achieve a comparable amount of change, which leads to obvious artifacts.

As long as you obtain your vectors from annotated Z -> W(1, 512) samples, you're going to run into this problem. Should you just try to adjust your vectors more cleverly, or find better ones? My understanding is that this won't work, and that there is no outer W-space where you can smoothly interpolate between all the cool projections that are missing from the regular inner W-space mappings.

(Simplified: Z is a unit vector, a point on a 512D sphere. Ideally, young-old would be the north-south pole, male-female the east-west pole, smile-unsmile the front-back pole, and so on. W(1, 512) is a learned deformation of that surface that accounts for the uneven distribution of features in FFHQ. W(18, 512) is a synthesizer option that allows for style mixing and can be abused for projection. But all the semantics of StyleGAN reside in W(1, 512). W(18, 512) vectors filled with 18 different Z -> W(1, 512) mappings already belong to a different species. High-quality projections are paintings of faces.)

Should you use the encoder "just for the image"? As far as I can see, nothing keeps you from projecting arbitrary video into StyleGAN. If that happens to be a well-aligned portrait shot, are you the first person who can make a StyleGAN face sneeze or stick out her tongue? Or just the inventor of an extremely energy-inefficient video codec?
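A short sketch of the distinction drawn in the parenthesis above, using the standard StyleGAN2 TF API; the pickle URL is an assumption, and ψ=0.75 matches the example mentioned above.

```python
import numpy as np
import dnnlib.tflib as tflib
import pretrained_networks

tflib.init_tf()
_G, _D, Gs = pretrained_networks.load_networks('gdrive:networks/stylegan2-ffhq-config-f.pkl')

z = np.random.randn(1, 512)                  # Z: a point on (roughly) a 512D sphere
w = Gs.components.mapping.run(z, None)       # W(18, 512): 18 copies of one W(1, 512) vector
assert np.allclose(w[0, 0], w[0, 1])         # all 18 rows are identical for a plain mapping

w_avg = Gs.get_var('dlatent_avg')            # center of W space
w_trunc = w_avg + 0.75 * (w - w_avg)         # truncation, psi = 0.75

# A projected dlatent, by contrast, has 18 *different* rows -- a point that no
# Z -> W mapping produces, which is why its semantics behave differently.
images = Gs.components.synthesis.run(
    w_trunc, randomize_noise=False,
    output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True))
```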
@rolux Yes, I did make some more changes to the training method and also trained for longer. The major change is probably that I also mix W's resulting from previous descents into the training set, because those are extremely unlikely to be returned by just random initialization (even with style mixing) - and before you ask: no, the W's of my examples were not part of the training set. The other changes are just using a bigger training set and training for more epochs.

I like your point that high-quality projections are just paintings of faces - I have not analyzed which layers of W are mostly responsible for "rare" details, but I suspect that the heavy lifting is all done by the style layers.
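A hedged sketch of that training-set change, as I read it: some fraction of each batch comes from W's saved from earlier projection runs rather than from random sampling. `sample_random_dlatents` and the mixing fraction are assumptions.

```python
import numpy as np

def get_training_batch(Gs, sample_random_dlatents, saved_dlatents,
                       batch_size, projected_fraction=0.25):
    # saved_dlatents: (N, 18, 512) array of W's from previous projection descents.
    n_proj = int(batch_size * projected_fraction)
    idx = np.random.choice(len(saved_dlatents), n_proj, replace=False)
    w_proj = saved_dlatents[idx]                               # hard-to-reach W's
    w_rand = sample_random_dlatents(Gs, batch_size - n_proj)   # mapping + style mixing
    W = np.concatenate([w_proj, w_rand], axis=0)
    X = Gs.components.synthesis.run(W, randomize_noise=False)  # images for the ResNet
    return X, W
```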
@Quasimondo: To get a better sense of which layer does what, I used to render these style grids:

Top row: style target, midpoint, style source, 0-3 mix (coarse styles, the "pose"), 4-7 mix (middle styles, the "person"), 8-17 mix (fine styles, the "style"). Below: single-layer mixes from 0 to 17. Check 2 for hair, 4 for shape and smile, 6 for gender and eyes, 8 for light, 10 for lipstick, etc.

Maybe that also helps to visualize when the semantics of a projection get bad. Top: 1000 iterations, bottom: 5000 iterations. In the bottom grid, 9 and 11 start to look unhealthy.
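For reference, a minimal sketch of how such mixes can be assembled from two (18, 512) dlatents; the layer groupings are the ones listed above.

```python
import numpy as np

def mix_styles(w_target, w_source, layers):
    # w_target, w_source: (18, 512) dlatents. Copy the listed layers from the
    # style source into the style target: 0-3 = coarse ("pose"),
    # 4-7 = middle ("person"), 8-17 = fine ("style").
    w_mixed = np.array(w_target, copy=True)
    w_mixed[list(layers)] = w_source[list(layers)]
    return w_mixed

# Single-layer mixes as in the grids (layer 2: hair, 4: shape and smile,
# 6: gender and eyes, 8: light, 10: lipstick, ...):
# rows = [mix_styles(w_target, w_source, [k]) for k in range(18)]
```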
@Quasimondo: The changes you made to the ResNet training process sound interesting, I'll try to find out how much of that I can reproduce. I'm still a bit reluctant to add it to the repo because it seems like a step down a slippery slope: if I get this to work, I would want to try out EffNet initialization for comparison, then clipping vs. penalty to keep dlatents closer to w_avg, and so on. I would probably end up with a buggy, untested and less well-maintained implementation of half of pbaylies' encoder. So maybe, if something comes out of this, I'd rather submit it as a pull request for that one. We'll see...
@rolux that's basically the same slippery slope I went down with Puzer's encoder in the first place; go ahead if you like, it could use a rewrite by now, or at least a solid refactoring + ablation test. Note that I do have code already in there for training ResNets and EfficientNets (surely buggy / out of date by now, given that the library I had targeted had just been released), and there is code for using the mapping network to generate mixed latents for training in (18, 512) space.
This discussion was an interesting read, which kind of answers my question about the "tiled" projector (#21), given the semantic issue with W(18, 512) dlatents discussed above. Moreover, if I understand correctly, the grids shown in this post were obtained by projecting "Mona Lisa" with the W(18, 512) projector for 1000 and 5000 iterations.

This contrasts with my experience with the default projector from Nvidia's original repository, which seemed to converge fast (although it cannot perfectly fit the real image, because it optimizes a single tiled W(1, 512) dlatent). Now, I am left wondering if the default projector suffers from the same dependency on the number of iterations: it visually looked like it converged, but would more iterations change its semantics without me realizing it? That would be disappointing.
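For what it's worth, the difference between the two projectors can be sketched in terms of the variable that gets optimized (TF 1.x style, as in the official projector); shapes assume the FFHQ model, and the zero-filled `w_avg` is a stand-in for the real average dlatent.

```python
import numpy as np
import tensorflow as tf

w_avg = np.zeros((1, 1, 512), dtype=np.float32)    # stand-in for Gs's dlatent_avg

# Default ("tiled") projector: a single 512-vector is optimized and repeated
# across all 18 layers, so the result stays a W(1, 512)-style dlatent.
w_var_tiled = tf.Variable(w_avg, name='dlatents_var')
dlatents_tiled = tf.tile(w_var_tiled, [1, 18, 1])  # (1, 18, 512), rows tied together

# W(18, 512) projector: 18 independent rows. Fits the target image better,
# but can drift far from w_avg, which is the semantic issue discussed above.
w_var_free = tf.Variable(np.tile(w_avg, (1, 18, 1)), name='dlatents_var_free')
```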
Hey there, I come back here because the trade-off between visual quality and semantic quality is discussed in a paper by the people behind https://github.com/orpatashnik/StyleCLIP. There, semantic quality is called "editability".
Dear @rolux,

Many thanks for porting the work of @Puzer to StyleGAN2. I noticed the optimization sometimes fails due to bad initialization of the dlatent variable W. I tried to finetune two ResNets for a better initialization:

- ResNet 1 predicts W[:, 0] of shape (512,) from the image X generated by W. The ResNet is initialized with ImageNet weights.
- ResNet 2 predicts the full W of shape (18, 512). We use style mixing for W to cover a wider distribution of images. The ResNet is initialized with the weights of the first ResNet.

The code is quick & dirty but functional. This initialization solves some failure cases and speeds up convergence. Maybe, by digging further in this direction, it would be possible to avoid the optimization completely, as was done in the neural style transfer field? Below is an example where all 3 initializations work well (a rough sketch of the two regressors follows the examples):
Zero initialization (current behavior):

ResNet 1 initialization:

ResNet 2 initialization:
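Since the gist itself isn't included above, here is only a rough Keras sketch of the two regressors as described; the backbone, head sizes, and input resolution are assumptions, not the original code.

```python
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Reshape
from keras.models import Model

def build_resnets(image_size=256):
    # Shared backbone, initialized with ImageNet weights.
    base = ResNet50(include_top=False, weights='imagenet', pooling='avg',
                    input_shape=(image_size, image_size, 3))
    # ResNet 1: predicts W[:, 0] of shape (512,) from the image X generated by W.
    resnet_1 = Model(base.input, Dense(512)(base.output))
    # ResNet 2: predicts the full W of shape (18, 512); because it shares the
    # backbone, it effectively starts from the first ResNet's weights. Train it
    # on style-mixed W's (w_mix) to cover a wider distribution of images.
    # Note the large (2048 x 9216) dense head -- the kind of layer pbaylies
    # mentions trying to avoid above.
    w_full = Reshape((18, 512))(Dense(18 * 512)(base.output))
    resnet_2 = Model(base.input, w_full)
    return resnet_1, resnet_2
```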