VAE training sample script #3726
Comments
Currently I don't have the bandwidth to dive deeper into this, but I agree an easy training script for VAEs would make sense :-) Let's see if the community has time for it!
Definitely would love to dive deeper into this but would love some guidance if possible.
Update: the VAE training script runs successfully, but I'll need to test on a full dataset and evaluate the results. @zhuliyi0 Is there a dataset you would like me to try fine-tuning on? Preferably one hosted on Hugging Face?
Wow, super cool! I was planning to train a VAE to re-create certain architecture styles with consistent details, so I found this dataset on HF: https://huggingface.co/datasets/Xpitfire/cmp_facade Not a big dataset though, not sure if it works for you. Also, there are images of extreme aspect ratio. Let me know if there are more specific requirements on the dataset and I will try to find/assemble a better one.
@zhuliyi0 No worries and thanks for responding. Might be a little busy this week, but I'll try out the new dataset and see if the VAE is improving in terms of learning the new data.
I got the script to run, but it looks like my 12 GB of VRAM is far from enough. I assume VRAM usage will go down once Adam8bit and other optimizations are in place?
@zhuliyi0 Perhaps but I can't really confirm anything at the moment. I'm basing hardware requirements on the docs (https://huggingface.co/docs/diffusers/training/text2image):
But this is obviously for training the Stable Diffusion model, so the requirements will be different for sure. At this time, I'm trying to confirm that the AutoencoderKL is indeed being fine-tuned with reasonable performance before actually implementing further techniques like EMA weights, MSE-focused loss reconstruction + EMA weights, etc. (details are here: https://huggingface.co/stabilityai/sd-vae-ft-mse-original). If you would like to work on this PR together I would appreciate the help, since I may be a little MIA for the next 2 weeks at most.
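For context, the EMA-weights part mentioned above can be handled with `diffusers.training_utils.EMAModel`, the same helper the text2image scripts use. Below is a minimal sketch, not the PR's actual code; the checkpoint name is just an example.

```python
from diffusers import AutoencoderKL
from diffusers.training_utils import EMAModel

# Example checkpoint only; the PR's script takes its own --pretrained_model_name_or_path.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
ema_vae = EMAModel(vae.parameters(), model_cls=AutoencoderKL, model_config=vae.config)

# Inside the training loop, after each optimizer step:
#     ema_vae.step(vae.parameters())
# Before validation or saving, swap in the averaged weights, then restore:
#     ema_vae.store(vae.parameters())
#     ema_vae.copy_to(vae.parameters())
#     ...run validation...
#     ema_vae.restore(vae.parameters())
```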
I am a total newbie at Python and ML. I am still trying to run the script on my local GPU. Right now the OOM is gone after I stick to the arguments you provided in the script, and VRAM usage and training speed are fine, but there is an error when saving the validation image; it basically says an image file inside a wandb temp folder cannot be found. I checked and there is no such folder. I don't know how to use wandb to debug this one. Colab seems to be running without error, but the speed is a bit slow compared to my local GPU, probably normal for a T4. From the validation images, I see signs of improvement in the image details I was talking about; I will validate with inference after a reasonably sized training run has finished.
I got training to run on my local GPU on Windows. The directory error was due to the path naming convention on Windows. Again, from the validation images I can see it was learning. The loss was also going down. I noticed there is a VRAM leak in the log_validation function when the number of test images is 5 or above. I also failed to use the trained VAE inside a1111 for inference, getting the error "Missing key(s) in state_dict".
Hey @zhuliyi0, thanks for taking the time to test things. The script is definitely not perfect yet, but I'll work on the things you mentioned. In terms of transferring the VAE over to a1111, I'm not quite sure about that. I haven't played around with a1111, so I would need some time. My current focus will be to clean up the script and implement the memory-saving techniques to improve training. Then I'll see how we can make the VAE transferable to a1111.
Totally understand that the script wouldn't be perfect at this point. I am glad to help whenever I can. I will try using the pipeline to test inference performance. @pie31415
Here is a training test run: https://wandb.ai//zhuliyi0/goa_5e5/reports/VAE-training-test--Vmlldzo0ODYzMzcx I also did a quick inference test using a finetuned model that was trained on the same dataset, comparing results between the default and trained VAE. I can confirm the VAE is adding details, making the image better. Another issue: the output from the trained VAE looks white-washed. This happens on both sd15 and the finetuned model. I had to apply some brightness and contrast changes to the image. The validation images during training do not have this issue.
Your wandb experiment seems to be private/locked.
Are you referring to the default VAE or custom trained one? If it is a custom trained one can you provide a link to the weights? It'll be extremely beneficial to have some results to compare to when I'm fixing up experiments for the script.
Hmm yeah, it may be how we're training the VAE. I'll take a look over the weekend. Most likely the substantial changes will have to be done this weekend since I'm a little preoccupied before then. Thanks a lot for your patience though. 🤗
I made the project public. And the weight file: https://drive.google.com/file/d/1gTQqWuVA7m7GYIStVbulYS-tN_CMY-PM/view?usp=sharing Some inference images that show the white-wash issue, using the VAE at steps 4k - 40k, gradually getting worse: https://drive.google.com/drive/folders/16ivRLiLgb7dDixfFbNIL7vf_wNe9BaRO?usp=sharing
Hello, the lpips loss gives great results, though (without it, the image tends to become too 'smooth'). I used this library.
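The library link above is lost in this copy of the thread. Assuming it is the commonly used `lpips` package (an assumption; the commenter may have meant a different implementation), a minimal sketch of adding a perceptual term next to the MSE reconstruction loss:

```python
# Sketch only: combines MSE with an LPIPS perceptual term.
# Assumes `pip install lpips`; the 0.5 weight is an arbitrary placeholder.
import lpips
import torch
import torch.nn.functional as F

lpips_fn = lpips.LPIPS(net="vgg")  # frozen VGG-based perceptual metric
lpips_fn.requires_grad_(False)

def reconstruction_loss(pred: torch.Tensor, target: torch.Tensor, lpips_scale: float = 0.5) -> torch.Tensor:
    """Expects images scaled to [-1, 1] with shape (B, 3, H, W)."""
    mse = F.mse_loss(pred.float(), target.float(), reduction="mean")
    perceptual = lpips_fn(pred.float(), target.float()).mean()
    return mse + lpips_scale * perceptual
```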
Thanks for the feedback, that definitely might be the case. I'll take a look and make the necessary changes. Thanks again.
@zhuliyi0 I updated the PR with @ThibaultCastells's code. Can you give your training another try and let us know the results? (e.g., is the white-washing issue improved?) Also, I took a look at the VRAM issue you mentioned with test_images >= 5. I can't seem to reproduce it; can you give more details if you're still experiencing this issue? @ThibaultCastells I've credited the recent commit to you and I plan to mention your contribution in the PR as well.
@patrickvonplaten Do you mind giving the PR a look over when you're free? |
@pie31415 thank you very much! I will let you know if I have other improvement suggestions
By the way:
With a scale coefficient around
@pie31415 I re-ran a training with the new script; the result was conceivably no different. The white-wash issue still exists, the same as before. It seems like the training gradually makes the contrast lower and the brightness higher, but not by much. @ThibaultCastells do you mean "learning rate" when you say "coefficient"?
No, I meant the coefficient that multiplies the loss term ( Note that by default
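To make the "scale coefficient" concrete: in a script like this, each loss term is typically multiplied by its own weight before summing, roughly as below. The argument names and default values are illustrative only, not the script's actual flags.

```python
import torch
import torch.nn.functional as F

def vae_objective(pred, target, posterior, kl_scale=1e-6, lpips_scale=0.5, lpips_fn=None):
    """Weighted sum of reconstruction, KL, and optional LPIPS terms.
    `posterior` is the DiagonalGaussianDistribution from vae.encode(...).latent_dist."""
    loss = F.mse_loss(pred.float(), target.float(), reduction="mean")
    loss = loss + kl_scale * posterior.kl().mean()  # keeps the latent space close to N(0, I)
    if lpips_fn is not None:
        loss = loss + lpips_scale * lpips_fn(pred.float(), target.float()).mean()
    return loss
```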
I noticed that there is no
@ThibaultCastells Do you have any thoughts about why the VAE might be outputting white-washed reconstructions? I seem to have seen some Civitai models that had a similar issue. Not sure how it was resolved though.
You're right, a blunder on my part. I guess it must have been removed when I was playing around with things and I forgot to put it back. Thanks for the catch.
@pie31415 I am not too surprised that this issue happens when using only the MSE loss, because this is a very different training configuration than in the paper, so we don't know what to expect in this case. Therefore I would like to confirm whether @zhuliyi0 changed the default value of the scale coefficients of the loss when he checked the new code, and if so, what value was used? Note that when they finetune the VAE for SD they only finetune the decoder; that's probably why they do not use the KL loss (they do not need it, since the decoder does not affect the latent space). Also, not related, but is it normal that there is no .eval() when evaluating the model (and therefore another .train() after evaluation)? Is it handled by the accelerator.unwrap_model function?
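To illustrate the decoder-only setup described above (where the KL term becomes unnecessary because the latent space is untouched), here is a minimal sketch; the checkpoint id and learning rate are placeholders, not the script's defaults.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")  # placeholder checkpoint

# Freeze everything, then unfreeze only the decoder side; the encoder (and hence the
# latent distribution the unet was trained on) stays fixed, so no KL loss is needed.
vae.requires_grad_(False)
vae.decoder.requires_grad_(True)
vae.post_quant_conv.requires_grad_(True)

optimizer = torch.optim.AdamW(
    [p for p in vae.parameters() if p.requires_grad], lr=1e-5
)
```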
See #4636 for what happens with that normalisation range :D
I compared As a result, I'm sure that we need to fix our loss term. I recommend using
Are they using the same autoencoder model?
Can you make a commit for these changes? I'll work on the VAE loss and hopefully try to get it matching with LDM.
Hi everyone! I have a question regarding the decoder training. In my mind, it was necessary to sample from the encoder output distribution: src: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73 But in your implementation you feed the decoder directly with the "mode":

```python
for epoch in range(first_epoch, args.num_train_epochs):
    vae.train()
    train_loss = 0.0
    for step, batch in enumerate(train_dataloader):
        with accelerator.accumulate(vae):
            target = batch["pixel_values"].to(weight_dtype)
            # https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/autoencoder_kl.py
            posterior = vae.encode(target).latent_dist
            z = posterior.mode()
            pred = vae.decode(z).sample
```

I tried to challenge this assumption and performed 2 trainings with 100k iterations on CelebA-HQ. In my different tries, I noticed that mode seems to render better images, but I don't know what the behavior will be when we train an LDM on top of that. Giving uncertainty to the latent space can definitely help the subsequent diffusion process. Does somebody have any insight on that?
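For reference, the two options differ only in how the latent is drawn from the posterior. A small sketch contrasting them, as a hypothetical helper built on the diffusers AutoencoderKL API quoted above:

```python
import torch
from diffusers import AutoencoderKL

def encode_decode(vae: AutoencoderKL, target: torch.Tensor, sample_posterior: bool = True):
    """Reconstruct `target`, drawing the latent stochastically or deterministically."""
    posterior = vae.encode(target).latent_dist
    # sample(): mean + std * eps (the standard VAE formulation); mode(): just the mean.
    z = posterior.sample() if sample_posterior else posterior.mode()
    pred = vae.decode(z).sample
    # The KL term is what makes sampling meaningful; with mode() and a frozen encoder it can be dropped.
    kl_loss = posterior.kl().mean()
    return pred, kl_loss
```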
Hello @FrsECM!
It's hard to tell without trying, but I think we also need to keep in mind that the Stable Diffusion performance is bounded by the VAE performance: if the VAE can only generate blurry images, then Stable Diffusion will produce blurry images, no matter how well the unet is trained.
@FrsECM Would you like to share your VAE training implementation?
Hi @trouble-maker007 @ThibaultCastells, for me it confirms that it's better to train with uncertainty. You can find my training script there:
I think your method is correct. Although using posterior.mode() looks better, it actually abandons randomness.
Hello, I have tried what you said for the fp16 issue, but I still got an error:
Can you share your full script? Thanks!
@sapkun However, that VAE code isn't perfect either. I've modified it to train with my own data and will share it again once it's up on git.
Thanks for your reply. My question is: you mentioned that you used the
@sapkun
And if you train a VAE to use with Stable Diffusion, you should definitely train only the decoder part.
Hi! Can you finetune the VAE with fp16? Thanks!
@yeonsikch Thanks for the great work. However, I ran into the problem of creating a negative tensor when running your example code above, see below: Any potential solutions? Thanks.
I am a bit confused with HF AutoencoderKL after training the CompVis AutoEncoder.
@linnanwang I'm sorry, I don't know either. I think that it's a version issue (torch version).
@humanely
Can you elaborate on this? Is this the result of, left image: diffusion model sampling then feeding to a decoder trained with noise, and right image: diffusion model sampling then feeding to a decoder trained without noise? And is scale = 7.5 the CFG scale?
@linnanwang @jiangyuhangcn I made a "fork" of the code here: https://github.com/kukaiN/vae_finetune/tree/main I also had the same issue with mixed precision (I wanted to use bf16) and negative dimension (caused by mismatching precisions), so I made some modifications. I also added xformers to the code. The changes are listed in the readme, but tl;dr: force-initializing the trainable weights and using autocast in the training loop fixes the code to run mixed precision.
@kukaiN Thanks! Good fix. Can you elaborate more about the cause (mismatching precisions)?
@KimbingNg I just want to confirm whether you reloaded the weights to float32 after the initial loading, and whether the autocast scope contains the forward pass up to the backpropagation, like the snippet below. My suspicion is that the mixed precision error happens because a part of the model is not properly cast. I made the changes based on the linked question/discussion, but I didn't pinpoint which layer is causing the problem.

```python
# line 413 ~ 432:
# we load it with float32 here, but we cast it again right after
vae = AutoencoderKL.from_pretrained(model_path, ..., torch_dtype=torch.float32)
vae.requires_grad_(True)

# https://stackoverflow.com/questions/75802877/issues-when-using-huggingface-accelerate-with-fp16
# load params with fp32, which is auto-cast later to mixed precision, may be needed for ema
#
# from stackoverflow's answer, it links to diffusers' sdxl training script example and in that code there's another link
# which points to https://github.com/huggingface/diffusers/pull/6514#discussion_r1447020705
# which may suggest we need to do all this casting before passing the learnable params to the optimizer
for param in vae.parameters():
    if param.requires_grad:
        param.data = param.to(torch.float32)
...
vae.to(dtype=weight_dtype)  # weight_dtype is fp16 or bf16
...
# training loop:
# with autocast():
#     forward process
#     ...
#     backpropagate the loss
```
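Expanding the commented `# training loop` above into an explicit step, as a sketch only: it assumes the names from the script being discussed (`vae`, `optimizer`, `train_dataloader`, `weight_dtype`, `accelerator`) and a CUDA device.

```python
import torch
import torch.nn.functional as F

for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(vae):
        target = batch["pixel_values"].to(weight_dtype)
        # Run the forward pass and loss under autocast so the fp32 trainable params
        # and the fp16/bf16 activations mix cleanly.
        with torch.autocast(device_type="cuda", dtype=weight_dtype):
            posterior = vae.encode(target).latent_dist
            pred = vae.decode(posterior.sample()).sample
            loss = F.mse_loss(pred.float(), target.float(), reduction="mean")
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```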
We have a VQGAN VAE: https://github.com/huggingface/diffusers/tree/main/examples/vqgan
Can we use this to finetune the SDXL VAE or the SD3 VAE?
I believe the current lack of easy access to VAE training is stopping diffusion models from disrupting even more industries.
I'm talking about consistent details on things that are less represented in the original training data. 64x64 resolution can only carry so much detail. Very often I get a good result from latent space (by checking the low-res intermediate image) before the final image is ruined by bad details. No prompting or finetuning or ControlNet could solve this issue; I tried, and I know lots of other people have tried, most of them without realising that the problem cannot be solved unless the thing that produces the final details can be trained with their domain data.
Right now the VAE cannot be easily trained, at least not by someone like me who is not very good at math and Python, so there is definitely a demand here. May I hope there will be a sample script based on diffusers to start with? I tried messing with the ones in the CompVis repo but to no avail. Thanks in advance!