Kohya started using more VRAM for SDXL and using more than it should be #1131

Open
FurkanGozukara opened this issue Feb 21, 2024 · 8 comments

FurkanGozukara commented Feb 21, 2024

I have a config that was running fine on Kaggle with previous versions.

Right now it is failing on a 15 GB GPU.

This should not happen.

The same settings in OneTrainer use less than 13.5 GB of VRAM; here it fails with 15 GB.

It wasn't failing before.

All images are 1024x1024 and all latents are cached.

Here is the full training command I used.

I did trainings on Kaggle in the past and this exact command was working; I even have a video of it here:

https://youtu.be/16-b1AjvyBE

  accelerate launch --num_cpu_threads_per_process=4 "./sdxl_train.py" \
    --max_grad_norm=0.0 --no_half_vae --train_text_encoder \
    --ddp_timeout=10000000 --ddp_gradient_as_bucket_view \
    --bucket_no_upscale --bucket_reso_steps=64 \
    --cache_latents --cache_latents_to_disk --full_fp16 \
    --gradient_checkpointing \
    --learning_rate="1e-05" --learning_rate_te1="3e-06" \
    --logging_dir="/kaggle/working/results/log" \
    --lr_scheduler="constant" --lr_scheduler_num_cycles="1" \
    --max_data_loader_n_workers="0" \
    --resolution="1024,1024" --max_train_steps="1500" \
    --mem_eff_attn --mixed_precision="fp16" \
    --optimizer_args scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01 \
    --optimizer_type="Adafactor" \
    --output_dir="/kaggle/working/results/model" \
    --output_name="2024_02_21_kaggle" \
    --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
    --reg_data_dir="/kaggle/working/results/reg" \
    --save_every_n_epochs="1" --save_model_as=safetensors \
    --save_precision="fp16" --train_batch_size="1" \
    --train_data_dir="/kaggle/working/results/img" \
    --vae="stabilityai/sdxl-vae" --xformers
Traceback (most recent call last):
  File "/kaggle/working/kohya_ss/./sdxl_train.py", line 779, in <module>
    train(args)
  File "/kaggle/working/kohya_ss/./sdxl_train.py", line 594, in train
    optimizer.step()
  File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 132, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 374, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 290, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 185, in patched_step
    return method(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/optimization.py", line 715, in step
    update = (grad**2) + group["eps"][0]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 242.00 MiB (GPU 1; 14.75 GiB total capacity; 14.34 GiB already allocated; 53.06 MiB free; 14.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
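
For what it's worth, the allocator hint at the end of that error can be tried by setting PYTORCH_CUDA_ALLOC_CONF before launching. This only helps with fragmentation and will not recover VRAM that the training itself genuinely allocates; the 512 value below is just an illustrative choice, not something recommended in this thread.

    # Allocator tweak for fragmentation only (the value is an example).
    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
    accelerate launch --num_cpu_threads_per_process=4 "./sdxl_train.py" ...  # same arguments as above
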
FurkanGozukara (Author) commented:

@kohya-ss

FurkanGozukara (Author) commented:

Currently it uses at least 15.7 GB on Kaggle.

So it works with the P100 GPU, but that means people can't use the much faster T4, and Kaggle gives dual T4s.

Also, anyone who has a 16 GB GPU can't use it properly either.
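
If it helps to pin down the exact peak, a simple way to watch usage while the training runs (assuming nvidia-smi is available in the Kaggle session, e.g. from a terminal or a second cell) is:

    # Print per-GPU memory usage every 5 seconds while training runs elsewhere.
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5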

kohya-ss (Owner) commented:

With these options, Text Encoder 2 is trained with the learning rate=1e-5, because --train_text_encoder is specified. I think OneTrainer may train Text Encoder 1 only. If you want to stop Text Encoder 2 training, please specify --learning_rate_te2=0.
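
For reference, a minimal sketch of how that change would look in the command above, showing only the learning-rate flags (everything else stays the same):

    # Train the U-Net and Text Encoder 1, but disable Text Encoder 2 training.
    ... --train_text_encoder \
        --learning_rate="1e-05" \
        --learning_rate_te1="3e-06" \
        --learning_rate_te2=0 \
        ...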

FurkanGozukara (Author) commented Feb 22, 2024

> With these options, Text Encoder 2 is trained with the learning rate=1e-5, because --train_text_encoder is specified. I think OneTrainer may train Text Encoder 1 only. If you want to stop Text Encoder 2 training, please specify --learning_rate_te2=0.

Wow, in that case this is a bug, because this is what the bmaltais GUI generates. I will report it to him.
I will test it and reply back here, thank you.

So when we don't provide a TE2 learning rate, what does the trainer use? Because this is a big problem for me.

FurkanGozukara (Author) commented:

Yep, I verified that this bug exists and it breaks my config :/

Thank you so much, Kohya.


Iipython commented Feb 28, 2024

Hey, I am encountering the same problem today!
I have two clones of sd-scripts: one cloned in December 2023 and one downloaded today.
But I found that the new code always reports "out of memory" with the same configuration as the old one:

  --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
  --vae=madebyollin/sdxl-vae-fp16-fix \
  --dataset_config=/home/lyh/sdvs/sd-scripts/config/finetune.toml \
  --output_dir=/home/lyh/sd-scripts/output/finetune_15W \
  --output_name=finetune_15W \
  --save_model_as=safetensors \
  --save_every_n_epochs=1 \
  --save_precision="fp16" \
  --max_token_length=225 \
  --min_timestep=0 \
  --max_timestep=1000 \
  --max_train_epochs=2000 \
  --learning_rate=4e-6 \
  --lr_scheduler="constant" \
  --optimizer_type="AdamW8bit" \
  --xformers \
  --gradient_checkpointing \
  --gradient_accumulation_steps=128 \
  --mem_eff_attn \
  --mixed_precision="fp16" \
  --logging_dir=logs

The weird thing is:

The VRAM occupation with the new code:
[screenshot of VRAM usage with the new code]

The VRAM occupation with the old code:
[screenshot of VRAM usage with the old code]

Why? What is different?
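
If both copies are git clones, one way to narrow down what changed between them (the paths and the commit placeholder below are hypothetical) is:

    # Record the commit of the December 2023 checkout.
    git -C /path/to/old/sd-scripts rev-parse HEAD
    # In the new checkout, list training-code changes since that commit.
    git -C /path/to/new/sd-scripts log --oneline <old_commit>..HEAD -- sdxl_train.py library/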

kohya-ss (Owner) commented Feb 29, 2024

As I mentioned in #1141, the multiple-GPU issue seems to have a different cause.

hufenghufeng commented:

> Hey, I am encountering the same problem today! I have two clones of sd-scripts: one cloned in December 2023 and one downloaded today. But I found that the new code always reports "out of memory" with the same configuration as the old one: [...]
>
> Why? What is different?

same problem
