The new version of the code is using more VRAM #1141

Open
Iipython opened this issue Feb 29, 2024 · 5 comments

Comments

@Iipython

Hey, I am encountering the same problem today!
I have two clones of sd-scripts: one was cloned in December 2023, and the other was downloaded on Feb 28, 2024.
The new code always reports "out of memory" with the same configuration as follows:
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--vae=madebyollin/sdxl-vae-fp16-fix \
--dataset_config=/home/lyh/sdvs/sd-scripts/config/finetune.toml \
--output_dir=/home/lyh/sd-scripts/output/finetune_15W \
--output_name=finetune_15W \
--save_model_as=safetensors \
--save_every_n_epochs=1 \
--save_precision="fp16" \
--max_token_length=225 \
--min_timestep=0 \
--max_timestep=1000 \
--max_train_epochs=2000 \
--learning_rate=4e-6 \
--lr_scheduler="constant" \
--optimizer_type="AdamW8bit" \
--xformers \
--gradient_checkpointing \
--gradient_accumulation_steps=128 \
--mem_eff_attn \
--mixed_precision="fp16" \
--logging_dir=logs \

The weird thing is this:
the VRAM occupation with the new code:
[screenshot: VRAM usage with the new code]

the VRAM occupation with the old code:
[screenshot: VRAM usage with the old code]

Why? What is different?

@kohya-ss
Owner

PR #989 was merged into the main branch on Dec 21, 2023. I think it may be the cause of this issue. Please add --ddp_gradient_as_bucket_view and --ddp_static_graph to reduce VRAM usage with multi-GPU training.
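
(For reference only, a sketch of how these options might be passed. They are regular script arguments appended to the accelerate launch command; the script name and process count here are placeholders taken from the rest of the thread, and the remaining arguments are elided:)

accelerate launch --multi_gpu --num_processes=2 sdxl_train.py \
  --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
  ... (remaining arguments as in the configuration above) ... \
  --ddp_gradient_as_bucket_view \
  --ddp_static_graph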

@storuky

storuky commented Mar 21, 2024

@kohya-ss these flags didn't help me with multi-GPU training... I have 3x RTX 4090 on board.
This is how I run sdxl_train.py with a single GPU (I set up accelerate for it as well):

accelerate launch --num_cpu_threads_per_process=2 "/home/storuky/ml/train/kohya/sd-scripts/sdxl_train.py" --cache_text_encoder_outputs --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --gradient_checkpointing --learning_rate="2e-06" --logging_dir="/home/storuky/ml/train/data_dir/log" --lr_scheduler="constant" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3200" --mixed_precision="bf16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --optimizer_type="Adafactor" --output_dir="/home/storuky/ml/train/data_dir/model" --output_name="OutModel" --pretrained_model_name_or_path="/home/storuky/ml/sd/stable-diffusion-webui-forge/models/Stable-diffusion/Training2-000005.safetensors" --reg_data_dir="/home/storuky/ml/train/data_dir/reg" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="bf16" --train_batch_size="1" --train_data_dir="/home/storuky/ml/train/data_dir/img" --xformers

It works fine.
But when I configure accelerate to use 3 GPUs and run the same command with --multi_gpu and --num_processes=3:

accelerate launch --num_processes=3 --multi_gpu --num_cpu_threads_per_process=2 "/home/storuky/ml/train/kohya/sd-scripts/sdxl_train.py" --cache_text_encoder_outputs --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --gradient_checkpointing --learning_rate="2e-06" --logging_dir="/home/storuky/ml/train/data_dir/log" --lr_scheduler="constant"  --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3200" --mixed_precision="bf16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --optimizer_type="Adafactor" --output_dir="/home/storuky/ml/train/data_dir/model" --output_name="OutModel" --pretrained_model_name_or_path="/home/storuky/ml/sd/stable-diffusion-webui-forge/models/Stable-diffusion/Training2-000005.safetensors" --reg_data_dir="/home/storuky/ml/train/data_dir/reg" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="bf16" --train_batch_size="1" --train_data_dir="/home/storuky/ml/train/data_dir/img" --xformers

I get an OOM error. I tried adding --ddp_gradient_as_bucket_view and --ddp_static_graph as you mentioned, but I still get OOM.

@storuky

storuky commented Mar 21, 2024

I reverted PR #989 locally, and now it uses less VRAM.
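
(For reference only, a sketch of one way such a revert can be done in a local clone; the commit hash for PR #989 is not given in the thread and has to be looked up first:)

git log --oneline --grep="#989"      # find the commit that merged or squashed PR #989
git revert --no-edit <commit_sha>    # add -m 1 if it is a merge commit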

@kohya-ss
Owner

PR #989 fixes gradient synchronization. If #989 is reverted, the gradients are not synchronized, so in my understanding it is similar to single-GPU training.

I'm not familiar with multi-GPU training, but could you try training with the --full_bf16 option? If it works, there may be some overhead in multi-GPU training, and 24 GB may not be sufficient.
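
(For reference only, a sketch of how the option would be added to the multi-GPU command above, with most arguments elided; --full_bf16 is used together with --mixed_precision="bf16", which that command already sets:)

accelerate launch --num_processes=3 --multi_gpu --num_cpu_threads_per_process=2 sdxl_train.py \
  ... (same arguments as in the multi-GPU command above) ... \
  --mixed_precision="bf16" --save_precision="bf16" \
  --full_bf16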

@storuky

storuky commented Mar 21, 2024

@kohya-ss yes, full_bf16 works well in terms of VRAM usage, but the results are much worse in terms of accuracy 🤷‍♂️ For example, hair sticks together as if dirty, and small detailed objects turn into blots... etc.
Probably full_bf16 needs a different optimizer/LR/scheduler setup... Do you have any handy notes on what we need to know about full_bf16?
