The new version of the code is using more VRAM #1141

Open
Iipython opened this issue Feb 29, 2024 · 5 comments

Comments

@Iipython

Hey, I am encountering the same problem today!
I have two clones of sd-scripts: one was cloned in December 2023, and the other was downloaded on Feb 28, 2024.
The new code always reports "out of memory" with the same configuration as follows:
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--vae=madebyollin/sdxl-vae-fp16-fix \
--dataset_config=/home/lyh/sdvs/sd-scripts/config/finetune.toml \
--output_dir=/home/lyh/sd-scripts/output/finetune_15W \
--output_name=finetune_15W \
--save_model_as=safetensors \
--save_every_n_epochs=1 \
--save_precision="fp16" \
--max_token_length=225 \
--min_timestep=0 \
--max_timestep=1000 \
--max_train_epochs=2000 \
--learning_rate=4e-6 \
--lr_scheduler="constant" \
--optimizer_type="AdamW8bit" \
--xformers \
--gradient_checkpointing \
--gradient_accumulation_steps=128 \
--mem_eff_attn \
--mixed_precision="fp16" \
--logging_dir=logs \

The weird thing is this:
the VRAM occupation with the new code:
[screenshot: VRAM usage with the new code]

the VRAM occupation with the old code:
[screenshot: VRAM usage with the old code]

Why? What is different?

@kohya-ss
Owner

PR #989 was merged into the main branch on Dec 21, 2023. I think it may be the cause of this issue. Please add --ddp_gradient_as_bucket_view and --ddp_static_graph to reduce VRAM usage with multi-GPU training.
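
(For reference only, a sketch of how these options might be passed. They are regular script arguments appended to the accelerate launch command; the script name and process count here are placeholders taken from the rest of the thread, and the remaining arguments are elided:)

accelerate launch --multi_gpu --num_processes=2 sdxl_train.py \
  --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
  ... (remaining arguments as in the configuration above) ... \
  --ddp_gradient_as_bucket_view \
  --ddp_static_graph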

@storuky

storuky commented Mar 21, 2024

@kohya-ss these flags didn't help me with multi-GPU training... I have 3x RTX 4090 on board.
This is how I run sdxl_train.py with a single GPU (I set up accelerate for it as well):

accelerate launch --num_cpu_threads_per_process=2 "/home/storuky/ml/train/kohya/sd-scripts/sdxl_train.py" --cache_text_encoder_outputs --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --gradient_checkpointing --learning_rate="2e-06" --logging_dir="/home/storuky/ml/train/data_dir/log" --lr_scheduler="constant" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3200" --mixed_precision="bf16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --optimizer_type="Adafactor" --output_dir="/home/storuky/ml/train/data_dir/model" --output_name="OutModel" --pretrained_model_name_or_path="/home/storuky/ml/sd/stable-diffusion-webui-forge/models/Stable-diffusion/Training2-000005.safetensors" --reg_data_dir="/home/storuky/ml/train/data_dir/reg" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="bf16" --train_batch_size="1" --train_data_dir="/home/storuky/ml/train/data_dir/img" --xformers

It works fine.
But when I configure accelerate to use 3 GPUs and run the same command with --multi_gpu and --num_processes=3:

accelerate launch --num_processes=3 --multi_gpu --num_cpu_threads_per_process=2 "/home/storuky/ml/train/kohya/sd-scripts/sdxl_train.py" --cache_text_encoder_outputs --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --gradient_checkpointing --learning_rate="2e-06" --logging_dir="/home/storuky/ml/train/data_dir/log" --lr_scheduler="constant"  --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3200" --mixed_precision="bf16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --optimizer_type="Adafactor" --output_dir="/home/storuky/ml/train/data_dir/model" --output_name="OutModel" --pretrained_model_name_or_path="/home/storuky/ml/sd/stable-diffusion-webui-forge/models/Stable-diffusion/Training2-000005.safetensors" --reg_data_dir="/home/storuky/ml/train/data_dir/reg" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="bf16" --train_batch_size="1" --train_data_dir="/home/storuky/ml/train/data_dir/img" --xformers

I get an OOM error. I tried adding --ddp_gradient_as_bucket_view and --ddp_static_graph as you mentioned, but I still get OOM.

@storuky

storuky commented Mar 21, 2024

I reverted PR #989 locally, and now it uses less VRAM.
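
(For reference only, a sketch of one way such a revert can be done in a local clone; the commit hash for PR #989 is not given in the thread and has to be looked up first:)

git log --oneline --grep="#989"      # find the commit that merged or squashed PR #989
git revert --no-edit <commit_sha>    # add -m 1 if it is a merge commit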

@kohya-ss
Owner

PR #989 fixes gradient synchronization. If #989 is reverted, the gradients are not synchronized, so in my understanding it is similar to single-GPU training.

I'm not familiar with multi-GPU training, but could you try training with the --full_bf16 option? If it works, there may be some overhead in multi-GPU training, and 24 GB may not be sufficient.
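
(For reference only, a sketch of how the option would be added to the multi-GPU command above, with most arguments elided; --full_bf16 is used together with --mixed_precision="bf16", which that command already sets:)

accelerate launch --num_processes=3 --multi_gpu --num_cpu_threads_per_process=2 sdxl_train.py \
  ... (same arguments as in the multi-GPU command above) ... \
  --mixed_precision="bf16" --save_precision="bf16" \
  --full_bf16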

@storuky

storuky commented Mar 21, 2024

@kohya-ss yes, full_bf16 works well in terms of VRAM usage, but the results are much worse in terms of accuracy 🤷‍♂️ For example, hair sticks together as if dirty, and small detailed objects turn into blots... etc.
Probably full_bf16 needs a different optimizer/LR/scheduler setup... Do you have any handy notes on what we need to know about full_bf16?
