Is there any way to reduce system RAM usage? I am trying to make a tutorial for free Colab SDXL LoRA training #788
Comments
Kaggle provides 2 GPUs. Can we load the U-Net and the text encoders into one of the GPUs with the --lowram option?
@FurkanGozukara If you use Kaggle to train an SDXL LoRA, I recommend using a converted pretrained checkpoint from an HF repo instead of pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0". Due to the limited RAM, it is a little too 'extreme' for Kaggle's kernel to initialize the models on 2 GPUs.
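For illustration, here is a minimal sketch of what "using a converted checkpoint" can look like: download a single-file .safetensors checkpoint and point pretrained_model_name_or_path at the local path instead of the diffusers-format repo id. The filename sd_xl_base_1.0.safetensors is an assumption for the example; substitute whatever converted checkpoint you actually use.

```python
# Sketch: fetch a single-file SDXL checkpoint and pass its local path to the
# training script instead of the diffusers-format repo id.
# Assumption: the repo hosts a converted "sd_xl_base_1.0.safetensors" file.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="stabilityai/stable-diffusion-xl-base-1.0",
    filename="sd_xl_base_1.0.safetensors",
)

# The training command would then receive the local file, e.g.:
#   --pretrained_model_name_or_path <ckpt_path>
print(ckpt_path)
```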
Here is test code that simulates how the scripts load the model from the HF repo. Without RAM-costing operations such as caching latents, it takes about 7.2 GB of RAM to load the model on the first GPU and about 11.1 GB after also loading on the second GPU.

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from accelerate.utils.modeling import set_module_tensor_to_device
from diffusers import StableDiffusionXLPipeline
import torch
import gc

from library import sdxl_model_util, sdxl_original_unet


def load_target_model(device):
    # Load the diffusers-format SDXL pipeline in fp16.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
    )
    text_encoder1 = pipe.text_encoder
    text_encoder2 = pipe.text_encoder_2
    vae = pipe.vae
    unet = pipe.unet
    del pipe
    gc.collect()

    print("convert U-Net to original U-Net")
    state_dict = sdxl_model_util.convert_diffusers_unet_state_dict_to_sdxl(unet.state_dict())
    # Build the original U-Net with empty (meta) weights, then move tensors in one by one.
    with init_empty_weights():
        unet = sdxl_original_unet.SdxlUNet2DConditionModel()
    for k in list(state_dict.keys()):
        set_module_tensor_to_device(unet, k, device, value=state_dict.pop(k))
    print("U-Net converted to original U-Net")

    return text_encoder1, text_encoder2, unet


for device in ["cuda:0", "cuda:1"]:
    text_encoder1, text_encoder2, unet = load_target_model(device)
    text_encoder1.to(device)
    text_encoder2.to(device)
```
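As an aside, if you want to reproduce RAM figures like the ones above on your own kernel, a minimal sketch using psutil (an assumption; any process-memory tool works) looks like this:

```python
# Sketch: report the current process's resident RAM, e.g. before and after
# calling load_target_model(), to reproduce numbers like 7.2 GB / 11.1 GB.
import psutil

def report_ram(tag: str) -> None:
    rss_gb = psutil.Process().memory_info().rss / 1024**3
    print(f"[{tag}] resident RAM: {rss_gb:.1f} GB")

report_ram("before load")
# text_encoder1, text_encoder2, unet = load_target_model("cuda:0")
# report_ram("after load on cuda:0")
```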
I am trying to make this work for the GUI version. So what is the logic of loading diffusers models? Do I just need to give the Stability AI repo name? Or are there any CLI arguments to load into 2 GPUs?
It has been merged in #676. It initializes the U-Net in …, so you just need to change pretrained_model_name_or_path to "stabilityai/stable-diffusion-xl-base-1.0". (I'm not sure whether loading from an HF repo works in bmaltais's GUI or not; I have seldom used the GUI version.) (;´ヮ`)7
Thank you. I tested this and it is better than before; at least it starts to load into VRAM, but it still crashes. I tried both with --lowram and without it, and with all optimizations. When --lowram is used, I think this time the crash happens due to an out-of-VRAM error.
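(When debugging crashes like this, one quick way to tell RAM pressure apart from VRAM pressure is to log PyTorch's per-GPU memory counters; a minimal sketch, assuming a standard PyTorch setup:)

```python
# Sketch: print allocated and reserved VRAM for each visible GPU, e.g. right
# before the point where training crashes, to confirm an out-of-VRAM cause.
import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i} allocated={allocated:.1f} GB reserved={reserved:.1f} GB")
```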
As I presented above, using multi-GPU to train a LoRA with only 47 images on Kaggle costs about 12.8 GB of RAM, because it already costs about 11.1 GB of RAM just to load the models on the two GPUs (I think it is hard to optimize this part's RAM usage much further). So if the dataset is large (normally 1000+ images), the RAM usage will easily reach 13 GB and crash the kernel. However, it seems that training works normally on one GPU, so I think we can enable only one GPU on Kaggle to make sure the training runs steadily.
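One simple way to enforce this (a sketch, assuming you launch training from your own notebook cell rather than through a GUI) is to hide the second GPU before anything CUDA-related is initialized:

```python
# Sketch: make only the first Kaggle GPU visible to the process.
# This must run before torch (or the training script) initializes CUDA.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported after setting the environment variable

print(torch.cuda.device_count())  # expected: 1
```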
Maybe you can cache latents to disk for your dataset before starting the training. Once the … Anyway, a possible way to reduce RAM and VRAM usage is integrating …
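To illustrate the idea (this is not the scripts' exact mechanism): caching means encoding every training image through the VAE once, saving the latents to disk, and letting the training loop read those files instead of keeping the VAE and raw images in memory. A minimal sketch, assuming an images/ folder of PNGs and the SDXL VAE in fp16:

```python
# Sketch: pre-encode training images to latents and save them next to the images,
# so the VAE costs no RAM/VRAM during the actual training run.
from pathlib import Path

import torch
from PIL import Image
from diffusers import AutoencoderKL
from torchvision import transforms

device = "cuda"
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae", torch_dtype=torch.float16
).to(device)

to_tensor = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # scale pixels to [-1, 1]
])

for image_path in Path("images").glob("*.png"):
    image = to_tensor(Image.open(image_path).convert("RGB"))
    image = image.unsqueeze(0).to(device, dtype=torch.float16)
    with torch.no_grad():
        latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    torch.save(latents.cpu(), image_path.with_suffix(".latent.pt"))
```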
@Isotr0py So how do I cache to disk before starting?
@FurkanGozukara Use …
@FurkanGozukara The PR should reduce RAM and VRAM usage with … The training should run normally with latent caching on Kaggle.
Thank you so much. I will test it when @bmaltais pulls it into the Gradio version.
@Isotr0py Amazing update - it was just merged into the Kohya GUI today. I made a speed comparison here: https://twitter.com/GozukaraFurkan/status/1698471340032872721. I also updated my tutorial GitHub readme file: How To Do SDXL LoRA Training On RunPod With Kohya SS GUI Trainer & Use LoRAs With Automatic1111 UI.
Hello. I am trying to make a tutorial for free Colab.
It has a good GPU, but the problem is that system RAM is only 13 GB.
We need some options or a way to reduce RAM usage.