
LoRA support #72

Merged
merged 4 commits from tdrussell:lora into kijai:main on Dec 12, 2024
Conversation

tdrussell
Contributor

This adds LoRA support. I have a training script that can train LoRAs, and I will open-source that code in the next few days; this PR gets the LoRA loading side ready. I've already tested it and can confirm that it works as expected with the test LoRA I've trained.

That being said, I'm not familiar with ComfyUI internals AT ALL. If this change is doing something wrong, please let me know.

I had to remove the mm.unload_all_models() call from the sampler. The transformer is now a normal ComfyUI model, so without that change it would get offloaded to CPU before the denoising loop. I don't know why the call was needed in the first place, since when force_offload is set, every model appears to be moved back to CPU once it's done being used.
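Roughly, the sampler now follows this pattern (an illustrative sketch only; the denoise helper name is made up):

```python
import comfy.model_management as mm

def sample(model_patcher, latents, force_offload=True):
    # The transformer is wrapped in a ModelPatcher, so ComfyUI's model management
    # loads it onto the GPU here; calling mm.unload_all_models() at this point,
    # as before, would push it back to CPU right before denoising.
    mm.load_models_gpu([model_patcher])
    samples = run_denoise_loop(model_patcher.model, latents)  # hypothetical denoise helper
    if force_offload:
        # Once the node is done with the model, move it back to the offload device.
        model_patcher.model.to(mm.unet_offload_device())
        mm.soft_empty_cache()
    return samples
```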

@toyxyz

toyxyz commented Dec 9, 2024

Excellent, thank you for your wonderful work!

I was wondering if it's possible to train a HunyuanVideo LoRA with 24 GB.

@kijai
Owner

kijai commented Dec 9, 2024

Hey, appreciate the support! But I don't really see how this would allow loading LoRAs: as a wrapper it doesn't use Comfy sampling, and thus no weight patching etc. When you say it works, do you mean the LoRA weights actually get loaded and affect the output?

As this is a diffusers wrapper, I think we'd have to use diffusers' LoRA loading. I've previously done that for CogVideoX, so it's probably not an issue to add in a similar manner.

@tdrussell
Contributor Author

Yes, it actually works and affects the output. I realize I should provide more proof than "just trust me bro". Here is a test LoRA you can use to confirm. It was trained on a very small number of black-and-white photos with no captions, so it will make everything grayscale even without mentioning that in the prompt. You may need to bump the strength up to 1.2 or 1.4, but you'll see it works for both images and videos.

I was also surprised at first, but from reading through the ComfyUI code it looks like as long as the model is an attribute of a BaseModel and is wrapped in a ModelPatcher, the normal LoRA loader node can target any nn.Linear layer.
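Conceptually, the patch applied to each matched nn.Linear is just the standard LoRA weight update; here's a minimal PyTorch sketch of the idea (illustrative only, not ComfyUI's actual code, and ignoring the per-LoRA alpha/rank scaling):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def apply_lora_to_linear(linear: nn.Linear,
                         lora_A: torch.Tensor,   # shape (rank, in_features)
                         lora_B: torch.Tensor,   # shape (out_features, rank)
                         strength: float = 1.0) -> None:
    """Merge a LoRA delta into a single nn.Linear, in place."""
    delta = (lora_B.float() @ lora_A.float()) * strength  # (out_features, in_features)
    linear.weight.add_(delta.to(dtype=linear.weight.dtype, device=linear.weight.device))
```

As far as I can tell, ComfyUI's patcher effectively does this (plus keeping a backup of the original weights so the patch can be undone), which is why raising the strength to 1.2 or 1.4 simply scales the delta.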

@DocShotgun

What do the hardware requirements look like for training LoRAs?

I'm assuming Hunyuan is still in a very unoptimized state (it takes me ~33 GB of peak VRAM and 50 minutes to generate 720x1280x129f on an A6000).

@tdrussell
Contributor Author

For image training, the memory requirements are similar to Flux. You can have the model in fp8, and basically any reasonable resolution easily fits in 24 GB. The model learns quite well from images alone, and is able to generalize that to video.

For video, 512x512 at 33 frames uses 22.7 GB for a rank 32 LoRA. At some point I want to implement ring attention and/or DeepSpeed Ulysses to parallelize over the sequence dimension, which should let you train on larger and longer videos if you have multiple GPUs. But right now I'm working on the final changes to release the code I have so far.

@rishipandey125

Can you use the image training mode of the script to guide the overall style of videos?

I'm guessing training video LoRAs impacts the output motion more.

@toyxyz

toyxyz commented Dec 10, 2024

> For image training, the memory requirements are similar to Flux. You can have the model in fp8, and basically any reasonable resolution easily fits in 24 GB. The model learns quite well from images alone, and is able to generalize that to video.
>
> For video, 512x512 at 33 frames uses 22.7 GB for a rank 32 LoRA. At some point I want to implement ring attention and/or DeepSpeed Ulysses to parallelize over the sequence dimension, which should let you train on larger and longer videos if you have multiple GPUs. But right now I'm working on the final changes to release the code I have so far.

That's very interesting! Can you show us a video you created using the LoRA?

@kijai
Owner

kijai commented Dec 10, 2024

Alright, I admit my doubts were wrong: load_models_gpu actually does apply the weights. It's a very interesting approach; it just needs more testing to see how wrapping the model like this with the model patcher affects everything else. I've never done it before, but I have thought about it.

A couple of issues I noticed initially: as with base Comfy LoRAs, it doesn't work with torch.compile (I've used a workaround for this before; it's somewhat clumsy, but maybe applicable here as well). Also, the first run for some reason uses ~4 GB more VRAM, but on the second run it's fine, so there are some memory management issues to solve.

@tdrussell
Contributor Author

tdrussell commented Dec 10, 2024

What settings are you using? I'm not seeing a VRAM increase. This is with fp8_e4m3fn_fast and flash_attn_varlen. Without the LoRA, first and second runs:

Sampling 65 frames in 17 latents at 512x720 with 30 inference steps
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:47<00:00, 5.59s/it]
Allocated memory: memory=12.737 GB
Max allocated memory: max_memory=16.686 GB
Max reserved memory: max_reserved=17.656 GB
nvtop: 19.4 gb

Sampling 65 frames in 17 latents at 512x720 with 30 inference steps
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:47<00:00, 5.59s/it]
Allocated memory: memory=12.737 GB
Max allocated memory: max_memory=16.686 GB
Max reserved memory: max_reserved=17.656 GB
nvtop: 18.9 gb

With the LoRA:
Sampling 65 frames in 17 latents at 512x720 with 30 inference steps
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:46<00:00, 5.54s/it]
Allocated memory: memory=12.737 GB
Max allocated memory: max_memory=16.686 GB
Max reserved memory: max_reserved=17.656 GB
nvtop: 19.4 gb

Sampling 65 frames in 17 latents at 512x720 with 30 inference steps
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:49<00:00, 5.65s/it]
Allocated memory: memory=12.277 GB
Max allocated memory: max_memory=16.226 GB
Max reserved memory: max_reserved=17.188 GB
nvtop: 18.9 gb

I also report the peak VRAM use I see in nvtop during denoising. In both cases, nvtop shows ~500 MB less on the second run. Interestingly, the logged memory usage with the LoRA is also slightly lower on the second run, but not without the LoRA. At any rate, they are almost identical with and without the LoRA.

EDIT: I should also add that I'm using Linux, if that matters.

@WangRongsheng

@tdrussell Hi, I would like to know if your script supports LoRA fine-tuning of HunyuanVideo and can be used independently; in fact, I am not very good at using ComfyUI. I think it is important to be able to train and run inference on HunyuanVideo independently.

@tdrussell
Contributor Author

LoRA training code: https://github.com/tdrussell/diffusion-pipe

@NeilsMabet

> LoRA training code: https://github.com/tdrussell/diffusion-pipe

Hi, I trained a model for 20 epochs as a test using your diffusion-pipe script, but I'm lost on how to apply that in ComfyUI and the wrapper. Is there a specific node I should use in this branch?

@kijai
Owner

kijai commented Dec 11, 2024

> LoRA training code: https://github.com/tdrussell/diffusion-pipe
>
> Hi, I trained a model for 20 epochs as a test using your diffusion-pipe script, but I'm lost on how to apply that in ComfyUI and the wrapper. Is there a specific node I should use in this branch?

You just use the normal Comfy model-only LoRA loader node.

@zaitaxiang2012

[screenshot: node type mismatch]

Model type mismatch, so the nodes can't be linked. I modified the type to MODEL myself, but it doesn't work.

@tdrussell
Contributor Author

You shouldn't have to modify anything yourself. Did you check out this PR? You need to do that; there are more changes than just the model type.

@kijai
Owner

kijai commented Dec 12, 2024

The memory issue seems to happen when the loader node is set to load the model on main_device and a LoRA is loaded; it probably leaves the original weights on the GPU or something like that, and after the first offload it clears up.

For the compilation issue... I really don't know. It's the same issue ComfyUI innately has with LoRAs and compile. I have a hacky solution for that (which I just updated so it works again), but it's not optimal and is prone to break with ComfyUI updates.

I'm still on the fence about whether it'd be better to just use the diffusers LoRA loading/fusing, since once the weights are fused it's agnostic to everything else that changes the model, allowing compile/quantization etc. with no issues.
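For reference, the diffusers route would look roughly like this (just a sketch: HunyuanVideo has no official diffusers pipeline yet, so the pipeline class and model id below are placeholders, and I'm assuming a pipeline that implements diffusers' usual LoRA-loading mixin):

```python
import torch
from diffusers import DiffusionPipeline  # an actual HunyuanVideo pipeline class is assumed here

pipe = DiffusionPipeline.from_pretrained(
    "placeholder/hunyuan-video-checkpoint",  # placeholder model id
    torch_dtype=torch.bfloat16,
)
pipe.load_lora_weights("path/to/lora.safetensors")  # diffusers-side LoRA loading
pipe.fuse_lora()  # merge the LoRA into the base weights
# After fusing, the transformer is just plain weights again, so compile/quantization
# don't need to know a LoRA was ever involved.
pipe.transformer = torch.compile(pipe.transformer)
```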

Bringing this whole thing to Comfy natively would take a lot more work and would be a pain to maintain, unless they picked it up as part of the core; possibly they already have plans for that.

@OrphBean

OrphBean commented Dec 12, 2024

@kijai I asked on their repo and they referred me back here, so I encourage you to ask them, or assume it's not happening in the short term at least.

@kijai
Owner

kijai commented Dec 12, 2024

I think I'll do this as a compromise. I do like the Comfy LoRA loading over diffusers; it's just that the patching way of doing it causes issues with compilation etc.

[screenshot of the proposed node setup]

This would still use the Comfy LoRA loading as this PR outlined; however, the LoRA would be loaded in the ModelLoader node instead. This allows loading it before compilation, thus making compile work with LoRAs.

The downside is that switching LoRAs requires reloading the whole model. I also don't really want to make the model output/input the Comfy MODEL type, as that creates the false assumption that it can work with other MODEL nodes (besides a couple of exceptions like the LoRA loader, of course), when it can't, since this is a wrapper.
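Conceptually the ordering is the whole point; something like this (illustrative only, the LoRA-applying helper is hypothetical):

```python
import torch
from typing import Optional

def load_transformer(transformer: torch.nn.Module,
                     lora_sd: Optional[dict] = None,
                     lora_strength: float = 1.0,
                     compile_model: bool = False) -> torch.nn.Module:
    if lora_sd is not None:
        # Merge the LoRA into the weights first (hypothetical helper; see the
        # earlier nn.Linear sketch), so the compiled graph only ever sees the
        # already-patched weights.
        apply_lora_state_dict(transformer, lora_sd, lora_strength)
    if compile_model:
        # Compilation happens last; swapping LoRAs later means reloading the model.
        transformer = torch.compile(transformer)
    return transformer
```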

@tdrussell
Contributor Author

Sounds good. One thing we should also decide is what the key format in the LoRA state_dict should look like. Right now the training script gives every key a "diffusion_model." prefix, since that works with ComfyUI LoRA loading out of the box with no changes. So like this:

diffusion_model.single_blocks.13.linear1.lora_A.weight
diffusion_model.single_blocks.13.linear1.lora_B.weight
diffusion_model.single_blocks.13.linear2.lora_A.weight
diffusion_model.single_blocks.13.linear2.lora_B.weight
etc

But Diffusers has its own format, although HunyuanVideo doesn't have official Diffusers integration yet. I think the convention is for the prefix to be the name of the attribute on the Diffusers pipeline object, i.e. in this case it would be:
transformer.single_blocks.13.linear1.lora_A.weight
transformer.single_blocks.13.linear1.lora_B.weight

This would need changes in ComfyUI code. E.g. you can find in the code where that prefix is handled specially for Flux so it can load plain Diffusers-trained LoRAs. There are already a couple of people starting to make HunyuanVideo LoRAs, so we should resolve the format soon.
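For illustration, converting between the two prefix conventions is just a key rename over the state_dict; a rough sketch using safetensors (file paths are placeholders):

```python
from safetensors.torch import load_file, save_file

def comfy_prefix_to_diffusers(path_in: str, path_out: str) -> None:
    sd = load_file(path_in)
    # Swap the "diffusion_model." prefix used by the training script for the
    # diffusers-style "transformer." prefix (the attribute name on the pipeline).
    renamed = {k.replace("diffusion_model.", "transformer.", 1): v for k, v in sd.items()}
    save_file(renamed, path_out)
```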

@kijai
Owner

kijai commented Dec 12, 2024

> Sounds good. One thing we should also decide is what the key format in the LoRA state_dict should look like. Right now the training script gives every key a "diffusion_model." prefix, since that works with ComfyUI LoRA loading out of the box with no changes. So like this:
>
> diffusion_model.single_blocks.13.linear1.lora_A.weight
> diffusion_model.single_blocks.13.linear1.lora_B.weight
> diffusion_model.single_blocks.13.linear2.lora_A.weight
> diffusion_model.single_blocks.13.linear2.lora_B.weight
> etc
>
> But Diffusers has its own format, although HunyuanVideo doesn't have official Diffusers integration yet. I think the convention is for the prefix to be the name of the attribute on the Diffusers pipeline object, i.e. in this case it would be:
>
> transformer.single_blocks.13.linear1.lora_A.weight
> transformer.single_blocks.13.linear1.lora_B.weight
>
> This would need changes in ComfyUI code. E.g. you can find in the code where that prefix is handled specially for Flux so it can load plain Diffusers-trained LoRAs. There are already a couple of people starting to make HunyuanVideo LoRAs, so we should resolve the format soon.

Yes, for example CogVideoX LoRAs use transformer.transformer_blocks. (it only has one block type, no single/double), so that sounds about right. With my suggested change it's also easy to load any prefix format without having to touch Comfy code.

kijai merged commit 7d824f7 into kijai:main on Dec 12, 2024
@kijai
Owner

kijai commented Dec 12, 2024

@tdrussell thanks a lot for all this. I merged with the changes I mentioned; I can adjust the prefix etc. in the future if needed.

tdrussell deleted the lora branch on December 13, 2024 at 02:07
@slavakurilyak

It would be great to see an updated README covering the LoRA support.

@Celtmant

Celtmant commented Dec 13, 2024

Hi, I see you've updated it to use LoRA, but here's what I noticed: with the new workflow, RAM consumption is just off the scale... for me it goes through the roof. With the previous workflow everything ran smoothly...

@kijai
Owner

kijai commented Dec 13, 2024

> Hi, I see you've updated it to use LoRA, but here's what I noticed: with the new workflow, RAM consumption is just off the scale... for me it goes through the roof. With the previous workflow everything ran smoothly...

The scale being what? Loading the LLM without quantization along with the model has always taken a lot of RAM. The nodes have options to juggle what is offloaded and what is kept in VRAM, which you can use to balance it for your system; overall RAM usage has not increased. To reduce overall RAM usage, the best thing is to use the bnb_nf4 quantization; it has very minimal effect on quality, if any.

@rayryeng

Would it be possible to have another example JSON that shows how LoRA would fit in the ComfyUI workflow? I'm very new to this landscape and I'm not sure what blocks I need to add given what I've seen in this conversation. Is it as simple as a LoRA loader after loading in the diffusion model?

@kijai
Owner

kijai commented Dec 14, 2024

> Would it be possible to have another example JSON that shows how LoRA would fit in the ComfyUI workflow? I'm very new to this landscape and I'm not sure what blocks I need to add given what I've seen in this conversation. Is it as simple as a LoRA loader after loading in the diffusion model?

In the current main version you simply connect a LoRA select node to the model loader node; it has a LoRA input now.

@rayryeng

@kijai Thank you so much for your reply!

@vtrixlaz

Sorry for the newbie question, and thanks @kijai for your amazing work. How would you train a LoRA for Hunyuan, and would it be Flux or Stable Diffusion based?

@rayryeng

rayryeng commented Dec 25, 2024

> Sorry for the newbie question, and thanks @kijai for your amazing work. How would you train a LoRA for Hunyuan, and would it be Flux or Stable Diffusion based?

There is a discussion thread here. The OP of this PR got the LoRA node merged in here and referenced his training repo for training LoRAs to plug into HunyuanVideo, but there is no official README about it in this repo.

As for the actual architectures LoRAs can be trained on, the repo supports Flux, Stable Diffusion, and HunyuanVideo. The LoRA in question here is for HunyuanVideo.

tdrussell/diffusion-pipe#6
