
LoRA support #72

Merged
merged 4 commits from tdrussell:lora into kijai:main on Dec 12, 2024
Conversation

tdrussell
Contributor

This adds LoRA support. I have a training script that can train LoRAs, and I will open-source that code in the next few days; this PR gets the LoRA loading side ready. I've already tested it and can confirm that it works as expected with the test LoRA I've trained.

That being said, I'm not familiar with ComfyUI internals AT ALL. If this change is doing something wrong, please let me know.

I had to remove the mm.unload_all_models() call from the sampler. The transformer is now a normal ComfyUI model, so without that change it would get offloaded to CPU before the denoising loop. I don't know why the call was needed in the first place, since when force_offload is set, every model appears to be moved back to CPU once it's done being used.
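Roughly, the sampler now follows this pattern (an illustrative sketch only; the denoise helper name is made up):

```python
import comfy.model_management as mm

def sample(model_patcher, latents, force_offload=True):
    # The transformer is wrapped in a ModelPatcher, so ComfyUI's model management
    # loads it onto the GPU here; calling mm.unload_all_models() at this point,
    # as before, would push it back to CPU right before denoising.
    mm.load_models_gpu([model_patcher])
    samples = run_denoise_loop(model_patcher.model, latents)  # hypothetical denoise helper
    if force_offload:
        # Once the node is done with the model, move it back to the offload device.
        model_patcher.model.to(mm.unet_offload_device())
        mm.soft_empty_cache()
    return samples
```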

@toyxyz

toyxyz commented Dec 9, 2024

Excellent, thank you for your wonderful work!

I was wondering if it's possible to train a HunyuanVideo LoRA with 24 GB.

@kijai
Owner

kijai commented Dec 9, 2024

Hey, appreciate the support! But I don't really see how this would allow loading LoRAs: as a wrapper it doesn't use Comfy sampling, and thus no weight patching etc. When you say it works, do you mean the LoRA weights actually get loaded and affect the output?

As this is a diffusers wrapper, I think we'd have to use diffusers' LoRA loading. I've previously done that for CogVideoX, so it's probably not an issue to add in a similar manner.

@tdrussell
Contributor Author

Yes, it actually works and affects the output. I realize I should provide more proof than "just trust me bro". Here is a test LoRA you can use to confirm. It was trained on a very small number of black-and-white photos with no captions, so it will make everything grayscale even without mentioning that in the prompt. You may need to bump the strength up to 1.2 or 1.4, but you'll see it works for both images and videos.

I was also surprised at first, but from reading through the ComfyUI code it looks like as long as the model is an attribute of a BaseModel and is wrapped in a ModelPatcher, the normal LoRA loader node can target any nn.Linear layer.
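Conceptually, the patch applied to each matched nn.Linear is just the standard LoRA weight update; here's a minimal PyTorch sketch of the idea (illustrative only, not ComfyUI's actual code, and ignoring the per-LoRA alpha/rank scaling):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def apply_lora_to_linear(linear: nn.Linear,
                         lora_A: torch.Tensor,   # shape (rank, in_features)
                         lora_B: torch.Tensor,   # shape (out_features, rank)
                         strength: float = 1.0) -> None:
    """Merge a LoRA delta into a single nn.Linear, in place."""
    delta = (lora_B.float() @ lora_A.float()) * strength  # (out_features, in_features)
    linear.weight.add_(delta.to(dtype=linear.weight.dtype, device=linear.weight.device))
```

As far as I can tell, ComfyUI's patcher effectively does this (plus keeping a backup of the original weights so the patch can be undone), which is why raising the strength to 1.2 or 1.4 simply scales the delta.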

@DocShotgun

What do the hardware requirements look like for training LoRAs?

I'm assuming Hunyuan is still in a very unoptimized state (it takes me ~33 GB of peak VRAM and 50 minutes to generate 720x1280x129f on an A6000).

@tdrussell
Contributor Author

For image training, the memory requirements are similar to Flux. You can have the model in fp8, and basically any reasonable resolution easily fits in 24 GB. The model learns quite well from images alone, and is able to generalize that to video.

For video, 512x512 at 33 frames uses 22.7 GB for a rank 32 LoRA. At some point I want to implement ring attention and/or DeepSpeed Ulysses to parallelize over the sequence dimension, which should let you train on larger and longer videos if you have multiple GPUs. But right now I'm working on the final changes to release the code I have so far.

@rishipandey125

Can you use the image training mode of the script to guide the overall style of videos?

I'm guessing training video LoRAs impacts the output motion more.

@toyxyz

toyxyz commented Dec 10, 2024

> For image training, the memory requirements are similar to Flux. You can have the model in fp8, and basically any reasonable resolution easily fits in 24 GB. The model learns quite well from images alone, and is able to generalize that to video.
>
> For video, 512x512 at 33 frames uses 22.7 GB for a rank 32 LoRA. At some point I want to implement ring attention and/or DeepSpeed Ulysses to parallelize over the sequence dimension, which should let you train on larger and longer videos if you have multiple GPUs. But right now I'm working on the final changes to release the code I have so far.

That's very interesting! Can you show us a video you created using the LoRA?

@kijai
Owner

kijai commented Dec 10, 2024

Alright, I admit my doubts were wrong: load_models_gpu actually does apply the weights. It's a very interesting approach; it just needs more testing to see how wrapping the model like this with the model patcher affects everything else. I've never done it before, but I have thought about it.

A couple of issues I noticed initially: as with base Comfy LoRAs, it doesn't work with torch.compile (I've used a workaround for this before; it's somewhat clumsy, but maybe applicable here as well). Also, the first run for some reason uses ~4 GB more VRAM, but on the second run it's fine, so there are some memory management issues to solve.

@tdrussell
Contributor Author

tdrussell commented Dec 10, 2024

What settings are you using? I'm not seeing a VRAM increase. This is with fp8_e4m3fn_fast and flash_attn_varlen. Without the LoRA, first and second runs:

Sampling 65 frames in 17 latents at 512x720 with 30 inference steps
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:47<00:00, 5.59s/it]
Allocated memory: memory=12.737 GB
Max allocated memory: max_memory=16.686 GB
Max reserved memory: max_reserved=17.656 GB
nvtop: 19.4 gb

Sampling 65 frames in 17 latents at 512x720 with 30 inference steps
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:47<00:00, 5.59s/it]
Allocated memory: memory=12.737 GB
Max allocated memory: max_memory=16.686 GB
Max reserved memory: max_reserved=17.656 GB
nvtop: 18.9 gb

With the LoRA:
Sampling 65 frames in 17 latents at 512x720 with 30 inference steps
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:46<00:00, 5.54s/it]
Allocated memory: memory=12.737 GB
Max allocated memory: max_memory=16.686 GB
Max reserved memory: max_reserved=17.656 GB
nvtop: 19.4 gb

Sampling 65 frames in 17 latents at 512x720 with 30 inference steps
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:49<00:00, 5.65s/it]
Allocated memory: memory=12.277 GB
Max allocated memory: max_memory=16.226 GB
Max reserved memory: max_reserved=17.188 GB
nvtop: 18.9 gb

I also report the peak VRAM use I see in nvtop during denoising. In both cases, nvtop shows ~500 MB less on the second run. Interestingly, the logged memory usage with the LoRA is also slightly lower on the second run, but not without the LoRA. At any rate, they are almost identical with and without the LoRA.

EDIT: I should also add that I'm using Linux, if that matters.

@WangRongsheng

@tdrussell Hi, I would like to know if your script supports LoRA fine-tuning of HunyuanVideo and can be used independently; in fact, I am not very good at using ComfyUI. I think it is important to be able to train and run inference on HunyuanVideo independently.

@tdrussell
Contributor Author

LoRA training code: https://github.com/tdrussell/diffusion-pipe

@NeilsMabet

> LoRA training code: https://github.com/tdrussell/diffusion-pipe

Hi, I trained a model for 20 epochs as a test using your diffusion-pipe script, but I'm lost on how to apply that in ComfyUI and the wrapper. Is there a specific node I should use in this branch?

@kijai
Owner

kijai commented Dec 11, 2024

> LoRA training code: https://github.com/tdrussell/diffusion-pipe
>
> Hi, I trained a model for 20 epochs as a test using your diffusion-pipe script, but I'm lost on how to apply that in ComfyUI and the wrapper. Is there a specific node I should use in this branch?

You just use the normal Comfy model-only LoRA loader node.

@zaitaxiang2012

[screenshot: node type mismatch]

Model type mismatch, so the nodes can't be linked. I modified the type to MODEL myself, but it doesn't work.

@tdrussell
Contributor Author

You shouldn't have to modify anything yourself. Did you check out this PR? You need to do that; there are more changes than just the model type.

@kijai
Owner

kijai commented Dec 12, 2024

The memory issue seems to happen when the loader node is set to load the model on main_device and a LoRA is loaded; it probably leaves the original weights on the GPU or something like that, and after the first offload it clears up.

For the compilation issue... I really don't know. It's the same issue ComfyUI innately has with LoRAs and compile. I have a hacky solution for that (which I just updated so it works again), but it's not optimal and is prone to break with ComfyUI updates.

I'm still on the fence about whether it'd be better to just use the diffusers LoRA loading/fusing, since once the weights are fused it's agnostic to everything else that changes the model, allowing compile/quantization etc. with no issues.
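For reference, the diffusers route would look roughly like this (just a sketch: HunyuanVideo has no official diffusers pipeline yet, so the pipeline class and model id below are placeholders, and I'm assuming a pipeline that implements diffusers' usual LoRA-loading mixin):

```python
import torch
from diffusers import DiffusionPipeline  # an actual HunyuanVideo pipeline class is assumed here

pipe = DiffusionPipeline.from_pretrained(
    "placeholder/hunyuan-video-checkpoint",  # placeholder model id
    torch_dtype=torch.bfloat16,
)
pipe.load_lora_weights("path/to/lora.safetensors")  # diffusers-side LoRA loading
pipe.fuse_lora()  # merge the LoRA into the base weights
# After fusing, the transformer is just plain weights again, so compile/quantization
# don't need to know a LoRA was ever involved.
pipe.transformer = torch.compile(pipe.transformer)
```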

Bringing this whole thing to Comfy natively would take a lot more work and would be a pain to maintain, unless they picked it up as part of the core; possibly they already have plans for that.

@OrphBean

OrphBean commented Dec 12, 2024

@kijai I asked on their repo and they referred me back here, so I encourage you to ask them, or assume it's not happening in the short term at least.

@kijai
Owner

kijai commented Dec 12, 2024

I think I'll do this as a compromise. I do like the Comfy LoRA loading over diffusers; it's just that the patching way of doing it causes issues with compilation etc.

[screenshot of the proposed node setup]

This would still use the Comfy LoRA loading as this PR outlined; however, the LoRA would be loaded in the ModelLoader node instead. This allows loading it before compilation, thus making compile work with LoRAs.

The downside is that switching LoRAs requires reloading the whole model. I also don't really want to make the model output/input the Comfy MODEL type, as that creates the false assumption that it can work with other MODEL nodes (besides a couple of exceptions like the LoRA loader, of course), when it can't, since this is a wrapper.
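Conceptually the ordering is the whole point; something like this (illustrative only, the LoRA-applying helper is hypothetical):

```python
import torch
from typing import Optional

def load_transformer(transformer: torch.nn.Module,
                     lora_sd: Optional[dict] = None,
                     lora_strength: float = 1.0,
                     compile_model: bool = False) -> torch.nn.Module:
    if lora_sd is not None:
        # Merge the LoRA into the weights first (hypothetical helper; see the
        # earlier nn.Linear sketch), so the compiled graph only ever sees the
        # already-patched weights.
        apply_lora_state_dict(transformer, lora_sd, lora_strength)
    if compile_model:
        # Compilation happens last; swapping LoRAs later means reloading the model.
        transformer = torch.compile(transformer)
    return transformer
```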

@tdrussell
Contributor Author

Sounds good. One thing we should also decide is what the key format in the LoRA state_dict should look like. Right now the training script gives every key a "diffusion_model." prefix, since that works with ComfyUI LoRA loading out of the box with no changes. So like this:

diffusion_model.single_blocks.13.linear1.lora_A.weight
diffusion_model.single_blocks.13.linear1.lora_B.weight
diffusion_model.single_blocks.13.linear2.lora_A.weight
diffusion_model.single_blocks.13.linear2.lora_B.weight
etc

But Diffusers has its own format, although HunyuanVideo doesn't have official Diffusers integration yet. I think the convention is for the prefix to be the name of the attribute on the Diffusers pipeline object, i.e. in this case it would be:
transformer.single_blocks.13.linear1.lora_A.weight
transformer.single_blocks.13.linear1.lora_B.weight

This would need changes in ComfyUI code. E.g. you can find in the code where that prefix is handled specially for Flux so it can load plain Diffusers-trained LoRAs. There are already a couple of people starting to make HunyuanVideo LoRAs, so we should resolve the format soon.
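For illustration, converting between the two prefix conventions is just a key rename over the state_dict; a rough sketch using safetensors (file paths are placeholders):

```python
from safetensors.torch import load_file, save_file

def comfy_prefix_to_diffusers(path_in: str, path_out: str) -> None:
    sd = load_file(path_in)
    # Swap the "diffusion_model." prefix used by the training script for the
    # diffusers-style "transformer." prefix (the attribute name on the pipeline).
    renamed = {k.replace("diffusion_model.", "transformer.", 1): v for k, v in sd.items()}
    save_file(renamed, path_out)
```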

@kijai
Owner

kijai commented Dec 12, 2024

> Sounds good. One thing we should also decide is what the key format in the LoRA state_dict should look like. Right now the training script gives every key a "diffusion_model." prefix, since that works with ComfyUI LoRA loading out of the box with no changes. So like this:
>
> diffusion_model.single_blocks.13.linear1.lora_A.weight
> diffusion_model.single_blocks.13.linear1.lora_B.weight
> diffusion_model.single_blocks.13.linear2.lora_A.weight
> diffusion_model.single_blocks.13.linear2.lora_B.weight
> etc
>
> But Diffusers has its own format, although HunyuanVideo doesn't have official Diffusers integration yet. I think the convention is for the prefix to be the name of the attribute on the Diffusers pipeline object, i.e. in this case it would be:
>
> transformer.single_blocks.13.linear1.lora_A.weight
> transformer.single_blocks.13.linear1.lora_B.weight
>
> This would need changes in ComfyUI code. E.g. you can find in the code where that prefix is handled specially for Flux so it can load plain Diffusers-trained LoRAs. There are already a couple of people starting to make HunyuanVideo LoRAs, so we should resolve the format soon.

Yes, for example CogVideoX LoRAs use transformer.transformer_blocks. (it only has one block type, no single/double), so that sounds about right. With my suggested change it's also easy to load any prefix format without having to touch Comfy code.

kijai merged commit 7d824f7 into kijai:main on Dec 12, 2024
@kijai
Owner

kijai commented Dec 12, 2024

@tdrussell thanks a lot for all this. I merged with the changes I mentioned; I can adjust the prefix etc. in the future if needed.

tdrussell deleted the lora branch on December 13, 2024 at 02:07
@slavakurilyak

It would be great to see an updated README covering the LoRA support.

@Celtmant

Celtmant commented Dec 13, 2024

Hi, I see you've updated it to use LoRA, but here's what I noticed: with the new workflow, RAM consumption is just off the scale... for me it goes through the roof. With the previous workflow everything ran smoothly...

@kijai
Owner

kijai commented Dec 13, 2024

> Hi, I see you've updated it to use LoRA, but here's what I noticed: with the new workflow, RAM consumption is just off the scale... for me it goes through the roof. With the previous workflow everything ran smoothly...

The scale being what? Loading the LLM without quantization along with the model has always taken a lot of RAM. The nodes have options to juggle what is offloaded and what is kept in VRAM, which you can use to balance it for your system; overall RAM usage has not increased. To reduce overall RAM usage, the best thing is to use the bnb_nf4 quantization; it has very minimal effect on quality, if any.

@rayryeng

Would it be possible to have another example JSON that shows how LoRA would fit in the ComfyUI workflow? I'm very new to this landscape and I'm not sure what blocks I need to add given what I've seen in this conversation. Is it as simple as a LoRA loader after loading in the diffusion model?

@kijai
Owner

kijai commented Dec 14, 2024

> Would it be possible to have another example JSON that shows how LoRA would fit in the ComfyUI workflow? I'm very new to this landscape and I'm not sure what blocks I need to add given what I've seen in this conversation. Is it as simple as a LoRA loader after loading in the diffusion model?

In the current main version you simply connect a LoRA select node to the model loader node; it has a LoRA input now.

@rayryeng

@kijai Thank you so much for your reply!

@vtrixlaz

Sorry for the newbie question, and thanks @kijai for your amazing work. How would you train a LoRA for Hunyuan, and would it be Flux or Stable Diffusion based?

@rayryeng

rayryeng commented Dec 25, 2024

> Sorry for the newbie question, and thanks @kijai for your amazing work. How would you train a LoRA for Hunyuan, and would it be Flux or Stable Diffusion based?

There is a discussion thread here. The OP of this PR got the LoRA node merged in here and referenced his training repo for training LoRAs to plug into HunyuanVideo, but there is no official README about it in this repo.

As for the actual architectures LoRAs can be trained on, the repo supports Flux, Stable Diffusion, and HunyuanVideo. The LoRA in question here is for HunyuanVideo.

tdrussell/diffusion-pipe#6
