LoRA support #72
Conversation
Excellent, thank you for your wonderful work! I was wondering if it is possible to train HunyuanVideo's LoRA with 24gb.
Hey, appreciate the support! But I don't really see how this would allow loading LoRAs; as a wrapper it doesn't use Comfy sampling and thus weight patching etc. When you say it works, do you mean the LoRA weights actually get loaded and affect the output? As a diffusers wrapper I think we'd have to use diffusers LoRA loading; I've previously done that for CogVideoX, so it's probably not an issue to add in a similar manner.
Yes, it actually works and affects the output. I realize I should provide more proof than "just trust me bro". Here is a test lora you can use to confirm. It was trained on a very small number of black-and-white photos with no captions, so it will make everything grayscale even without mentioning that in the prompt. You may need to bump the strength up to 1.2 or 1.4, but you'll see it works for both images and videos. I was also surprised at first, but from reading through the ComfyUI code it looks like as long as the model is an attribute of a BaseModel, and is wrapped in a ModelPatcher, the normal lora loader node can target any nn.Linear layer.
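For anyone wanting to verify what a LoRA like this actually patches, one quick check (not part of the PR; the filename below is hypothetical) is to list the keys in the LoRA's safetensors file. Each lora_A/lora_B pair corresponds to one nn.Linear the loader can target:

```python
# Minimal sketch: list which module paths a LoRA file targets.
# The filename is hypothetical; the "diffusion_model." prefix matches the
# key convention discussed later in this thread.
from safetensors.torch import load_file

state_dict = load_file("hunyuan_test_lora.safetensors")

targets = sorted({
    key.split(".lora_")[0]
    for key in state_dict
    if ".lora_A." in key or ".lora_B." in key
})

for name in targets:
    print(name)  # e.g. diffusion_model.single_blocks.13.linear1
```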
What does the hardware requirement look like for training loras? I'm assuming Hunyuan is still in a very unoptimized state currently (it takes me like ~33gb of peak VRAM usage and 50 minutes to generate 720x1280x129f on an A6000).
For image training, the memory requirements are similar to Flux. You can have the model in fp8, and basically any reasonable resolution easily fits in 24gb. The model learns quite well from images alone, and is able to generalize that to video. For video, 512x512 at 33 frames uses 22.7gb for a rank 32 lora. At some point I want to implement ring attention and/or Deepspeed Ulysses to parallelize over the sequence dimension, which should let you train on larger and longer videos if you have multiple GPUs. But right now I'm working on the final changes to release the code I have so far.
Can you use the image model training script to guide the overall style of videos? I am guessing training video loras impacts the output motion more.
That's very interesting! Can you show us a video you created using Lora?
Alright, I admit I was wrong with my doubts. A couple of issues I noticed initially: as with base Comfy LoRAs, it doesn't work with torch.compile (I've used a workaround for this before; it's somewhat clumsy but maybe applicable here as well). Also, the first run for some reason uses ~4GB more VRAM, but on the 2nd run it's fine, so there are some memory management issues to solve.
What settings are you using? I'm not seeing a VRAM increase. This is with fp8_e4m3fn_fast and flash_attn_varlen.

Without lora, first and second run:
Sampling 65 frames in 17 latents at 512x720 with 30 inference steps
Sampling 65 frames in 17 latents at 512x720 with 30 inference steps

With lora:
Sampling 65 frames in 17 latents at 512x720 with 30 inference steps

I also report the peak VRAM use I see in nvtop during the denoising. In both cases, nvtop shows ~500MB less on the second run. Interestingly, the logged memory usage with the lora is also slightly lower on the second run, but not without the lora. At any rate, they are almost identical with/without lora.

EDIT: I should also add that I'm using Linux, if that matters.
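For reference, a rough way to reproduce this kind of peak-VRAM measurement outside of nvtop is PyTorch's own counters (a sketch, not the wrapper's actual logging). Note that nvtop reports the whole CUDA context while these counters only cover memory PyTorch itself manages, which can explain small discrepancies between the two numbers:

```python
import torch

# Reset the counters just before the sampling call you want to measure.
torch.cuda.reset_peak_memory_stats()

# ... run the sampler / denoising loop here ...

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")
```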
@tdrussell Hi, I would like to know whether your script supports LoRA fine-tuning of HunyuanVideo and can be used independently; in fact I am not very good at using ComfyUI. I think it is important to be able to train and run inference on HunyuanVideo independently.
LoRA training code: https://github.com/tdrussell/diffusion-pipe
Hi, I kinda trained a model for 20 epochs for testing using your diffusion-pipe script, but I'm lost on how to apply it in ComfyUI and the wrapper. Is there a specific node I should use in this branch?
You just use the normal Comfy model-only LoRA loader.
You shouldn't have to modify anything yourself. Did you check out this PR? You need to do that; there are more changes than just the model type.
The memory issue seems to happen when the loader node is set to load the model on main_device and a LoRA is loaded; it probably leaves the original weights on the GPU or something like that, and after the first offload it clears up.

For the compilation issue... I really don't know. It's the same issue ComfyUI innately has with LoRAs and compile. I have a hacky solution to that (which I just updated so it works again), but it's not optimal and is prone to break with ComfyUI updates.

I'm still on the fence whether it'd be better to just use the diffusers LoRA loading/fusing, since once the weights are fused it's agnostic to everything else that changes the model, allowing compile/quantization etc. with no issues. Bringing this whole thing to Comfy natively would take a lot more work and would be a pain to maintain, unless they picked it up as part of the core; possibly they already have plans for that.
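For context, the diffusers load-and-fuse path mentioned above looks roughly like the sketch below. It is shown with CogVideoX (which already has official Diffusers LoRA support, unlike HunyuanVideo at this point), and the LoRA filename is hypothetical:

```python
# Sketch of the "fuse before compiling" approach with diffusers; the
# checkpoint id is CogVideoX as an example, the LoRA path is a placeholder.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# Load the LoRA and fuse it into the base weights...
pipe.load_lora_weights("my_cogvideox_lora.safetensors")  # hypothetical file
pipe.fuse_lora(lora_scale=1.0)

# ...so that compilation (or quantization) afterwards sees plain fused weights.
pipe.transformer = torch.compile(pipe.transformer)
```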
@kijai I asked on their git and they referred me back here, so I encourage you to ask them, or assume it's not a thing in the short term at least.
I think I'll do this as a compromise; I do like the Comfy LoRA loading over diffusers, it's just that the patching way of doing it causes issues with compilation etc. This would still use the Comfy LoRA loading like this PR outlined, but the LoRA would be loaded in the ModelLoader node instead, which allows loading it before compilation and thus makes compile work with LoRAs. The downside is that switching LoRAs requires reloading the whole model. I also don't really want to make the model output/input into the Comfy MODEL type, as that creates the false assumption that it can work with other MODEL nodes (besides a couple of exceptions like the LoRA loader, of course), when it doesn't, since this is a wrapper.
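A rough outline of that "load the LoRA before compilation" flow, using ComfyUI's LoRA utilities. Treat this purely as an assumption-laden sketch: the actual node implementation differs, ComfyUI internals change often, and it assumes the transformer is exposed as model.diffusion_model (as the key prefix discussed below suggests):

```python
# Sketch only: apply LoRA patches at model-load time, then compile, so
# torch.compile never sees the patching hooks.
import torch
import comfy.utils
import comfy.sd

def load_model_with_lora(model_patcher, lora_path, strength=1.0):
    # model_patcher: the ModelPatcher wrapping the HunyuanVideo transformer.
    lora_sd = comfy.utils.load_torch_file(lora_path)

    # Register the LoRA patches on the patcher (model only, no CLIP).
    patched, _ = comfy.sd.load_lora_for_models(model_patcher, None, lora_sd, strength, 0)

    # Bake the patches into the weights, then compile the plain module.
    patched.patch_model()
    patched.model.diffusion_model = torch.compile(patched.model.diffusion_model)
    return patched
```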
Sounds good. One thing we should also decide is what the key format in the lora state_dict should look like. Right now the training script makes it so every key has a "diffusion_model." prefix, since that works with ComfyUI lora loading out of the box with no changes. So like this:

diffusion_model.single_blocks.13.linear1.lora_A.weight

But Diffusers has its own format, although HunyuanVideo doesn't have official Diffusers integration yet. I think the convention is to have the prefix be the name of the attribute on the Diffusers pipeline object, so in this case the prefix would be the pipeline attribute name instead. That would need changes in ComfyUI code; e.g. you can find in the code where that prefix is handled specially for Flux so it can load plain Diffusers-trained loras. There are already a couple of people starting to make HunyuanVideo loras, so we should resolve the format soon.
Yes, for example CogVideoX LoRAs are like that.
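If the conventions end up differing, remapping keys between the two prefixes is straightforward. In the sketch below the "transformer." prefix is an assumption based on the Diffusers habit of using the pipeline attribute name, and the filenames are hypothetical:

```python
# Sketch: rename keys from a Diffusers-style prefix to the
# "diffusion_model." prefix that ComfyUI's lora loader expects.
from safetensors.torch import load_file, save_file

def remap_prefix(path_in, path_out, old="transformer.", new="diffusion_model."):
    sd = load_file(path_in)
    remapped = {
        (new + k[len(old):]) if k.startswith(old) else k: v
        for k, v in sd.items()
    }
    save_file(remapped, path_out)

# remap_prefix("lora_diffusers.safetensors", "lora_comfy.safetensors")
```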
@tdrussell thanks a lot for all this, I merged with the changes I mentioned, I can adjust the prefix etc. in the future if needed
It would be great to see an updated README based on LoRA support
Hi, I see you've updated it to use LoRA, but here's what I noticed: with the new workflow, RAM consumption is off the charts for me and everything blows up. With the previous workflow, everything ran smoothly...
Off the charts compared to what? Loading the LLM without quantization along with the model has always taken a lot of RAM. The nodes have options to juggle what is offloaded and what is kept in VRAM, which you can use to balance it for your system; overall RAM usage has not increased. To reduce overall RAM usage, the best thing is to use the bnb_nf4 quantization; it has very minimal effect on quality, if any.
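For a sense of what the bnb_nf4 option does on the text encoder side, here is what nf4 quantization of an LLM looks like with transformers + bitsandbytes. The wrapper implements this in its own nodes, so the model path and details below are placeholders:

```python
# Sketch: load the LLM text encoder in 4-bit nf4 to cut memory use.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

text_encoder = AutoModelForCausalLM.from_pretrained(
    "path/to/llm-text-encoder",   # placeholder for HunyuanVideo's LLM text encoder
    quantization_config=quant_config,
    device_map="auto",
)
```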
Would it be possible to have another example JSON that shows how LoRA would fit in the ComfyUI workflow? I'm very new to this landscape and I'm not sure what blocks I need to add given what I've seen in this conversation. Is it as simple as a LoRA loader after loading in the diffusion model?
In the current main version you simply add a LoRA select node to the model loader node; it has a LoRA input now.
@kijai Thank you so much for your reply!
Sorry for the newbie question, and thanks @kijai for your amazing work. How would you train a LoRA for Hunyuan, and would it be Flux or Stable Diffusion?
There is a discussion thread here. The OP of the PR got the LoRA node merged in here and referenced his training repo for training LoRAs to plug into HunyuanVideo, but there is no official README about it in this repo. As for the architectures the training repo supports, it covers Flux, Stable Diffusion and HunyuanVideo; the LoRA being trained in question is for HunyuanVideo.
This adds LoRA support. I have a training script that can train loras and will open-source the code in the next few days; this PR gets the LoRA support ready. I've already tested it and can confirm that it works as expected with the test lora I've trained.
That being said, I'm not familiar with ComfyUI internals AT ALL. If this change is doing something wrong, please let me know.
I had to remove the mm.unload_all_models() call from the sampler. The transformer is now a normal ComfyUI model, so it would get offloaded to CPU before the denoising loop without that change. I don't know why it was needed in the first place, since when force_offload is set, it appears that every model is moved back to CPU when it's done being used.