
Hi, can you provide some example data for HunyuanVideo LoRA training? #6

jax-explorer opened this issue Dec 13, 2024 · 14 comments



Hi, can you provide some example data for HunyuanVideo LoRA training?


jordoh commented Dec 15, 2024

Not affiliated with this repo, but I've been somewhat successful training with a typical Flux person-LoRA dataset: 36 1024x1024 images with JoyCaption Alpha Two captions, using a unique descriptor for the subject. Likeness in any given video output is fairly hit-or-miss; a prompt similar to a training image caption produces pretty good likeness, with Facenet512 evaluation registering face matches against as many as 92% of the images in a validation set (which includes images not in the training data).

So far I've tried the following, with all other values matching the defaults in the example configurations (a sketch of the face-match check is below the list):

  • LR 2e-5 (default in the example config): best face match at 900 steps (60% of validation images matching), with no improvement when measuring periodically all the way out to 3,600 steps.

  • LR 6e-5: best face match (62%) reached at 360 steps, no further improvement through 900 steps

  • LR 1e-4: best face match (92%) reached at 900 steps; the match percentage dips, then climbs back to (almost) the same level ~1,000 steps later.
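For context on how such a percentage can be computed: below is a minimal sketch using the deepface library's Facenet512 model to count how many validation images "match" a generated frame. The paths and the simple verified/not-verified criterion are assumptions for illustration, not necessarily the exact evaluation used above.

# Minimal sketch: count Facenet512 matches of one generated frame against a validation set.
# Requires the deepface package; all paths below are hypothetical.
from pathlib import Path
from deepface import DeepFace

generated_frame = "outputs/sample_frame.png"   # a frame extracted from a generated video
validation_dir = Path("validation_images")     # held-out reference images of the subject

val_images = sorted(validation_dir.glob("*.png"))
matches = 0
for val_img in val_images:
    result = DeepFace.verify(
        img1_path=generated_frame,
        img2_path=str(val_img),
        model_name="Facenet512",
        enforce_detection=False,  # don't raise if a face isn't detected in a frame
    )
    if result["verified"]:
        matches += 1

print(f"{matches}/{len(val_images)} validation images matched")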

Per the repo readme, captions are in .txt files with matching basenames:
[attached screenshot]
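A quick way to sanity-check that pairing before training; the directory name and extensions here are placeholders:

# List any image in the dataset folder that is missing a same-named caption .txt file.
from pathlib import Path

dataset_dir = Path("dataset")                   # hypothetical dataset directory
image_exts = {".png", ".jpg", ".jpeg", ".webp"}

for img in sorted(p for p in dataset_dir.iterdir() if p.suffix.lower() in image_exts):
    caption = img.with_suffix(".txt")
    if not caption.exists():
        print(f"missing caption for {img.name}")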

Some random observations:

  • It's unclear why 36 images with 10 repeats set in dataset.toml result in 90 steps per epoch
  • The LoRA is applied at 1.0 strength; 1.2 strength consistently reduces the Facenet512 face match
  • With randomly generated prompts, the best resulting LoRA produces a subjective face match about 20% of the time, while ~75% of outputs register a (Facenet512) face match against at least one validation image (but fall short of a subjective assessment of likeness)


jax-explorer commented Dec 16, 2024


@jordoh So the recommendation is:
Use 1024 x 1024 x 36 images
Learning rate LR 1e-4
Training 1000 steps

Is that right?


jordoh commented Dec 16, 2024

@jordoh So the recommendation is: Use 1024 x 1024 x 36 images Learning rate LR 1e-4 Training 1000 steps

Is that right?

If you are trying to train a person's likeness (and not a style or camera motion, etc), I have had success with those settings, yes.

I'm now training with 50 720x540 videos, 30-80 frames per video, with JoyCaption Alpha Two captions of the first frame (manually adjusted to avoid phrases like "a photo of ..."). It died at step 9 on the first attempt (unclear exception) and is currently at step 13 of the second attempt. VRAM usage is ~45 GB. To answer one of my observations/questions in the previous comment: the default batch size is 4 (previously 36 images * 10 repeats / batch size 4 = 90 steps per epoch; now 50 videos / batch size 4 = 10 steps per epoch, not sure how that maths 🤷). In theory, setting batch_size = 2 in examples/hunyuan_video.toml would affect that and reduce VRAM usage.
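For what it's worth, the naive arithmetic reproduces the image case exactly but not the video case (50 / 4 rounds to 13, not 10), so the bucketing presumably regroups some items. A tiny sketch of that naive calculation (the ceiling rounding is an assumption about the trainer's behavior):

# Naive steps-per-epoch estimate; real aspect-ratio/frame bucketing may regroup items.
import math

def naive_steps_per_epoch(num_items: int, repeats: int, batch_size: int) -> int:
    return math.ceil(num_items * repeats / batch_size)

print(naive_steps_per_epoch(36, 10, 4))  # 90 -> matches the image run
print(naive_steps_per_epoch(50, 1, 4))   # 13 -> not the 10 reported for the video run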

@jax-explorer


@jordoh Yes, I was preparing to train for a character.

By the way, what use case are you now training with video for? I see you mentioned above that you can already get good results training with pictures.


jordoh commented Dec 16, 2024

By the way, what use case are you now training with video for? I see you mentioned above that you can already get good results training with pictures.

I'm using iPhone live photos, as they generally capture speaking and other natural movement (smiling, etc), and have an aspect ratio that can scale down to what seems to be dimensions hunyuan works well at (720x540).

A note on the batch size: it's dictated by gradient accumulation steps. Setting that to 1 reduces the batch size to 1 as well, but VRAM usage is still pretty high at ~42GB, so not much savings there.
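One way to see the effective batch size a config implies is to read the two relevant keys (these key names appear in the example config quoted later in this thread; treating their product as the effective batch size is an assumption about how the trainer combines them):

# Read micro-batch size and gradient accumulation steps from the training config.
import tomllib  # Python 3.11+

with open("examples/hunyuan_video.toml", "rb") as f:
    cfg = tomllib.load(f)

micro = cfg.get("micro_batch_size_per_gpu", 1)
gas = cfg.get("gradient_accumulation_steps", 1)
print(f"effective batch size per GPU: {micro} x {gas} = {micro * gas}")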

@zeldapkmn


Hey,

I keep getting this error after very few steps:

"[rank0]: RuntimeError: CUDA error: unknown error
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2024-12-16 01:02:59,531] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1042
[2024-12-16 01:02:59,581] [ERROR] [launch.py:325:sigkill_handler] ['/opt/fsl/bin/python', '-u', 'train.py', '--local_rank=0', '--deepspeed', '--config', 'examples/hunyuan_video.toml'] exits with return code = 1".

4090 with 24GB VRAM, 15 dataset images (not resized, assuming aspect-ratio bucketing will handle it), resolution set to 2048 instead of 512 (since that's the highest res), batch size = 2, LoRA rank 16.

Step time is highly variable (e.g. steps 1, 2, and 4 take a few seconds, but steps 3 and 5 take 30-45 minutes). Any ideas?


jax-explorer commented Dec 17, 2024

One more question: how are trigger words set up here? Is it the same as Flux, i.e. just include the word in the caption file?

tdrussell (repo owner) commented Dec 17, 2024

set 2048 as the resolution instead of 512 (since this is the highest res), batch size=2

I doubt this fits in 24GB of VRAM. Your images are 16x larger in area than 512x512, plus you're using batch size 2 instead of 1. Are you using WSL? Doesn't Windows have that weird thing where it swaps VRAM to system RAM automatically? Does that even get enabled inside WSL? I don't know. That might be why steps suddenly take extremely long to complete.

Try training at 512 res with batch size 1 to start with. HunyuanVideo isn't even pretrained at super high resolutions like 2048, so it might not even work right even if you could run it.


comfyonline commented Dec 17, 2024

I use the following configuration:
# I usually set this to a really high value because I don't know how long I want to train.
epochs = 1000
# Batch size of a single forward/backward pass for one GPU.
micro_batch_size_per_gpu = 4
# Pipeline parallelism degree. A single instance of the model is divided across this many GPUs.
pipeline_stages = 1
# Number of micro-batches sent through the pipeline for each training step.
# If pipeline_stages > 1, a higher GAS means better GPU utilization due to smaller pipeline bubbles (where GPUs aren't overlapping computation).
gradient_accumulation_steps = 4
# Grad norm clipping.
gradient_clipping = 1.0
# Learning rate warmup.
warmup_steps = 100

I'm training 36 images at 1024x1024; each step takes nearly 1 minute on an L40. Very slow.
@jordoh @tdrussell May I ask if your speeds are normal?


zeldapkmn commented Dec 17, 2024


Yep, working now after resizing heights to 1024 max

As a side note, have you got TorchCompile running successfully in Hunyuan?
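For anyone hitting the same out-of-memory wall, a minimal sketch of that kind of pre-resize (capping image height at 1024 while preserving aspect ratio, using Pillow); the directory names are placeholders:

# Downscale any image taller than 1024 px, writing results to a separate folder.
from pathlib import Path
from PIL import Image

src_dir = Path("dataset_raw")       # hypothetical input directory
dst_dir = Path("dataset_resized")   # hypothetical output directory
dst_dir.mkdir(exist_ok=True)

for img_path in sorted(src_dir.glob("*.png")):
    with Image.open(img_path) as im:
        if im.height > 1024:
            new_w = round(im.width * 1024 / im.height)
            im = im.resize((new_w, 1024), Image.Resampling.LANCZOS)
        im.save(dst_dir / img_path.name)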


wwwffbf commented Dec 17, 2024

As a side note, have you got TorchCompile running successfully in Hunyuan?

Triton worked, using the latest ComfyUI version.


jordoh commented Dec 17, 2024

One more question: how are trigger words set up here? Is it the same as Flux, i.e. just include the word in the caption file?

I've trained using JoyCaption-generated captions that include a unique trigger word; I haven't tried with just the trigger word alone. For videos, I'm using a JoyCaption-generated caption of the first frame, including a unique trigger word.
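A small sanity check that the trigger word actually appears in every caption file (the token and the dataset path are hypothetical):

# Flag caption files that don't contain the chosen trigger word.
from pathlib import Path

trigger = "sks_person"         # hypothetical unique trigger word
dataset_dir = Path("dataset")  # hypothetical dataset directory

for caption_path in sorted(dataset_dir.glob("*.txt")):
    if trigger not in caption_path.read_text(encoding="utf-8"):
        print(f"{caption_path.name} is missing the trigger word")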

@jordoh @tdrussell May I ask if your speeds are normal?

On an A40, I was seeing 30 minutes per epoch of 360 images (36 images x 10 repeats) with batch size 4 (so reported as 90 steps), at somewhere around 18 seconds per step (each step being a 4-image batch).
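Those numbers are roughly self-consistent; plain arithmetic, nothing trainer-specific:

# 36 images x 10 repeats = 360 items; at batch size 4 that's 90 steps per epoch.
steps_per_epoch = 360 // 4
seconds_per_step = 18
print(steps_per_epoch * seconds_per_step / 60)  # ~27 minutes, in line with ~30 min/epoch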

@comfyonline

@jordoh Thank you very much for the answer.

@comfyonline

@jordoh It looks like my batch size was effectively 16 (4 x 4), which is why the speed went down. The L40 48GB is supposed to be a lot faster than the A40, but for the same 36 images at 1024x1024 it took me 240 min and you almost 300 min, and we both ended up with almost 10 epochs.
