
Hi, can you provide some example data for HunyuanVideo LoRA training? #6

jax-explorer opened this issue Dec 13, 2024 · 14 comments



Hi, can you provide some example data for HunyuanVideo LoRA training?


jordoh commented Dec 15, 2024

Not affiliated with this repo, but I've been somewhat successful training with a typical Flux person-LoRA dataset: 36 1024x1024 images with JoyCaption Alpha Two captions, using a unique descriptor for the subject. Likeness in any given video output is fairly hit-or-miss; a prompt similar to a training image caption produces pretty good likeness, with Facenet512 evaluation registering face matches against as many as 92% of the images in a validation set (which includes images not in the training data).

So far I've tried the following, with all other values matching the defaults in the example configurations (a sketch of the face-match check is below the list):

  • LR 2e-5 (default in the example config): best face match at 900 steps (60% of validation images matching), with no improvement when measuring periodically all the way out to 3,600 steps.

  • LR 6e-5: best face match (62%) reached at 360 steps, no further improvement through 900 steps

  • LR 1e-4: best face match (92%) reached at 900 steps; the match percentage dips, then climbs back to (almost) the same level ~1,000 steps later.
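For context on how such a percentage can be computed: below is a minimal sketch using the deepface library's Facenet512 model to count how many validation images "match" a generated frame. The paths and the simple verified/not-verified criterion are assumptions for illustration, not necessarily the exact evaluation used above.

# Minimal sketch: count Facenet512 matches of one generated frame against a validation set.
# Requires the deepface package; all paths below are hypothetical.
from pathlib import Path
from deepface import DeepFace

generated_frame = "outputs/sample_frame.png"   # a frame extracted from a generated video
validation_dir = Path("validation_images")     # held-out reference images of the subject

val_images = sorted(validation_dir.glob("*.png"))
matches = 0
for val_img in val_images:
    result = DeepFace.verify(
        img1_path=generated_frame,
        img2_path=str(val_img),
        model_name="Facenet512",
        enforce_detection=False,  # don't raise if a face isn't detected in a frame
    )
    if result["verified"]:
        matches += 1

print(f"{matches}/{len(val_images)} validation images matched")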

Per the repo readme, captions are in .txt files with matching basenames:
[attached screenshot]
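A quick way to sanity-check that pairing before training; the directory name and extensions here are placeholders:

# List any image in the dataset folder that is missing a same-named caption .txt file.
from pathlib import Path

dataset_dir = Path("dataset")                   # hypothetical dataset directory
image_exts = {".png", ".jpg", ".jpeg", ".webp"}

for img in sorted(p for p in dataset_dir.iterdir() if p.suffix.lower() in image_exts):
    caption = img.with_suffix(".txt")
    if not caption.exists():
        print(f"missing caption for {img.name}")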

Some random observations:

  • It's unclear why 36 images with 10 repeats set in dataset.toml result in 90 steps per epoch
  • The LoRA is applied at 1.0 strength; 1.2 strength consistently reduces the Facenet512 face match
  • With randomly generated prompts, the best resulting LoRA produces a subjective face match about 20% of the time, while ~75% of outputs register a (Facenet512) face match against at least one validation image (but fall short of a subjective assessment of likeness)


jax-explorer commented Dec 16, 2024


@jordoh So the recommendation is:
Use 1024 x 1024 x 36 images
Learning rate LR 1e-4
Training 1000 steps

Is that right?


jordoh commented Dec 16, 2024

@jordoh So the recommendation is: Use 1024 x 1024 x 36 images Learning rate LR 1e-4 Training 1000 steps

Is that right?

If you are trying to train a person's likeness (and not a style or camera motion, etc), I have had success with those settings, yes.

I'm now training with 50 720x540 videos, 30-80 frames per video, with JoyCaption Alpha Two captions of the first frame (manually adjusted to avoid phrases like "a photo of ..."). It died at step 9 on the first attempt (unclear exception) and is currently at step 13 of the second attempt. VRAM usage is ~45 GB. To answer one of my observations/questions in the previous comment: the default batch size is 4 (previously 36 images * 10 repeats / batch size 4 = 90 steps per epoch; now 50 videos / batch size 4 = 10 steps per epoch, not sure how that maths 🤷). In theory, setting batch_size = 2 in examples/hunyuan_video.toml would affect that and reduce VRAM usage.
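For what it's worth, the naive arithmetic reproduces the image case exactly but not the video case (50 / 4 rounds to 13, not 10), so the bucketing presumably regroups some items. A tiny sketch of that naive calculation (the ceiling rounding is an assumption about the trainer's behavior):

# Naive steps-per-epoch estimate; real aspect-ratio/frame bucketing may regroup items.
import math

def naive_steps_per_epoch(num_items: int, repeats: int, batch_size: int) -> int:
    return math.ceil(num_items * repeats / batch_size)

print(naive_steps_per_epoch(36, 10, 4))  # 90 -> matches the image run
print(naive_steps_per_epoch(50, 1, 4))   # 13 -> not the 10 reported for the video run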

@jax-explorer


@jordoh Yes, I was preparing to train for a character.

By the way, what use case are you now training with video for? I see you mentioned above that you can already get good results training with pictures.


jordoh commented Dec 16, 2024

By the way, what use case are you now training with video for? I see you mentioned above that you can already get good results training with pictures.

I'm using iPhone live photos, as they generally capture speaking and other natural movement (smiling, etc), and have an aspect ratio that can scale down to what seems to be dimensions hunyuan works well at (720x540).

A note on the batch size: it's dictated by gradient accumulation steps. Setting that to 1 reduces the batch size to 1 as well, but VRAM usage is still pretty high at ~42GB, so not much savings there.
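One way to see the effective batch size a config implies is to read the two relevant keys (these key names appear in the example config quoted later in this thread; treating their product as the effective batch size is an assumption about how the trainer combines them):

# Read micro-batch size and gradient accumulation steps from the training config.
import tomllib  # Python 3.11+

with open("examples/hunyuan_video.toml", "rb") as f:
    cfg = tomllib.load(f)

micro = cfg.get("micro_batch_size_per_gpu", 1)
gas = cfg.get("gradient_accumulation_steps", 1)
print(f"effective batch size per GPU: {micro} x {gas} = {micro * gas}")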

@zeldapkmn


Hey,

I keep getting this error after very few steps:

"[rank0]: RuntimeError: CUDA error: unknown error
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2024-12-16 01:02:59,531] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1042
[2024-12-16 01:02:59,581] [ERROR] [launch.py:325:sigkill_handler] ['/opt/fsl/bin/python', '-u', 'train.py', '--local_rank=0', '--deepspeed', '--config', 'examples/hunyuan_video.toml'] exits with return code = 1".

4090 with 24GB VRAM, 15 dataset images (not resized, assuming aspect-ratio bucketing will handle it), resolution set to 2048 instead of 512 (since that's the highest res), batch size = 2, LoRA rank 16.

Step time is highly variable (e.g. steps 1, 2, and 4 take a few seconds, but steps 3 and 5 take 30-45 minutes). Any ideas?


jax-explorer commented Dec 17, 2024

One more question: how are trigger words set up here? Is it the same as Flux, i.e. just include the word in the caption file?

tdrussell (repo owner) commented Dec 17, 2024

set 2048 as the resolution instead of 512 (since this is the highest res), batch size=2

I doubt this fits in 24GB of VRAM. Your images are 16x larger in area than 512x512, plus you're using batch size 2 instead of 1. Are you using WSL? Doesn't Windows have that weird thing where it swaps VRAM to system RAM automatically? Does that even get enabled inside WSL? I don't know. That might be why steps suddenly take extremely long to complete.

Try training at 512 res with batch size 1 to start with. HunyuanVideo isn't even pretrained at super high resolutions like 2048, so it might not even work right even if you could run it.


comfyonline commented Dec 17, 2024

I use the following configuration:
# I usually set this to a really high value because I don't know how long I want to train.
epochs = 1000
# Batch size of a single forward/backward pass for one GPU.
micro_batch_size_per_gpu = 4
# Pipeline parallelism degree. A single instance of the model is divided across this many GPUs.
pipeline_stages = 1
# Number of micro-batches sent through the pipeline for each training step.
# If pipeline_stages > 1, a higher GAS means better GPU utilization due to smaller pipeline bubbles (where GPUs aren't overlapping computation).
gradient_accumulation_steps = 4
# Grad norm clipping.
gradient_clipping = 1.0
# Learning rate warmup.
warmup_steps = 100

I'm training 36 images at 1024x1024; each step takes nearly 1 minute on an L40. Very slow.
@jordoh @tdrussell May I ask if your speeds are normal?


zeldapkmn commented Dec 17, 2024


Yep, working now after resizing heights to 1024 max

As a side note, have you got TorchCompile running successfully in Hunyuan?
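For anyone hitting the same out-of-memory wall, a minimal sketch of that kind of pre-resize (capping image height at 1024 while preserving aspect ratio, using Pillow); the directory names are placeholders:

# Downscale any image taller than 1024 px, writing results to a separate folder.
from pathlib import Path
from PIL import Image

src_dir = Path("dataset_raw")       # hypothetical input directory
dst_dir = Path("dataset_resized")   # hypothetical output directory
dst_dir.mkdir(exist_ok=True)

for img_path in sorted(src_dir.glob("*.png")):
    with Image.open(img_path) as im:
        if im.height > 1024:
            new_w = round(im.width * 1024 / im.height)
            im = im.resize((new_w, 1024), Image.Resampling.LANCZOS)
        im.save(dst_dir / img_path.name)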


wwwffbf commented Dec 17, 2024

As a side note, have you got TorchCompile running successfully in Hunyuan?

Triton worked, using the latest ComfyUI version.


jordoh commented Dec 17, 2024

One more question: how are trigger words set up here? Is it the same as Flux, i.e. just include the word in the caption file?

I've trained using JoyCaption-generated captions that include a unique trigger word; I haven't tried with just the trigger word alone. For videos, I'm using a JoyCaption-generated caption of the first frame, including a unique trigger word.
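A small sanity check that the trigger word actually appears in every caption file (the token and the dataset path are hypothetical):

# Flag caption files that don't contain the chosen trigger word.
from pathlib import Path

trigger = "sks_person"         # hypothetical unique trigger word
dataset_dir = Path("dataset")  # hypothetical dataset directory

for caption_path in sorted(dataset_dir.glob("*.txt")):
    if trigger not in caption_path.read_text(encoding="utf-8"):
        print(f"{caption_path.name} is missing the trigger word")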

@jordoh @tdrussell May I ask if your speeds are normal?

On an A40, I was seeing 30 minutes per epoch of 360 images (36 images x 10 repeats) with batch size 4 (so reported as 90 steps), at somewhere around 18 seconds per step (each step being a 4-image batch).
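Those numbers are roughly self-consistent; plain arithmetic, nothing trainer-specific:

# 36 images x 10 repeats = 360 items; at batch size 4 that's 90 steps per epoch.
steps_per_epoch = 360 // 4
seconds_per_step = 18
print(steps_per_epoch * seconds_per_step / 60)  # ~27 minutes, in line with ~30 min/epoch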

@comfyonline

@jordoh Thank you very much for the answer.

@comfyonline

@jordoh It looks like my batch size was effectively 16 (4 x 4), which is why the speed went down. The L40 48GB is supposed to be a lot faster than the A40, but for the same 36 images at 1024x1024 it took me 240 min and you almost 300 min, and we both ended up with almost 10 epochs.
