
How to load layer weights in ComfyUI node #13

Open
junsukha opened this issue Dec 17, 2024 · 5 comments


junsukha commented Dec 17, 2024

Thanks for sharing your work.

I don't know how to load the LoRA model in the "Hunyuanvideo lora select" node.
Below are the weights I have after training a LoRA with your script.
[screenshot: output folder listing layer_xx-model_states.pt files]

Are "layer_xx-model_states.pt" all different weights? Am I not supposed to use all layer weights together?

[screenshot: the "Hunyuanvideo lora select" node in ComfyUI]

tdrussell (Owner) commented:

There should be directories named after the epoch, e.g. "epoch7", that contain a safetensors file, which is the LoRA. The files in the screenshot are DeepSpeed internal checkpoint files. The save frequency is controlled by the "save_every_n_epochs" config field, which is 2 in the example config file.
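
For orientation, the output layout being described should look roughly like this. The date-stamped run folder matches later comments in this thread, but the adapter_model.safetensors filename and the DeepSpeed checkpoint folder name are assumptions, not verified against the script:

output_dir/
└── 20241217_01-23-45/                  # run folder named after the start date
    ├── epoch2/
    │   └── adapter_model.safetensors   # the LoRA to load in ComfyUI (filename assumed)
    ├── epoch4/
    │   └── ...
    └── global_step500/                 # DeepSpeed checkpoint (name assumed): the layer_xx-model_states.pt files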

junsukha (Author) commented Dec 17, 2024

@tdrussell
oh no. ok.

# Probably want to set this a bit higher if you have a smaller dataset so you don't end up with a million saved models.
save_every_n_epochs = 10 # 2
# Can checkpoint the training state every n number of epochs or minutes. Set only one of these. You can resume from checkpoints using the --resume_from_checkpoint flag.
#checkpoint_every_n_epochs = 1
checkpoint_every_n_minutes = 30

So neither checkpoint_every_n_minutes nor checkpoint_every_n_epochs saves the model?

The config I used was:

save_every_n_epochs = 10
checkpoint_every_n_minutes = 30

# dataset.toml
num_repeats = 10

Perhaps that's why it didn't save any directories named after the epoch, as you said: num_repeats increased the time to run one epoch, and on top of that I had set save_every_n_epochs = 10?
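
To put rough numbers on it (the dataset size here is hypothetical): with 100 items and num_repeats = 10, one epoch is 1,000 sample passes, so save_every_n_epochs = 10 would mean the first epochN folder only appears after 10,000 passes, i.e. ten full epochs.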

tdrussell (Owner) commented:

The checkpointing settings are only for DeepSpeed checkpoints, which is the per-layer thing in your screenshots. And yes, you probably didn't train long enough for it to save even once. num_repeats works like it does in Kohya's sd-scripts: the items in your dataset are logically repeated that many times, so one epoch takes longer.

You can still "salvage" the training run by setting save_every_n_epochs to 1, then resuming with --resume_from_checkpoint (since you have DeepSpeed checkpoints), and waiting one epoch. Or even go into the train.py code and put a "saver.save_model('some_name')" right before the training loop to save immediately.
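
A minimal sketch of that salvage route, assuming the launch command from the example docs (exact paths and the deepspeed invocation may differ on your setup):

# in the config file: write out the LoRA after every epoch
save_every_n_epochs = 1

# relaunch, resuming from the existing DeepSpeed checkpoint
deepspeed --num_gpus=1 train.py --deepspeed --config examples/hunyuan_video.toml --resume_from_checkpoint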

junsukha (Author) commented:

@tdrussell
Thx!
Works like a charm.
BTW, one thing to note is that the hunyuan_video.toml file saved in each epochxx folder looks wrong.
I haven't checked all the variables, but at least checkpoint_every_n_epochs and rank are set to different values from what I used.
[screenshot: the hunyuan_video.toml saved under an epochxx folder, showing values that differ from the training config]

The hunyuan_video.toml under the date folder (starting with 2024) is correct, but not the one under each epochx folder.

tdrussell (Owner) commented:

It will copy the current config file, whatever it is, at the moment it saves. Did you change the config file after starting training?

Probably it should be changed to read all the file bytes at training startup and just write that.
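
A minimal sketch of that change, with hypothetical names (config_path, save_dir); this illustrates the idea rather than the repo's actual code:

from pathlib import Path

# At training startup: capture the config file's raw bytes once.
config_path = Path('examples/hunyuan_video.toml')
config_bytes = config_path.read_bytes()

def save_config_snapshot(save_dir):
    # Write the startup-time snapshot next to each save, so edits made
    # to the live config file mid-run can't leak into the saved copies.
    (Path(save_dir) / config_path.name).write_bytes(config_bytes)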
