
How to load layer weights in ComfyUI node #13

Open
junsukha opened this issue Dec 17, 2024 · 5 comments


junsukha commented Dec 17, 2024

Thanks for sharing your work.

I don't know how to load the LoRA model in the "Hunyuanvideo lora select" node.
Below are the weights I have after training a LoRA with your script.
[screenshot: output folder listing layer_xx-model_states.pt files]

Are "layer_xx-model_states.pt" all different weights? Am I not supposed to use all layer weights together?

[screenshot: the "Hunyuanvideo lora select" node in ComfyUI]

tdrussell (Owner) commented:

There should be directories named after the epoch, e.g. "epoch7", that contain a safetensors file, which is the LoRA. The files in the screenshot are DeepSpeed internal checkpoint files. The save frequency is controlled by the "save_every_n_epochs" config field, which is 2 in the example config file.
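
For orientation, the output layout being described should look roughly like this. The date-stamped run folder matches later comments in this thread, but the adapter_model.safetensors filename and the DeepSpeed checkpoint folder name are assumptions, not verified against the script:

output_dir/
└── 20241217_01-23-45/                  # run folder named after the start date
    ├── epoch2/
    │   └── adapter_model.safetensors   # the LoRA to load in ComfyUI (filename assumed)
    ├── epoch4/
    │   └── ...
    └── global_step500/                 # DeepSpeed checkpoint (name assumed): the layer_xx-model_states.pt files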

junsukha (Author) commented Dec 17, 2024

@tdrussell
oh no. ok.

# Probably want to set this a bit higher if you have a smaller dataset so you don't end up with a million saved models.
save_every_n_epochs = 10 # 2
# Can checkpoint the training state every n number of epochs or minutes. Set only one of these. You can resume from checkpoints using the --resume_from_checkpoint flag.
#checkpoint_every_n_epochs = 1
checkpoint_every_n_minutes = 30

So neither checkpoint_every_n_minutes nor checkpoint_every_n_epochs saves the model?

The config I used was:

save_every_n_epochs = 10
checkpoint_every_n_minutes = 30

# dataset.toml
num_repeats = 10

Perhaps that's why it didn't save any directories named after the epoch, as you said: num_repeats increased the time to run one epoch, and on top of that I had set save_every_n_epochs = 10?
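
To put rough numbers on it (the dataset size here is hypothetical): with 100 items and num_repeats = 10, one epoch is 1,000 sample passes, so save_every_n_epochs = 10 would mean the first epochN folder only appears after 10,000 passes, i.e. ten full epochs.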

tdrussell (Owner) commented:

The checkpointing settings are only for DeepSpeed checkpoints, which is the per-layer thing in your screenshots. And yes, you probably didn't train long enough for it to save even once. num_repeats works like it does in Kohya's sd-scripts: the items in your dataset are logically repeated that many times, so one epoch takes longer.

You can still "salvage" the training run by setting save_every_n_epochs to 1, then resuming with --resume_from_checkpoint (since you have DeepSpeed checkpoints), and waiting one epoch. Or even go into the train.py code and put a "saver.save_model('some_name')" right before the training loop to save immediately.
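
A minimal sketch of that salvage route, assuming the launch command from the example docs (exact paths and the deepspeed invocation may differ on your setup):

# in the config file: write out the LoRA after every epoch
save_every_n_epochs = 1

# relaunch, resuming from the existing DeepSpeed checkpoint
deepspeed --num_gpus=1 train.py --deepspeed --config examples/hunyuan_video.toml --resume_from_checkpoint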

junsukha (Author) commented:

@tdrussell
Thx!
Works like a charm.
BTW, one thing to note is that the hunyuan_video.toml file saved in each epochxx folder looks wrong.
I haven't checked all the variables, but at least checkpoint_every_n_epochs and rank are set to different values from what I used.
[screenshot: the hunyuan_video.toml saved under an epochxx folder, showing values that differ from the training config]

The hunyuan_video.toml under the date folder (starting with 2024) is correct, but not the one under each epochx folder.

tdrussell (Owner) commented:

It will copy the current config file, whatever it is, at the moment it saves. Did you change the config file after starting training?

Probably it should be changed to read all the file bytes at training startup and just write that.
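
A minimal sketch of that change, with hypothetical names (config_path, save_dir); this illustrates the idea rather than the repo's actual code:

from pathlib import Path

# At training startup: capture the config file's raw bytes once.
config_path = Path('examples/hunyuan_video.toml')
config_bytes = config_path.read_bytes()

def save_config_snapshot(save_dir):
    # Write the startup-time snapshot next to each save, so edits made
    # to the live config file mid-run can't leak into the saved copies.
    (Path(save_dir) / config_path.name).write_bytes(config_bytes)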
