Recommended Prodigy Settings (low steps per epoch) #8

Closed
brandostrong opened this issue Nov 7, 2023 · 9 comments

Comments

@brandostrong

Hi, I'm using Prodigy to train Stable Diffusion LoRAs and I'm amazed at how resistant it is to overtraining, but I've had a hard time nailing the final 10-20% I need before I can properly call a model trained. I have a few questions about schedulers and their args.

  1. I'm seeing conflicting recommendations on whether to set T_max for cosine annealing to the steps per epoch or to the total steps for the entire run.
  2. Is the cosine-with-restarts scheduler aware of how many total steps the run will take? That is, outside of T_max, does setting max steps affect early training?
  3. If I were to try a different scheduler which would you recommend?
  4. Assuming I have a small dataset, would you recommend sticking to low repeats / no regularization images to let the optimizer adjust, or should I try to keep the step count per epoch high?

Thank you, I'd appreciate any advice if you're familiar with my questions.

@konstmish
Owner

Hi, I'm glad you had a positive experience with Prodigy!

  1. We recommend having no restarts, which corresponds to setting T_max in cosine annealing to the total number of times scheduler.step() is called. In most cases, this is the total number of steps for the entire run. If you decide to use restarts, make sure to set safeguard_warmup=True, which sometimes prevents the estimated learning rate from blowing up.
  2. Cosine annealing with restarts is usually agnostic to the total number of steps.
  3. I'd suggest trying PolynomialLR with power=1.0 (the default value); it seems to be quite helpful when using Adam.
  4. Unfortunately, we haven't run any experiments of this kind, so it'd be actually helpful to us if you share your experience.
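To make points 1 to 3 concrete, here is a minimal sketch in plain Python of the closed-form schedules given in the PyTorch documentation for `CosineAnnealingLR` and `PolynomialLR`. The helper names and the step counts are made up for illustration, and PyTorch's own schedulers use a recursive update that can differ slightly from the closed form. With Prodigy, the value acts as a multiplier on the estimated step size.

```python
import math


def cosine_factor(step: int, t_max: int, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing multiplier for a base LR of 1.0."""
    return eta_min + (1.0 - eta_min) * (1.0 + math.cos(math.pi * step / t_max)) / 2.0


def linear_factor(step: int, total_iters: int, power: float = 1.0) -> float:
    """Polynomial decay multiplier; power=1.0 is a straight line down to zero."""
    return (1.0 - min(step, total_iters) / total_iters) ** power


# Hypothetical run: 10 epochs of 40 optimizer steps each.
steps_per_epoch = 40
total_steps = 400

for step in (0, 40, 80, 200, 400):
    print(
        step,
        round(cosine_factor(step, total_steps), 3),      # T_max = total steps
        round(cosine_factor(step, steps_per_epoch), 3),  # T_max = steps per epoch
        round(linear_factor(step, total_steps), 3),      # PolynomialLR, power=1.0
    )
```

With T_max set to the total step count, the multiplier decays once over the whole run. With T_max set to the steps per epoch, the closed form reaches zero at the end of the first epoch and then climbs back up, which is a restart-like pattern rather than the "no restarts" setup recommended above.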

I hope this helps, but let us know if you have any other questions.

@brandostrong
Author

Thank you for your reply. For #4, my testing suggests that a diverse set doesn't need repeats, while a small set still benefits from them, though less than with other, less dynamic optimizers. I'll check out polynomial vs. T_max=steps today.

@madman404

I'm going to assume for the sake of this post that you are training LoRA within Kohya, since that is the most common.

As it pertains to your concerns about T_max: if you use the "cosine" scheduler setting in Kohya, all of that is handled for you. You don't need to pass it any additional arguments, it'll do the math and set the LR schedule appropriately.

On the matter of repeats: You don't need them unless you're training multiple concepts or using regularization images. Without that, one repeat is going to be largely equivalent to one epoch. For better ease of tracking your training, it's best not to use repeats at all.
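As a rough illustration of why one repeat is largely equivalent to one epoch, the step count can be sketched as below. This is a simplified model, not Kohya's actual bookkeeping: it ignores gradient accumulation, aspect-ratio bucketing, and regularization images, and the function name and numbers are hypothetical.

```python
import math


def total_optimizer_steps(num_images: int, repeats: int, epochs: int, batch_size: int) -> int:
    """Simplified step count: each epoch sees every image `repeats` times."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


# Hypothetical dataset of 20 images with batch size 5.
with_repeats = total_optimizer_steps(num_images=20, repeats=10, epochs=1, batch_size=5)
with_epochs = total_optimizer_steps(num_images=20, repeats=1, epochs=10, batch_size=5)
print(with_repeats, with_epochs)
```

Under these assumptions both configurations take the same number of optimizer steps; the second one simply reports progress (and typically saves checkpoints) ten times as often, which is what makes training easier to track.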

On the matter of "nailing the final 10-20%": I was dealing with a dataset that showed similar resistance to learning the finer details. Full disclosure: I am not technically competent enough to anticipate whether this has non-obvious adverse effects on the behavior of Prodigy, but what worked for me in getting the learning process to conclude properly (where adjusting d_coef, network dimensions, all manner of step counts, schedulers, and dataset images failed) was actually to adjust beta2 and the weight decay.

In particular, the parameters that worked for me were betas of (0.9, 0.99), weight_decay of .1, and batch size of 5 over about 1000-2000 steps. Everything else was left to defaults. I did not enable bias correction, but I am not speaking affirmatively or negatively on that because I haven't gone back to test that yet. If you are still struggling, you may want to give these settings a try and see what works for you. I suspect the most important parts were the lowered beta2 (which, as far as I can tell, should improve "remembering" details from previous steps) and raised weight decay.
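For readers using Prodigy directly from Python, the settings above might look like the sketch below. It assumes the `prodigyopt` package and its `Prodigy` class; `lr=1.0` is an assumption based on "everything else was left to defaults", and `build_optimizer` and `ema_horizon` are hypothetical helpers added only for illustration.

```python
# Settings reported in the comment above; lr=1.0 is assumed (the Prodigy default).
PRODIGY_KWARGS = {
    "lr": 1.0,
    "betas": (0.9, 0.99),          # beta2 lowered from the default 0.999
    "weight_decay": 0.1,
    "use_bias_correction": False,  # left disabled, as in the comment
}


def build_optimizer(params):
    """Create a Prodigy optimizer; requires `pip install prodigyopt`."""
    from prodigyopt import Prodigy  # imported lazily so the sketch loads without it

    return Prodigy(params, **PRODIGY_KWARGS)


def ema_horizon(beta: float) -> float:
    """Rough number of recent steps an exponential moving average weights most."""
    return 1.0 / (1.0 - beta)


print(round(ema_horizon(0.999)), round(ema_horizon(0.99)))
```

One way to read the beta2 change: the squared-gradient average looks back over roughly 1 / (1 - beta2) steps, so about 1000 steps at the default and about 100 at 0.99. In a run of only 1000-2000 steps, the default window spans a large share of training, which may be part of why the shorter window behaved differently here.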

I'm sorry if this kind of discussion is not suited for the issues page of the optimizer, but I hope my personal observations on training diffusion models may help.

@brandostrong
Author

> (0.9, 0.99)

Wow. Your betas suggestion is a dramatic improvement, thank you.

@konstmish
Owner

> I'm sorry if this kind of discussion is not suited for the issues page of the optimizer, but I hope my personal observations on training diffusion models may help.

Not at all, your observations are very welcome here.

@umarbutler

> 1. We recommend having no restarts, which corresponds to setting T_max in cosine annealing to the total number of times scheduler.step() is called. In most cases, this is the total number of steps for the entire run. If you decide to use restarts, make sure to set safeguard_warmup=True, which sometimes prevents the estimated learning rate from blowing up.
> 2. Cosine annealing with restarts is usually agnostic to the total number of steps.
> 3. I'd suggest trying PolynomialLR with power=1.0 (the default value); it seems to be quite helpful when using Adam.

@konstmish Might I suggest including some of this information in your README? It answers a lot of questions I had that the README wasn't able to answer for me.

@konstmish
Owner

@umarbutler Thanks for the suggestions. I've updated the README, and I hope it is more helpful now.

@askerlee

askerlee commented Jan 4, 2024

Hi @madman404, thanks for your important tip! Have you tried using (0.9, 0.99) as betas for AdamW as well? Thanks.

@phageous

phageous commented Nov 9, 2024

Adding a data point: when training anime character LoRAs on Flux1.dev, setting betas to (0.9, 0.99) makes Prodigy NOT able to learn some characters' hair color. This is especially true if the character has a lighter hair color such as light blue or grey. It does seem to learn other attributes better, though.


6 participants