Recommended Prodigy Settings (low steps per epoch) #8

brandostrong · 2023-11-07T01:57:36Z

Hi, I'm using Prodigy to train Stable Diffusion LoRAs and I'm amazed at how resistant it is to overtraining, but I've had a hard time nailing the final 10-20% I need before I can properly call it trained. I have a few questions about schedulers and their args.

1. I'm seeing conflicting recommendations on whether to set T_max for cosine annealing to the steps per epoch or to the total steps for the entire run.
2. Is the cosine with restarts scheduler aware of how many total steps the training run will take, outside of T_max? (As in, does setting max steps affect early training?)
3. If I were to try a different scheduler, which would you recommend?
4. Assuming I have a small dataset, would you recommend sticking to low repeats / no regularization images to let the optimizer adjust, or should I try to keep the step count per epoch high?

Thank you, I'd appreciate any advice if you're familiar with my questions.

konstmish · 2023-11-07T18:05:33Z

Hi, I'm glad you had a positive experience with Prodigy!

1. We recommend having no restarts, which corresponds to setting T_max in cosine annealing to the total number of times scheduler.step() is called. In most cases, this is the total number of steps for the entire run. If you decide to use restarts, make sure to set safeguard_warmup=True, which sometimes prevents the estimated learning rate from blowing up.
2. Cosine annealing with restarts is usually agnostic to the total number of steps.
3. I'd suggest trying PolynomialLR with power=1.0 (the default value); it seems to be quite helpful when using Adam.
4. Unfortunately, we haven't run any experiments of this kind, so it would actually be helpful to us if you shared your experience.
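
To make the T_max point concrete, here is a minimal sketch in plain Python of the closed-form multipliers behind cosine annealing (with a minimum of 0) and linear decay (polynomial decay with power=1.0). It does not call PyTorch or Prodigy; the function names and the step counts are hypothetical, and it assumes the scheduler follows the textbook closed form.

```python
import math


def cosine_factor(step: int, t_max: int) -> float:
    """Closed-form cosine annealing multiplier, assuming a minimum of 0."""
    return 0.5 * (1.0 + math.cos(math.pi * step / t_max))


def linear_factor(step: int, total_steps: int) -> float:
    """Linear decay to zero, i.e. polynomial decay with power=1.0."""
    return 1.0 - min(step, total_steps) / total_steps


total_steps = 1000     # hypothetical length of the whole run
steps_per_epoch = 100  # hypothetical number of steps in one epoch

# T_max = total_steps: one smooth decay from 1 to 0 over the entire run.
full_run = [cosine_factor(s, total_steps) for s in range(total_steps + 1)]

# T_max = steps_per_epoch: the multiplier reaches 0 after one epoch and then
# climbs back to 1 by the end of the second epoch, which acts like a restart.
per_epoch = [cosine_factor(s, steps_per_epoch) for s in range(total_steps + 1)]

# Linear decay over the whole run, for comparison.
linear_run = [linear_factor(s, total_steps) for s in range(total_steps + 1)]
```

Under these assumptions, only the T_max = total_steps schedule decreases for the entire run. The per-epoch setting keeps raising the multiplier again, which is the restart-like behaviour the recommendation above is meant to avoid.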

I hope this helps, but let us know if you have any other questions.

brandostrong · 2023-11-07T21:21:41Z

Thank you for your reply. For #4, in my testing there's no need for repeats with a diverse set, but for a small set there are still benefits to repeats, though less so than with other, less dynamic optimizers. I'll check out polynomial vs. T_max=steps today.

madman404 · 2023-11-09T19:11:31Z

I'm going to assume for the sake of this post that you are training LoRA within Kohya, since that is the most common.

As it pertains to your concerns about T_max: if you use the "cosine" scheduler setting in Kohya, all of that is handled for you. You don't need to pass it any additional arguments; it'll do the math and set the LR schedule appropriately.

On the matter of repeats: You don't need them unless you're training multiple concepts or using regularization images. Without that, one repeat is going to be largely equivalent to one epoch. For better ease of tracking your training, it's best not to use repeats at all.
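
The repeats-versus-epochs point comes down to step arithmetic. The sketch below assumes that steps per epoch is simply the number of images times repeats, divided by the batch size and rounded up; that is a simplification that ignores bucketing and gradient accumulation, and the helper names and dataset numbers are hypothetical.

```python
import math


def steps_per_epoch(num_images: int, repeats: int, batch_size: int) -> int:
    """Steps in one epoch under the simplified assumption described above."""
    return math.ceil(num_images * repeats / batch_size)


def total_steps(num_images: int, repeats: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps for the whole run."""
    return steps_per_epoch(num_images, repeats, batch_size) * epochs


# Hypothetical dataset: 20 images, batch size 5.
with_repeats = total_steps(20, repeats=10, batch_size=5, epochs=10)
without_repeats = total_steps(20, repeats=1, batch_size=5, epochs=100)
```

In this example both configurations run the same number of total steps; only the bookkeeping per epoch differs, which is why a schedule tied to total steps should not care which one is used.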

On the matter of "nailing the final 10-20%": I was dealing with a dataset that had similar resistance to learning the finer details as well. Full disclosure, I am not technically competent enough to anticipate if this has non-obvious adverse effects on the behavior of Prodigy, but what worked for me in getting the learning process to conclude properly (where adjusting d_coef, network dimensions, all manners of step counts, schedulers, and dataset images failed) was actually to adjust beta2 and the weight decay.

In particular, the parameters that worked for me were betas of (0.9, 0.99), weight_decay of 0.1, and batch size of 5 over about 1000-2000 steps. Everything else was left to defaults. I did not enable bias correction, but I am not speaking affirmatively or negatively on that because I haven't gone back to test that yet. If you are still struggling, you may want to give these settings a try and see what works for you. I suspect the most important parts were the lowered beta2 (which, as far as I can tell, should improve "remembering" details from previous steps) and raised weight decay.
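
For reference, the two non-default settings above can be written down as keyword arguments. This is a sketch only: it assumes the optimizer accepts `betas` and `weight_decay` keywords as named in this thread, and that the trainer takes optimizer arguments as `key=value` strings with tuples written as comma-separated values (my understanding of Kohya's optimizer arguments option, so check your version). The helper `to_key_value_args` is hypothetical.

```python
# Settings reported above; anything not listed stays at the optimizer defaults.
prodigy_kwargs = {
    "betas": (0.9, 0.99),  # second value is beta2
    "weight_decay": 0.1,
}


def to_key_value_args(kwargs: dict) -> list:
    """Render kwargs as key=value strings; tuples become comma-separated values."""
    parts = []
    for key, value in kwargs.items():
        if isinstance(value, (tuple, list)):
            value_text = ",".join(str(v) for v in value)
        else:
            value_text = str(value)
        parts.append(f"{key}={value_text}")
    return parts


optimizer_args = to_key_value_args(prodigy_kwargs)
```

If you construct the optimizer directly in Python instead, the same dictionary could be unpacked into the constructor call, again assuming those keyword names match your installed version.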

I'm sorry if this kind of discussion is not suited for the issues page of the optimizer, but I hope my personal observations on training diffusion models may help.

brandostrong · 2023-11-09T23:52:22Z

> (0.9, 0.99)

Wow. Your betas suggestion is a dramatic improvement, thank you.

konstmish · 2023-11-12T17:23:52Z

> I'm sorry if this kind of discussion is not suited for the issues page of the optimizer, but I hope my personal observations on training diffusion models may help.

Not at all, your observations are very welcome here.

umarbutler · 2023-11-29T08:59:18Z

> We recommend having no restarts, which corresponds to setting T_max in cosine annealing to the total number of steps the scheduler.step() is called. In most cases, this is the total number of steps for the entire run. If you decide to use restarts, make sure to set safeguard_warmup=True, which sometimes prevents the estimated learning rate from blowing up.
>
> Cosine annealing with restarts is usually agnostic to the total number of steps.
>
> I'd suggest trying PolynomialLR with power=1.0 (default value), it seems to be quite helpful when using Adam.

@konstmish Might I suggest including some of this information in your README? It answers a lot of questions I had that the README wasn't able to answer for me.

konstmish · 2023-11-29T15:07:03Z

@umarbutler Thanks for the suggestion. I've updated the README; I hope it is more helpful now.

askerlee · 2024-01-04T05:39:39Z

Hi @madman404, thanks for your important tip! Have you tried using (0.9, 0.99) as betas for AdamW as well? Thanks.

phageous · 2024-11-09T21:41:18Z

Adding a data point: when training anime character LoRA on Flux1.dev, setting betas to (0.9, 0.99) makes Prodigy NOT able to learn some characters' hair color. This is especially true if the character has lighter hair color such as light blue or grey. It does seem to learn other attributes better though.

konstmish closed this as completed Nov 17, 2023

konstmish mentioned this issue Nov 21, 2023

Add features to the Dreambooth LoRA SDXL training script huggingface/diffusers#5508

Merged

dxqbYD mentioned this issue Oct 31, 2024

more VRAM savings: no first moment and factored second moment #25

Open

LoganBooker mentioned this issue Nov 5, 2024

Prodigy is not working well with Stable Diffusion 3.5 Medium training #27

Closed
