Training doesn't resume from previous checkpoint using max_train_steps #1172

playerzer0x · 2024-11-21T01:59:08Z

I train a model to 10k steps.
I change max_train_steps in the config to 15000.
I swap out the data loader for an updated multidatabackend.
I start training and receive this message:
2024-11-21 01:52:34,920 [INFO] Reached the end (58 epochs) of our training run (42 epochs). This run will do zero steps.
Training doesn't continue.

If I set max_train_steps to 0 and change num_train_epochs to 100, training starts fine. I haven't counted, but the updated dataset used for the resume may be smaller than the original dataset.

My brain thinks in steps, so I would prefer to use steps over epochs.

bghira · 2024-11-21T03:53:17Z

well, that is normal. you are no longer resuming the old training run, as you have changed everything.

it's not really recommended to change anything within a single training run, let alone the entire dataset or the step schedule
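
For anyone reading along, here is a rough sketch of the arithmetic behind that log line. It assumes the trainer derives the planned epoch count as ceil(max_train_steps / steps_per_epoch), which is the usual pattern in step-based training scripts but is not confirmed for this codebase. The function names and the steps-per-epoch values (173 and 358) are hypothetical, picked only because they reproduce the 58 and 42 from the log.

```python
import math


def planned_epochs(max_train_steps: int, steps_per_epoch: int) -> int:
    """Epochs needed to cover max_train_steps, rounded up (assumed formula)."""
    return math.ceil(max_train_steps / steps_per_epoch)


def runs_zero_steps(resumed_epoch: int, max_train_steps: int, steps_per_epoch: int) -> bool:
    """True when the epoch stored in the checkpoint already meets the new plan."""
    return resumed_epoch >= planned_epochs(max_train_steps, steps_per_epoch)


# Hypothetical steps-per-epoch values, chosen only to match the log message.
OLD_STEPS_PER_EPOCH = 173  # original dataset
NEW_STEPS_PER_EPOCH = 358  # updated multidatabackend

old_epochs = planned_epochs(10_000, OLD_STEPS_PER_EPOCH)  # epoch saved in the checkpoint
new_epochs = planned_epochs(15_000, NEW_STEPS_PER_EPOCH)  # plan for the resumed run
zero_steps = runs_zero_steps(old_epochs, 15_000, NEW_STEPS_PER_EPOCH)

print(f"checkpoint epoch: {old_epochs}, planned epochs: {new_epochs}, zero steps: {zero_steps}")
```

Under these assumptions, the epoch counter restored from the checkpoint (58) is already past the epoch count computed for the new dataset (42), so the run stops immediately even though 15000 is greater than 10000 in steps. Changing the dataset changes steps_per_epoch, which is why the step budget alone does not decide whether training continues.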

playerzer0x · 2024-11-21T21:11:00Z

This change would be across two separate training runs. I'm following Caith's recommendation on training new subjects into a "base LoKR" that was previously trained on styles.
