When resuming training from a checkpoint, the run starts at epoch 0 and step 0, even though the counters should be much higher; I can hear that the model is already trained.
Can you fix this so the epoch and step counters are restored, and I can keep track of the step count when resuming?
Hi,
Does this happen when you use the trainer's --restore-from-checkpoint argument? I didn't see this behavior when I used it, although I think it is intended more for fine-tuning. If you haven't tried that argument yet, could you give it a try?
Cheers.
@thewh1teagle
are you using the forced_resume argument?
I added it so I can (re)load the model after changing its architecture while still benefiting from the already-trained layers.
It is a very niche use case, but it saves me some time during development.
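For context, here is a minimal sketch of the kind of partial weight reuse that enables. This is not the repo's actual forced_resume code, just an illustration of the idea: keep the checkpoint entries whose names and shapes still match the modified model, and let everything else initialize from scratch.

```python
import torch
from torch import nn

# Stand-ins for "old architecture" and "modified architecture".
old = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 4))   # what the checkpoint was trained with
new = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 8))   # last layer changed

state_dict = old.state_dict()  # stands in for torch.load("last.ckpt")["state_dict"]

# Keep only entries that still exist in the new model with an identical shape,
# so strict=False loading never hits a shape mismatch.
new_state = new.state_dict()
compatible = {k: v for k, v in state_dict.items()
              if k in new_state and v.shape == new_state[k].shape}

result = new.load_state_dict(compatible, strict=False)
print("reused from checkpoint:", list(compatible))   # ['0.weight', '0.bias']
print("left at fresh init:", result.missing_keys)    # ['1.weight', '1.bias']
```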
If you want to resume training normally, then ckpt_path is all you need.
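For a normal resume, something along these lines should be enough. LitModel and datamodule are placeholders for the project's actual LightningModule and data setup, not names from this repo:

```python
import lightning as L  # or: import pytorch_lightning as pl

model = LitModel()                 # placeholder LightningModule
trainer = L.Trainer(max_epochs=1000)

# Passing ckpt_path restores weights, optimizer state, the epoch counter and
# the global step, so logging continues from where the previous run stopped.
trainer.fit(model, datamodule=datamodule, ckpt_path="checkpoints/last.ckpt")
```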
Related: Lightning-AI/pytorch-lightning#12274