
restore epoch and step information when resuming training #5

Closed
thewh1teagle opened this issue Sep 11, 2024 · 3 comments

Comments

@thewh1teagle

When resuming training from a checkpoint, it starts from epoch 0 and step 0, although it should be much higher, and I can hear that it's already trained.
Can you fix it so the epoch and step are restored, so I can keep track of the step counter when resuming?

Lightning-AI/pytorch-lightning#12274

@rmcpantoja

rmcpantoja commented Sep 11, 2024

Hi,
Is this error occurring with the --restore-from-checkpoint argument of the trainer? That didn't happen when I used it, although I think it is intended more for finetuning purposes. If not, can you try that argument?
Cheers.

@thewh1teagle
Author

> occurring in --restore-from-checkpoint argument from the trainer?

Hey,
I don't see this argument in the trainer; maybe you're talking about an old version?

https://github.com/mush42/optispeech/blob/main/configs/train.yaml

@mush42
Owner

mush42 commented Sep 11, 2024

@thewh1teagle
are you using the forced_resume argument?
I added this to be able to (re)load the model when I make changes to its architecture while still benefiting from already-trained layers.
It is a very niche use case, but I added it to save myself some time during development.
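For context, the pattern described here (reloading weights into a changed architecture while keeping the trained layers) is usually done in PyTorch with non-strict state-dict loading. A minimal sketch, assuming plain PyTorch; the model classes below are illustrative, not taken from optispeech:

```python
import torch
from torch import nn

class OldModel(nn.Module):
    """Architecture at the time the checkpoint was saved."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)

class NewModel(nn.Module):
    """Revised architecture: keeps the encoder, adds a new head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.extra_head = nn.Linear(4, 2)  # layer added after the checkpoint

# Simulate a saved checkpoint from the old architecture.
old_state = OldModel().state_dict()

# strict=False loads matching keys (the encoder) and reports the rest,
# instead of raising on the missing extra_head weights.
model = NewModel()
result = model.load_state_dict(old_state, strict=False)
print(result.missing_keys)     # the new head has no saved weights
print(result.unexpected_keys)  # nothing in the checkpoint is unused
```

Note that this restores only weights, not the trainer state (epoch, step, optimizer), which is why it shows counters starting from zero.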

If you want to resume training normally, then the ckpt_path is all you need.
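Assuming the Hydra-style configs/train.yaml linked above, a normal resume would then look roughly like this (the checkpoint path is a placeholder):

```yaml
# configs/train.yaml (sketch): point ckpt_path at the last checkpoint
# to restore the full trainer state, including epoch and step counters.
ckpt_path: /path/to/checkpoints/last.ckpt
```

With ckpt_path set, Lightning restores the optimizer, epoch, and global step, so the counters continue from where training stopped rather than from zero.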

@mush42 mush42 closed this as completed Sep 11, 2024