Resume training from checkpoints #20361
ArkashJ added the docs (Documentation related) and needs triage (Waiting to be triaged by maintainers) labels on Oct 23, 2024
Hi, the correct usage is indeed described in the official documentation: https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html#resume-training-state I think the wrong solutions were popularized because they applied to a previous version.
Yup, I went to the docs to figure out the correct solution. I hope this gets
sufficient attention, because most of the top search results and AI-generated
answers are wrong.
(Sent in reply to Arijit Ghosh's comment of Thu, Oct 24, 2024, 8:02 PM.)
Thank you @ArkashJ. Any help popularizing the correct way is welcome, thanks for surfacing. |
lantiga added the checkpointing (Related to checkpointing) label and removed the needs triage (Waiting to be triaged by maintainers) label on Nov 19, 2024
lantiga added a commit that referenced this issue on Dec 10, 2024:
…deprecated (#20361) (#20477)
* Update checkpointing documentation to mark resume_from_checkpoint as deprecated
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update docs/source-pytorch/common/checkpointing_basic.rst Co-authored-by: Luca Antiga <[email protected]>
* Update docs/source-pytorch/common/checkpointing_basic.rst Co-authored-by: Luca Antiga <[email protected]>
* Address review comments
* Address review comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luca Antiga <[email protected]>
Co-authored-by: Luca Antiga <[email protected]>
📚 Documentation
There's a lot of documentation out there about using the `resume_from_checkpoint` keyword in a PyTorch Lightning Trainer; however, this is wrong. In the latest Lightning version, one needs to pass the path to the checkpoint (the .ckpt file) itself to the trainer's fit function, via its `ckpt_path` argument, for training to resume. Here are some popular incorrect references (ChatGPT and Claude also got this wrong):
I wanted this to get visibility because knowing how to resume training from checkpoints is imperative and there's a lot of wrong information out there!
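For anyone landing here, a minimal sketch of the pattern the docs describe: pass the checkpoint file to `fit` through `ckpt_path` rather than passing `resume_from_checkpoint` to the `Trainer` constructor. The helper names `latest_checkpoint` and `resume_fit` and the `checkpoints` directory are hypothetical, chosen only for illustration; the `trainer` argument is assumed to be a Lightning `Trainer` and `model` a `LightningModule`.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the most recently modified .ckpt file in ckpt_dir, or None.

    Hypothetical helper, not part of Lightning.
    """
    candidates = sorted(
        Path(ckpt_dir).glob("*.ckpt"),
        key=lambda p: p.stat().st_mtime,
    )
    return str(candidates[-1]) if candidates else None


def resume_fit(trainer, model, ckpt_dir: str = "checkpoints") -> None:
    """Resume training from the newest checkpoint if one exists.

    Assumes `trainer` is a Lightning Trainer. The checkpoint path goes to
    fit(), not to the Trainer constructor. With ckpt_path=None, fit()
    starts training from scratch.
    """
    trainer.fit(model, ckpt_path=latest_checkpoint(ckpt_dir))
```

With a real trainer this would be called as `resume_fit(Trainer(max_epochs=10), model)`; calling `trainer.fit(model, ckpt_path="path/to/file.ckpt")` directly works just as well when the file is known.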
cc @Borda @awaelchli