Resume training from checkpoints #20361

Closed
ArkashJ opened this issue Oct 23, 2024 · 3 comments · Fixed by #20477
Labels
checkpointing (Related to checkpointing), docs (Documentation related)

Comments

@ArkashJ

ArkashJ commented Oct 23, 2024

📚 Documentation

There's a lot of documentation out there about using the resume_from_checkpoint keyword in a PyTorch Lightning Trainer; however, this is wrong. In the latest version, one needs to provide the path to the checkpoint (.ckpt file) itself in the trainer's fit function to resume training. Here are some popular incorrect references:

  1. https://stackoverflow.com/questions/71961436/pytorch-lightning-resuming-from-checkpoint-with-new-data
  2. https://lightning.ai/forums/t/how-to-resume-training/432
  3. Resume training from checkpoint with new data #12845
  4. https://www.youtube.com/watch?v=V5KGEzIwAxQ

ChatGPT and Claude also got this wrong:
[Screenshot 2024-10-23 at 1.38.11 PM.png, upload incomplete]

I wanted this to get visibility because knowing how to resume training from checkpoints is essential, and there's a lot of wrong information out there!
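
For readers landing on this issue, here is a minimal sketch of the usage described above, assuming Lightning 2.x, where Trainer.fit accepts a ckpt_path keyword. The helper names latest_checkpoint and fit_with_resume, and the lightning_logs directory in the usage comment, are illustrative and not part of Lightning's API. The trainer is passed in as an argument, so the only Lightning call involved is fit(model, ckpt_path=...).

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the most recently modified .ckpt file under ckpt_dir, or None."""
    candidates = sorted(
        Path(ckpt_dir).rglob("*.ckpt"), key=lambda p: p.stat().st_mtime
    )
    return candidates[-1] if candidates else None


def fit_with_resume(trainer, model, ckpt_dir: str) -> Optional[Path]:
    """Call trainer.fit, resuming from the newest checkpoint if one exists.

    `trainer` is assumed to be a lightning.pytorch.Trainer (Lightning 2.x).
    Passing ckpt_path=None starts training from scratch.
    """
    ckpt = latest_checkpoint(ckpt_dir)
    # The checkpoint path goes to fit(), not to the Trainer constructor.
    trainer.fit(model, ckpt_path=str(ckpt) if ckpt else None)
    return ckpt


# Typical use, assuming Lightning 2.x is installed and MyModel is your
# own LightningModule (hypothetical name):
#   import lightning as L
#   trainer = L.Trainer(max_epochs=10)
#   fit_with_resume(trainer, MyModel(), "lightning_logs")
```

Picking the newest file by modification time is just one convention; if you already know which .ckpt file you want, pass its path straight to fit via ckpt_path.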

cc @Borda @awaelchli

ArkashJ added the docs and needs triage labels on Oct 23, 2024
@arijit-hub

Hey,

The correct definition is indeed mentioned in the official documentation: https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html#resume-training-state

I think the wrong solutions became popular because they were written for a previous version.

@ArkashJ
Author

ArkashJ commented Oct 25, 2024 via email

@lantiga
Collaborator

lantiga commented Nov 19, 2024

Thank you @ArkashJ. Any help popularizing the correct way is welcome; thanks for surfacing this.

lantiga added the checkpointing label and removed the needs triage label on Nov 19, 2024
lantiga added a commit that referenced this issue Dec 10, 2024
…deprecated (#20361) (#20477)

* Update checkpointing documentation to mark resume_from_checkpoint as deprecated

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update docs/source-pytorch/common/checkpointing_basic.rst

Co-authored-by: Luca Antiga <[email protected]>

* Update docs/source-pytorch/common/checkpointing_basic.rst

Co-authored-by: Luca Antiga <[email protected]>

* Address review comments

* Address review comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luca Antiga <[email protected]>
Co-authored-by: Luca Antiga <[email protected]>
3 participants