
Resuming Trainer with ckpt_path disrupts logging by TensorBoardLogger #12991

Closed · arsedler9 opened this issue May 6, 2022 · 5 comments

Labels: checkpointing (Related to checkpointing) · logger: tensorboard · progress tracking (internal) (Related to the progress tracking dataclasses)
Assignee: carmocca
Milestone: 1.6.x

arsedler9 commented May 6, 2022

🐛 Bug

When I fit a model that uses a TensorBoardLogger, save a checkpoint with Trainer.save_checkpoint, and then resume training from that checkpoint with a new Trainer (via ckpt_path), the new Trainer's TensorBoardLogger loses the global step count and does not log to the same event files. As a result, only the data logged by the second run shows up in TensorBoard, with the global step starting again at zero.

To Reproduce

Run the Colab notebook I developed from the BoringModel bug-report template; a minimal sketch of the same steps is below. Note that the logged epochs in TensorBoard go from 10 to 20, but they start at step zero.
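Roughly, the notebook does the following (a minimal sketch, not the exact Colab code; ToyModel, the "repro" logger name, and worker.ckpt are stand-ins I chose, assuming pytorch-lightning 1.6.x):

```python
import torch
from torch.utils.data import DataLoader
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import TensorBoardLogger


class ToyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)  # logged against the trainer's step
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# A plain tensor works as a map-style dataset for this toy example.
train_loader = DataLoader(torch.randn(64, 32), batch_size=8)

# First run: train for 10 epochs, then save a checkpoint manually.
trainer = Trainer(
    max_epochs=10,
    log_every_n_steps=1,
    logger=TensorBoardLogger("lightning_logs", name="repro"),
)
trainer.fit(ToyModel(), train_loader)
trainer.save_checkpoint("worker.ckpt")

# Second run: resume from that checkpoint with a fresh Trainer and logger.
# Observed: the epoch counter resumes at 10, but the scalars land in a new
# version directory and their global step restarts at 0.
trainer2 = Trainer(
    max_epochs=20,
    log_every_n_steps=1,
    logger=TensorBoardLogger("lightning_logs", name="repro"),
)
trainer2.fit(ToyModel(), train_loader, ckpt_path="worker.ckpt")
```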

Expected behavior

I would prefer for the logger to resume logging from the global step count of the Trainer that saved the checkpoint, so that the full training curves from both runs are visible as one continuous trace in TensorBoard.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.21.6
    • pyTorch_debug: False
    • pyTorch_version: 1.11.0+cu113
    • pytorch-lightning: 1.6.3
    • tqdm: 4.64.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.13
    • version: #1 SMP Sun Apr 24 10:03:06 PDT 2022

Additional context

I am using PyTorch Lightning to run Population-Based Training, so I am often saving and resuming models. I'd prefer to have continuous TensorBoard logs for each worker so I don't end up having to keep track of n_workers * n_generations different traces in TensorBoard.
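A possible user-side stopgap for this workflow (a sketch under my own assumptions, not the eventual Lightning fix): pin the TensorBoardLogger version so every resumed generation writes into the same run directory, and persist a manual step counter in the checkpoint so the scalars continue where the previous run stopped. ManualStepModel, manual_step, the pbt_worker/worker_0 names, and worker.ckpt are illustrative choices of mine, not part of the Lightning API.

```python
import torch
from torch.utils.data import DataLoader
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import TensorBoardLogger


class ManualStepModel(LightningModule):
    """Toy model that carries its own TensorBoard step across save/resume cycles."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.manual_step = 0  # user-side step counter, persisted in the checkpoint

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.manual_step += 1
        # Write straight to the underlying SummaryWriter with the continuous step
        # value, bypassing self.log.
        self.logger.experiment.add_scalar(
            "train_loss", loss.item(), global_step=self.manual_step
        )
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def on_save_checkpoint(self, checkpoint):
        checkpoint["manual_step"] = self.manual_step

    def on_load_checkpoint(self, checkpoint):
        self.manual_step = checkpoint.get("manual_step", 0)


# A plain tensor works as a map-style dataset for this toy example.
train_loader = DataLoader(torch.randn(64, 32), batch_size=8)

# Pinning the logger version keeps every generation of a worker in one event
# directory, so TensorBoard shows a single continuous trace per worker.
logger = TensorBoardLogger("lightning_logs", name="pbt_worker", version="worker_0")
trainer = Trainer(max_epochs=20, logger=logger)
# "worker.ckpt" is assumed to be a checkpoint saved by an earlier generation.
trainer.fit(ManualStepModel(), train_loader, ckpt_path="worker.ckpt")
```

The tradeoff is that scalars written through the raw SummaryWriter bypass self.log, so they are not visible to callbacks such as ModelCheckpoint or to the progress bar.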

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @carmocca @edward-io

arsedler9 added the needs triage (Waiting to be triaged by maintainers) label on May 6, 2022
arsedler9 (Author) commented:

BTW, the bug_report_model.ipynb on Colab is a fantastic idea and well executed! It made it super easy for me to replicate this bug.

carmocca self-assigned this on May 10, 2022
carmocca added the progress tracking (internal), checkpointing, and logger: tensorboard labels and removed the needs triage label on May 10, 2022
carmocca added this to the 1.6.x milestone on May 10, 2022
rohitgr7 (Contributor) commented:

I think this is exactly the same issue as #12274.


arsedler9 (Author) commented May 10, 2022 via email

rohitgr7 (Contributor) commented:

It's different: the logging step used in the previous run is not restored, so it starts from 0 in the new run.

rohitgr7 (Contributor) commented:

Fixed here: #13467. Closing this.
