
Resuming Trainer with ckpt_path disrupts logging by TensorBoardLogger #12991

Closed · arsedler9 opened this issue May 6, 2022 · 5 comments

Labels: checkpointing (Related to checkpointing) · logger: tensorboard · progress tracking (internal) (Related to the progress tracking dataclasses)
Assignee: carmocca
Milestone: 1.6.x

arsedler9 commented May 6, 2022

🐛 Bug

When I fit a model that uses a TensorBoardLogger, save a checkpoint with Trainer.save_checkpoint, and then resume training from that checkpoint with a new Trainer (via ckpt_path), the new Trainer's TensorBoardLogger loses the global step count and does not log to the same event files. As a result, only the data logged by the second run shows up in TensorBoard, with the global step starting again at zero.

To Reproduce

Run the Colab notebook I developed from the BoringModel bug-report template; a minimal sketch of the same steps is below. Note that the logged epochs in TensorBoard go from 10 to 20, but they start at step zero.
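Roughly, the notebook does the following (a minimal sketch, not the exact Colab code; ToyModel, the "repro" logger name, and worker.ckpt are stand-ins I chose, assuming pytorch-lightning 1.6.x):

```python
import torch
from torch.utils.data import DataLoader
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import TensorBoardLogger


class ToyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)  # logged against the trainer's step
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# A plain tensor works as a map-style dataset for this toy example.
train_loader = DataLoader(torch.randn(64, 32), batch_size=8)

# First run: train for 10 epochs, then save a checkpoint manually.
trainer = Trainer(
    max_epochs=10,
    log_every_n_steps=1,
    logger=TensorBoardLogger("lightning_logs", name="repro"),
)
trainer.fit(ToyModel(), train_loader)
trainer.save_checkpoint("worker.ckpt")

# Second run: resume from that checkpoint with a fresh Trainer and logger.
# Observed: the epoch counter resumes at 10, but the scalars land in a new
# version directory and their global step restarts at 0.
trainer2 = Trainer(
    max_epochs=20,
    log_every_n_steps=1,
    logger=TensorBoardLogger("lightning_logs", name="repro"),
)
trainer2.fit(ToyModel(), train_loader, ckpt_path="worker.ckpt")
```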

Expected behavior

I would prefer for the logger to resume logging from the global step count of the Trainer that saved the checkpoint, so that the full training curves from both runs are visible as one continuous trace in TensorBoard.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.21.6
    • pyTorch_debug: False
    • pyTorch_version: 1.11.0+cu113
    • pytorch-lightning: 1.6.3
    • tqdm: 4.64.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.13
    • version: #1 SMP Sun Apr 24 10:03:06 PDT 2022

Additional context

I am using PyTorch Lightning to run Population-Based Training, so I am often saving and resuming models. I'd prefer to have continuous TensorBoard logs for each worker so I don't end up having to keep track of n_workers * n_generations different traces in TensorBoard.
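A possible user-side stopgap for this workflow (a sketch under my own assumptions, not the eventual Lightning fix): pin the TensorBoardLogger version so every resumed generation writes into the same run directory, and persist a manual step counter in the checkpoint so the scalars continue where the previous run stopped. ManualStepModel, manual_step, the pbt_worker/worker_0 names, and worker.ckpt are illustrative choices of mine, not part of the Lightning API.

```python
import torch
from torch.utils.data import DataLoader
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import TensorBoardLogger


class ManualStepModel(LightningModule):
    """Toy model that carries its own TensorBoard step across save/resume cycles."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.manual_step = 0  # user-side step counter, persisted in the checkpoint

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.manual_step += 1
        # Write straight to the underlying SummaryWriter with the continuous step
        # value, bypassing self.log.
        self.logger.experiment.add_scalar(
            "train_loss", loss.item(), global_step=self.manual_step
        )
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def on_save_checkpoint(self, checkpoint):
        checkpoint["manual_step"] = self.manual_step

    def on_load_checkpoint(self, checkpoint):
        self.manual_step = checkpoint.get("manual_step", 0)


# A plain tensor works as a map-style dataset for this toy example.
train_loader = DataLoader(torch.randn(64, 32), batch_size=8)

# Pinning the logger version keeps every generation of a worker in one event
# directory, so TensorBoard shows a single continuous trace per worker.
logger = TensorBoardLogger("lightning_logs", name="pbt_worker", version="worker_0")
trainer = Trainer(max_epochs=20, logger=logger)
# "worker.ckpt" is assumed to be a checkpoint saved by an earlier generation.
trainer.fit(ManualStepModel(), train_loader, ckpt_path="worker.ckpt")
```

The tradeoff is that scalars written through the raw SummaryWriter bypass self.log, so they are not visible to callbacks such as ModelCheckpoint or to the progress bar.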

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @carmocca @edward-io

arsedler9 added the needs triage (Waiting to be triaged by maintainers) label on May 6, 2022
arsedler9 (Author) commented:

BTW, the bug_report_model.ipynb on Colab is a fantastic idea and well executed! It made it super easy for me to replicate this bug.

carmocca self-assigned this on May 10, 2022
carmocca added the progress tracking (internal), checkpointing, and logger: tensorboard labels and removed the needs triage label on May 10, 2022
carmocca added this to the 1.6.x milestone on May 10, 2022
rohitgr7 (Contributor) commented:

I think this is exactly the same issue as #12274.


arsedler9 (Author) commented May 10, 2022 via email

rohitgr7 (Contributor) commented:

It's different: the logging step used in the previous run is not restored, so it starts from 0 in the new run.

rohitgr7 (Contributor) commented:

Fixed here: #13467. Closing this.
