Resuming Trainer with ckpt_path disrupts logging by TensorBoardLogger #12991
Labels: checkpointing, logger: tensorboard, progress tracking (internal)
🐛 Bug
When I fit a model that uses a TensorBoardLogger and then save a checkpoint with Trainer.save_checkpoint, the TensorBoardLogger of the Trainer that resumes from this checkpoint loses the global step count and cannot log to the same event files. As a result, only the data logged by the second logger shows up in TensorBoard, with the global step count starting at zero.

To Reproduce
Run the Colab notebook I developed for the BoringModel. Note that the logged epochs in TensorBoard go from 10 to 20, but start at step zero.
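For reference, here is a minimal self-contained sketch of the same reproduction (the module, paths, and logger names below are illustrative, not the exact notebook code):

```python
import torch
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger


class RandomDataset(Dataset):
    """Random features, just enough to drive the training loop."""

    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)  # logged against trainer.global_step
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


train_loader = DataLoader(RandomDataset(), batch_size=8)

# First run: train for 10 epochs, then save a checkpoint manually.
trainer = pl.Trainer(max_epochs=10, logger=TensorBoardLogger("logs", name="resume_demo"))
trainer.fit(BoringModel(), train_loader)
trainer.save_checkpoint("resume_demo.ckpt")

# Second run: a fresh Trainer/logger resumes from that checkpoint.
trainer = pl.Trainer(max_epochs=20, logger=TensorBoardLogger("logs", name="resume_demo"))
trainer.fit(BoringModel(), train_loader, ckpt_path="resume_demo.ckpt")
# The resumed run writes to a new version_* directory, and its points start
# again at global step 0 instead of continuing from the first run's step count.
```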
Expected behavior

I would prefer the logger to resume logging from the global step count of the Trainer that saved the checkpoint, so that the complete training curves from both trainers are visible in TensorBoard.
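For what it's worth, a possible partial mitigation (a sketch, assuming both runs can share a fixed save_dir/name/version; not something the Colab notebook does) is to point the resumed Trainer's logger at the first run's directory, which at least keeps the event files together even though the step count still restarts:

```python
from pytorch_lightning.loggers import TensorBoardLogger

# Reuse the exact save_dir/name/version of the first run so the resumed Trainer
# writes its event files into the same TensorBoard run directory instead of a
# fresh version_* folder.
resumed_logger = TensorBoardLogger(save_dir="logs", name="resume_demo", version=0)

# TensorBoard then shows both event files as a single run, but the global step
# logged by the resumed Trainer still restarts at zero, which is the behaviour
# this issue asks to change.
```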
Environment

Additional context
I am using PyTorch Lightning to run Population-Based Training, so I am often saving and resuming models. I'd prefer to have continuous TensorBoard logs for each worker so I don't end up having to keep track of n_workers * n_generations different traces in TensorBoard.

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @carmocca @edward-io