Resuming training resets the logged step number #12274

Closed
eladsegal opened this issue Mar 9, 2022 · 7 comments · Fixed by #13467

eladsegal commented Mar 9, 2022

🐛 Bug

The change introduced in #11805 causes the logged step number to be reset when training is resumed from a checkpoint.
https://github.com/PyTorchLightning/pytorch-lightning/blob/49a4a36ad45b937dd0124ecfb08eb7400dbf3950/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L122

To Reproduce

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run(ckpt_path=None):
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=2,
        enable_model_summary=False,
        callbacks=ModelCheckpoint(dirpath="checkpoints", save_top_k=-1, filename="{epoch}", save_on_train_epoch_end=False),
        log_every_n_steps=1,
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data, ckpt_path=ckpt_path)


if __name__ == "__main__":
    run()  # first run: logs to version_0 and saves checkpoints/epoch=0.ckpt
    run("checkpoints/epoch=0.ckpt")  # resume from the epoch-0 checkpoint: logs to version_1

The script creates two TensorBoard log versions:

  • version_0: steps 0 to 63
  • version_1: steps 0 to 31

Expected behavior

  • version_1: steps 31 to 63

This was the behavior before #11805.
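
For anyone reproducing this, the logged step ranges can also be checked without the TensorBoard UI by reading the event files directly. A rough sketch, assuming the default lightning_logs/version_* layout under the working directory and that the tensorboard package is installed:

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

for version in ("version_0", "version_1"):
    acc = EventAccumulator(f"lightning_logs/{version}")
    acc.Reload()
    steps = [event.step for event in acc.Scalars("train_loss")]  # steps at which train_loss was logged
    print(version, "logged steps", min(steps), "to", max(steps))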

Environment

  • PyTorch Lightning Version: master (49a4a36)
  • Fault-tolerant training is off (PL_FAULT_TOLERANT_TRAINING=0)

cc @tchaton @rohitgr7 @akihironitta @awaelchli @ananthsub @ninginthecloud @carmocca

@ananthsub added the progress tracking (internal) and checkpointing labels on Mar 9, 2022
@carmocca self-assigned this on Mar 9, 2022
@carmocca added the priority: 0 (high priority task) label on Mar 9, 2022
@carmocca modified the milestones: 1.6 → 1.6.x on Mar 24, 2022

toriving commented Apr 18, 2022

Any progress on this issue?
Or does a workaround exist?

@ZENGYIMING-EAMON

Same bug here on PL 1.6.1. Any progress on this issue?

@ZENGYIMING-EAMON

Or is there a hack to work around it?

@rohitgr7

Just wondering: could you point _batches_that_stepped to global_step / number of optimizers?

@ZENGYIMING-EAMON

I don't know how to achieve this. What do you mean by global_step / number of optimizers, and why should _batches_that_stepped be pointed to it?


rbregier commented May 9, 2022

Hi, this workaround seems to work for my use case:

# Read the saved global step from the checkpoint and seed the logging counter
# with it before resuming (args.ckpt_path, experiment and datamodule come from
# the surrounding training script).
checkpoint = torch.load(args.ckpt_path, map_location='cpu')
global_step_offset = checkpoint["global_step"]
trainer.fit_loop.epoch_loop._batches_that_stepped = global_step_offset
del checkpoint
trainer.fit(experiment, datamodule=datamodule, ckpt_path=args.ckpt_path)
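
If loading the checkpoint a second time is undesirable, the same idea could be wrapped in a callback. This is only a sketch, not an official API: the RestoreLoggedStep name is made up here, it assumes a single optimizer (per the global_step / number of optimizers comment above), and it relies on the checkpoint state already being restored by the time on_train_start fires. It still touches the private _batches_that_stepped attribute:

from pytorch_lightning.callbacks import Callback


class RestoreLoggedStep(Callback):
    def on_train_start(self, trainer, pl_module):
        # Seed the logging counter from the restored global_step so resumed
        # runs keep counting from where the checkpoint left off.
        trainer.fit_loop.epoch_loop._batches_that_stepped = trainer.global_step

Passing RestoreLoggedStep() in the Trainer's callbacks keeps the logged step continuous on resume without a second torch.load, with the same caveats as the snippet above.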

@rohitgr7

cc: @carmocca wdyt?
