TensorBoardLogger and WandbLogger do not track global_step when resuming training from a checkpoint (both manually, and with fault tolerant) #13163
Comments
Any updates on this? I am currently working around this by explicitly logging the global_step from the module's attribute, e.g. https://github.com/mirandrom/lightning-transformer-pretraining/blob/72491177a13482b6b7e3e0e38f420c79e950c55a/ltp/hf_mlm/model.py#L124
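For anyone looking for the shape of that workaround: it boils down to logging the module's `global_step` attribute as an ordinary metric so it can be used as a custom x-axis in the dashboard. A minimal sketch, assuming a toy model (module and metric names are illustrative and not taken from the linked repo):

```python
import torch
import pytorch_lightning as pl


class WorkaroundModule(pl.LightningModule):
    """Illustrative module: log the trainer's global_step as a regular metric."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # Log global_step explicitly so W&B/TensorBoard can use it as a
        # custom x-axis, independent of the logger's internal step counter
        # (which restarts at 0 when training resumes from a checkpoint).
        self.log("trainer/global_step", float(self.global_step))
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```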
Hey guys! Engineer from W&B here! Sorry I'm a little late, but I managed to track this down to one line. The solution to this is to set it; I'm not entirely sure if this was intentional, but I've pushed the fix anyway. It has caused some tests to break, so I'm looking into those, but meanwhile this change should get things up and running.
Also adding
I am verifying this workaround and will let you know whether it's enough.
Duplicate of #12274
🐛 Bug
When resuming model training from a checkpoint, the `TensorBoardLogger` and `WandbLogger` log metrics as if `global_step` had been reset to 0 (although the `global_step` in the trainer and pl_module is accurate). This issue arises both when manually resuming training from a checkpoint using the `ckpt_path` arg in `Trainer.fit` and when doing fault-tolerant training as shown here: https://github.com/PyTorchLightning/pytorch-lightning/blob/1.6.3/pl_examples/fault_tolerant/automatic.py
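To make the manual-resume case concrete, here is a minimal sketch of the kind of setup where the mismatch shows up (the model, file names, and step counts are illustrative; this is not the adapted reproduction script):

```python
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger


class TinyModel(pl.LightningModule):
    # Minimal stand-in model, just enough to produce a loss to log.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


train_loader = DataLoader(torch.randn(512, 32), batch_size=8)

# First run: train briefly and save a checkpoint.
trainer = pl.Trainer(max_steps=100, logger=TensorBoardLogger("logs"))
trainer.fit(TinyModel(), train_dataloaders=train_loader)
trainer.save_checkpoint("last.ckpt")

# Second run: resume via ckpt_path. trainer.global_step is restored (it
# continues from 100), but the metrics written by the logger start again
# at step 0.
trainer = pl.Trainer(max_steps=200, logger=TensorBoardLogger("logs"))
trainer.fit(TinyModel(), train_dataloaders=train_loader, ckpt_path="last.ckpt")
```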
To Reproduce
I've adapted the fault-tolerant script linked above to test this, running v1.6.3 of pytorch-lightning.
With tensorboard, running these:

- `python automatic.py --use_tb` (without fault)
- `python automatic.py --use_tb --emulate_kill_signal` (with fault)
- `python automatic.py --use_tb --emulate_kill_signal` (resume from fault)

Results in the following, where the epoch is properly logged, but not the step:
With wandb, running these:

- `python automatic.py -e [wandb_entity] -p [wandb_project] -r no_fault` (without fault)
- `python automatic.py -e [wandb_entity] -p [wandb_project] -r fault --emulate_kill_signal` (with fault)
- `python automatic.py -e [wandb_entity] -p [wandb_project] -r fault --emulate_kill_signal` (resume from fault)

Results in the following, where the step is properly logged (because I'm only logging once per step, see #13016), but the global_step is reset.
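A simple way to confirm that the trainer itself has the correct step count, and that only the loggers' step axis restarts, is to print `trainer.global_step` during training. A minimal sketch of such a check (the callback is illustrative and not part of the reproduction script):

```python
import pytorch_lightning as pl


class PrintGlobalStep(pl.Callback):
    """Illustrative callback: print the trainer's global_step at each epoch end."""

    def on_train_epoch_end(self, trainer, pl_module):
        # After resuming, this prints the restored global_step (it keeps
        # increasing across runs), even though the step axis in
        # TensorBoard/W&B has restarted from 0.
        print(f"epoch {trainer.current_epoch}: trainer.global_step = {trainer.global_step}")
```

Adding `PrintGlobalStep()` to `Trainer(callbacks=[...])` in both the initial and the resumed run makes the discrepancy easy to see: the printed values keep increasing across the restart, while the plots in TensorBoard/W&B jump back to step 0.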
Expected behavior
The `trainer/global_step` in `WandbLogger` and `step` in `TensorBoardLogger` should properly reflect the `global_step` state of the trainer/pl_module when resuming from checkpoints (either manually or automatically with fault-tolerant training).

Environment