LearningRateMonitor callback causes unexpected changes in step/epoch count with WandBLogger #13016
Comments
I would like to work on it.
@rbracco Do you see the issue with other loggers in addition to the wandb logger? Would you mind trying the default TensorBoardLogger and seeing if the issue is specific to the logger? Also, it would be great and very helpful if you could share a script for reproduction.
Hi @rbracco! I'm an engineer from W&B. The likely reason for this is that pytorch-lightning has its own step counter and wandb has its own step counter as well. The wandb step counter is incremented whenever wandb.log is called, which can happen multiple times within the same Lightning step. For example:
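(The exact snippet from this comment was not preserved in the thread; the following is a minimal sketch of the pattern being described, with illustrative metric names and values.)

```python
import wandb

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)          # illustrative helper, not part of the original
    # Two separate wandb.log calls inside one Lightning training step:
    wandb.log({"train/loss": loss.item()})   # wandb step -> N
    wandb.log({"learning_rate": 1e-3})       # wandb step -> N + 1
    return loss
```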
This is one training step, but it would increment the wandb step twice.
Hey @rbracco, if what @manangoel99 described is the case, you can change the x-axis to "trainer step" in the UI (top right), either globally or per plot.
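If switching the axis in the UI each time is inconvenient, the same default can, as far as I know, also be set programmatically with wandb's define_metric, assuming the Lightning WandbLogger records the trainer step under the key "trainer/global_step":

```python
import wandb
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="my-project")  # illustrative project name
# Use Lightning's step counter as the default x-axis for all logged metrics.
wandb_logger.experiment.define_metric("*", step_metric="trainer/global_step")
```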
@rbracco Please let me know if the suggested solution works!
Thank you @awaelchli and @manangoel99, that worked great. Here is the updated chart once I changed the x-axis to trainer step. Closing!
yay :) Happy logging! |
🐛 Bug
Using the LearningRateMonitor callback breaks wandb logging by causing the step count to become incorrect. The image below shows varying epoch/step counts while overfitting batches with no LR monitor, LearningRateMonitor(logging_interval="epoch"), and LearningRateMonitor(logging_interval=None):

- Neat-bee-446 does not use the LearningRateMonitor callback; the ratio of step#:epoch# is 1:1.
- Devout-forest-447 adds LearningRateMonitor(logging_interval="epoch") as a callback; the ratio of step#:epoch# becomes 2:1.
- Woven-dew-448 uses the callback LearningRateMonitor(); the ratio of step#:epoch# becomes 3:1.

When not overfitting a batch, LearningRateMonitor() logs the correct number of steps, but LearningRateMonitor(logging_interval="epoch") and LearningRateMonitor(logging_interval="step") still log double what they should. Also, this doesn't occur with TensorBoard, only wandb.
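A minimal reproduction sketch (hypothetical module, data, and project name; any small LightningModule with a scheduler should behave the same way):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import WandbLogger


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [scheduler]


dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
trainer = pl.Trainer(
    max_epochs=5,
    logger=WandbLogger(project="lr-monitor-repro"),  # illustrative project name
    callbacks=[LearningRateMonitor(logging_interval="epoch")],
)
trainer.fit(TinyModel(), DataLoader(dataset, batch_size=16))
# Compare the default wandb "Step" axis with trainer/global_step: with the
# callback enabled, the wandb step advances faster than the Lightning step.
```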
Expected behavior
The logged step count should be correct and not adversely impacted by adding the LRMonitor callback.
Environment
Additional context
cc @awaelchli @morganmcg1 @AyushExel @borisdayma @scottire @manangoel99 @rohitgr7