LearningRateMonitor callback causes unexpected changes in step/epoch count with WandBLogger #13016
Comments
I would like to work on it.
@rbracco Do you see the issue with other loggers in addition to the wandb logger? Would you mind trying the default TensorBoardLogger and seeing if the issue is specific to the logger? Also, it would be great and very helpful if you could share a script for reproduction.
Hi @rbracco! I'm an engineer from W&B. The likely reason for this is that pytorch-lightning has its own step counter and wandb has its own step counter as well. The wandb step counter is incremented whenever wandb.log is called, which can happen multiple times within the same Lightning step. For example:
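(The exact snippet from this comment was not preserved in the thread; the following is a minimal sketch of the pattern being described, with illustrative metric names and values.)

```python
import wandb

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)          # illustrative helper, not part of the original
    # Two separate wandb.log calls inside one Lightning training step:
    wandb.log({"train/loss": loss.item()})   # wandb step -> N
    wandb.log({"learning_rate": 1e-3})       # wandb step -> N + 1
    return loss
```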
This is one training step, but it would increment the wandb step twice.
Hey @rbracco, if what @manangoel99 described is the case, you can change the x-axis to "trainer step" in the UI (top right), either globally or per plot.
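If switching the axis in the UI each time is inconvenient, the same default can, as far as I know, also be set programmatically with wandb's define_metric, assuming the Lightning WandbLogger records the trainer step under the key "trainer/global_step":

```python
import wandb
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="my-project")  # illustrative project name
# Use Lightning's step counter as the default x-axis for all logged metrics.
wandb_logger.experiment.define_metric("*", step_metric="trainer/global_step")
```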
@rbracco Please let me know if the suggested solution works!
Thank you @awaelchli and @manangoel99, that worked great. Here is the updated chart once I changed the x-axis to trainer step. Closing!
yay :) Happy logging! |
🐛 Bug
Using the LearningRateMonitor callback breaks wandb logging by causing the step count to become incorrect. The image below shows varying epoch/step counts while overfitting batches with no LR monitor, LearningRateMonitor(logging_interval="epoch"), and LearningRateMonitor(logging_interval=None):

- Neat-bee-446 does not use the LearningRateMonitor callback; the ratio of step#:epoch# is 1:1.
- Devout-forest-447 adds LearningRateMonitor(logging_interval="epoch") as a callback; the ratio of step#:epoch# becomes 2:1.
- Woven-dew-448 uses the callback LearningRateMonitor(); the ratio of step#:epoch# becomes 3:1.

When not overfitting a batch, LearningRateMonitor() logs the correct number of steps, but LearningRateMonitor(logging_interval="epoch") and LearningRateMonitor(logging_interval="step") still log double what they should. Also, this doesn't occur with TensorBoard, only wandb.
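A minimal reproduction sketch (hypothetical module, data, and project name; any small LightningModule with a scheduler should behave the same way):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import WandbLogger


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [scheduler]


dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
trainer = pl.Trainer(
    max_epochs=5,
    logger=WandbLogger(project="lr-monitor-repro"),  # illustrative project name
    callbacks=[LearningRateMonitor(logging_interval="epoch")],
)
trainer.fit(TinyModel(), DataLoader(dataset, batch_size=16))
# Compare the default wandb "Step" axis with trainer/global_step: with the
# callback enabled, the wandb step advances faster than the Lightning step.
```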
Expected behavior
The logged step count should be correct and not adversely impacted by adding the LRMonitor callback.
Environment
Additional context
cc @awaelchli @morganmcg1 @AyushExel @borisdayma @scottire @manangoel99 @rohitgr7