Step Inconsistencies #13752
Edit: This was fixed in the recent release.
Will look into the tqdm issue.
Thanks, and re
What do you mean? This behavior was changed recently, and it now reflects the total number of optimization steps. The two will be the same in most common cases, i.e. when you are using a single optimizer.
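To make the multi-optimizer case concrete, here is a minimal sketch (assuming a Lightning 1.6-style API with automatic optimization and two optimizers; the module, losses, and data are toy placeholders, not code from this issue) in which `global_step` advances twice per training batch:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TwoOptimizerModule(pl.LightningModule):
    """Toy module with two optimizers, so global_step grows by 2 per batch."""

    def __init__(self):
        super().__init__()
        self.gen = nn.Linear(4, 4)
        self.disc = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx, optimizer_idx):
        (x,) = batch
        if optimizer_idx == 0:
            loss = self.gen(x).pow(2).mean()       # placeholder "generator" loss
        else:
            loss = self.disc(x).pow(2).mean()      # placeholder "discriminator" loss
            # Both optimizers step once per batch, so global_step ends up
            # roughly 2 * number_of_batches rather than number_of_batches.
            print(f"batch_idx={batch_idx} global_step={self.global_step}")
        return loss

    def configure_optimizers(self):
        return (
            torch.optim.SGD(self.gen.parameters(), lr=0.1),
            torch.optim.SGD(self.disc.parameters(), lr=0.1),
        )


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(16, 4)), batch_size=4)
    trainer = pl.Trainer(max_epochs=1, logger=False, enable_checkpointing=False)
    trainer.fit(TwoOptimizerModule(), data)
```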
IMO this is super unintuitive; global_step should not be different from the step number being logged. Can I read somewhere about the motivation for that? If I wanted to get the total number of optimization steps, shouldn't that be under
cc @carmocca
Another complication is how this recent change interacts with max_steps, num_sanity_val_steps, etc.: does "step" refer to the optimization step or to a batch step? I would really love it if we disentangled these two definitions for clarity, e.g. by calling the commonly understood batch step a "step" and using something like "total_optimizer_step" for the alternative definition. Currently it is super unclear which one is being used. It is like multiplying the epoch number by the number of optimizers for some reason... I can see some reasons for why
Hi! The step name is unfortunately ambiguous and overloaded; however, it has been here since the beginning of the project, so it's hard to let go. With the 1.6 release we had to make some changes to their implementations. After that,
I agree. My opinion here is to avoid "step" for what's really a batch and keep "step" for anything
Internally we use a definition like this. But as I mentioned, Similar attributes are You would also have the
Thank you, I guess there is still some cleanup to do after 1.6.
+1 on this. Currently, if and the metric is logged here: Now, I found some inconsistencies:
1. (and thus counts all optimizer steps)
2. but is not.
Would it be possible to fix them? @carmocca
The docstring you point out in (2) is outdated. Would you like to send a PR updating it?
Just adding to this regarding GAN training. Many GANs train the Discriminator n times (e.g. 5 times) for every Generator optimization step. This further messes up
@gau-nernst this is actually expected behavior. As @carmocca pointed out, the global step is the sum of all optimizer steps. If you want it to be equal to the number of batches, you should only update one optimizer per batch. I.e. instead of something like

```python
if batch_idx % 5 == 0:
    # optim 1 update
# optim 2 update
```

you do something like

```python
if batch_idx % 6 == 0:
    # optim 1 update
else:
    # optim 2 update
```
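A concrete version of that alternating pattern, as a hedged sketch using Lightning's manual optimization (the linear layers and squared-output losses are toy placeholders standing in for real GAN components): exactly one optimizer steps per batch, so `global_step` stays equal to the number of training batches.

```python
import torch
from torch import nn
import pytorch_lightning as pl


class AlternatingGAN(pl.LightningModule):
    """Sketch: one optimizer step per batch, so global_step == number of batches."""

    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # we step the optimizers ourselves
        self.gen = nn.Linear(4, 4)
        self.disc = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()
        (x,) = batch
        if batch_idx % 6 == 0:
            # Generator update on every 6th batch.
            loss = self.gen(x).pow(2).mean()    # placeholder generator loss
            opt = opt_g
        else:
            # Discriminator update on the remaining 5 out of 6 batches.
            loss = self.disc(x).pow(2).mean()   # placeholder discriminator loss
            opt = opt_d
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()
        self.log("train_loss", loss, prog_bar=True)

    def configure_optimizers(self):
        return (
            torch.optim.SGD(self.gen.parameters(), lr=0.1),
            torch.optim.SGD(self.disc.parameters(), lr=0.1),
        )
```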
I came across the same problem: I am now using my own optimizer, but the global step doesn't increase. Once I use my optimizer, only optimizers().step will change; however, when I use optimizers().step, the optimizer isn't correct.
This is Lightning's design choice; they probably won't fix it. Just write your own training script with either Lightning Fabric or HF Accelerate.
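For what it's worth, here is a minimal sketch of what that looks like with Lightning Fabric (assuming a lightning >= 2.0 install where Fabric ships as `lightning.fabric`; the model and data are toy placeholders). The point is that the step counter is a plain variable you own, so there is no ambiguity about what a "step" means:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cpu")
fabric.launch()

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = fabric.setup(model, optimizer)

loader = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)
loader = fabric.setup_dataloaders(loader)

global_step = 0  # you own this counter: one increment per batch
for x, y in loader:
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    fabric.backward(loss)
    optimizer.step()
    global_step += 1
```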
Since all of the design differences have been acknowledged already and changing them would require annoying breaking changes, I'll go ahead and close this. @yuchenlichuck I didn't understand your issue, but note that
🐛 Bug
The "step" definition is super unclear and inconsistent:
global_step
, it increments by the number of optimizers. Eg. for GAN it is actually2*number_of_batches
.step
as used in learning rate scheduler wheninterval="step"
- number of training batches used.step
as used in logging -_batches_that_stepped
, no idea what this is TBH? This cases issues when restoring and logging to eg wandb, the metrics are logged from step 0, rather than resumed. I need to callself.log("step", self.global_step)
to fix wandb logging after resume.step
asmax_steps
in trainer, for thisglobal_step
seems to be used.This is super convoluted to me, why can't 'step' always be simply a number of dataset iterations?
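A rough illustration of that workaround (a sketch only; `compute_loss` is a hypothetical placeholder, and whether a logger treats a metric literally named "step" as its x-axis depends on the Lightning version):

```python
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper
    self.log("train_loss", loss)
    # Log the resumed global_step under the key "step" so the logger's
    # x-axis continues from the checkpoint instead of restarting at 0.
    self.log("step", float(self.global_step))
    return loss
```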
Also, when restoring the training, I get negative values; this is also reproduced in Colab:
To Reproduce
https://colab.research.google.com/drive/1PkMF3rOZrPU8r2BqQplfb08U8lV17Y45#scrollTo=AlOOcWzT1yAu
Notice the inconsistent steps during the first training run.
And then completely messed-up steps in the `resume_from_checkpoint` run: negative iteration speed, and an incorrect `_batches_that_stepped` that is not being restored correctly.

Expected behavior
Steps are consistent and restored properly (`_batches_that_stepped` used with wandb is not). The validation step and multiple optimizers complicate the definition of step, but whatever definition of `step` you come up with should be consistent.

Negative iteration speed and ETA after `resume_from_checkpoint` are fixed.

Thanks!
Environment
#54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022
cc @tchaton @justusschock @awaelchli @Borda @carmocca @rohitgr7