
Step Inconsistencies #13752

Closed
PiotrDabkowski opened this issue Jul 20, 2022 · 18 comments
Labels
design (Includes a design discussion) · progress tracking (internal) (Related to the progress tracking dataclasses) · question (Further information is requested)

Comments

@PiotrDabkowski

PiotrDabkowski commented Jul 20, 2022

🐛 Bug

The "step" definition is super unclear and inconsistent:

  • global_step: it increments by the number of optimizers per batch, e.g. for a GAN it is actually 2 * number_of_batches.
  • step as used by the learning rate scheduler when interval="step": the number of training batches used.
  • step as used in logging: _batches_that_stepped, no idea what this is TBH. This causes issues when restoring and logging to e.g. wandb: the metrics are logged from step 0 rather than resumed. I need to call self.log("step", self.global_step) to fix wandb logging after resume (see the sketch below).
  • step as in the trainer's max_steps, for which global_step seems to be used.

This is super convoluted to me; why can't 'step' always simply be the number of dataset iterations?
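For reference, a minimal sketch of the wandb workaround mentioned in the list above (not an official recommendation; compute_loss is a placeholder, and logging a metric literally named "step" is what makes the logger use it as the x-axis value):

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)            # placeholder loss computation
        self.log("train/loss", loss)
        self.log("step", float(self.global_step))  # realign the logger x-axis after resume
        return loss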

Also, when restoring the training, I get negative values; this is also reproduced in Colab:
[screenshot: Colab output showing negative step values after restoring training]

To Reproduce

https://colab.research.google.com/drive/1PkMF3rOZrPU8r2BqQplfb08U8lV17Y45#scrollTo=AlOOcWzT1yAu

Notice the inconsistent steps during the first training run, and then the completely messed-up steps in the resume_from_checkpoint run: negative iteration speed, and an incorrect _batches_that_stepped that is not restored correctly.
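The linked Colab has the full code; roughly, the reproduction has the following shape (MyModule and the checkpoint path are placeholders, and resume_from_checkpoint is the 1.6-era Trainer argument used here, later replaced by ckpt_path in Trainer.fit):

import pytorch_lightning as pl

model = MyModule()                          # placeholder LightningModule
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)                          # first run: inconsistent step counts

trainer = pl.Trainer(max_epochs=4,
                     resume_from_checkpoint="path/to/last.ckpt")  # placeholder path
trainer.fit(model)                          # resumed run: negative it/s and ETA in tqdm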

[screenshot: tqdm progress bar showing negative iteration speed and ETA after resume]

Expected behavior

Steps are consistent and restored properly (_batches_that_stepped, used with wandb, is not). The validation step and multiple optimizers complicate the definition of step, but whatever definition of step you come up with should be consistent.
Negative iteration speed and ETA after resume_from_checkpoint are fixed.

Thanks!

Environment

  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.23.0
    • pyTorch_debug: False
    • pyTorch_version: 1.11.0+cu113
    • pytorch-lightning: 1.6.4
    • tqdm: 4.64.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.10
    • version: #54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022

cc @tchaton @justusschock @awaelchli @Borda @carmocca @rohitgr7

@PiotrDabkowski PiotrDabkowski added the needs triage Waiting to be triaged by maintainers label Jul 20, 2022
@PiotrDabkowski
Author

Edit: _batches_that_stepped restoration seems to have been fixed in 1.6.5; in 1.6.4 it was still resuming from 0.

@rohitgr7
Contributor

This causes issues when restoring and logging to e.g. wandb: the metrics are logged from step 0 rather than resumed.

This was fixed in the recent release.

@rohitgr7
Contributor

will look into tqdm issue

@rohitgr7 rohitgr7 self-assigned this Jul 20, 2022
@rohitgr7 rohitgr7 added bug Something isn't working progress tracking (internal) Related to the progress tracking dataclasses and removed needs triage Waiting to be triaged by maintainers labels Jul 20, 2022
@PiotrDabkowski
Author

Thanks, and re global_step, why would it be inconsistent with _batches_that_stepped?

@rohitgr7
Contributor

rohitgr7 commented Jul 20, 2022

Thanks, and re global_step, why would it be inconsistent with _batches_that_stepped?

What do you mean? This behavior was changed recently and global_step now reflects the total number of optimization steps.

They will be the same in most general cases when you are using a single optimizer.

@PiotrDabkowski
Author

IMO this is super unintuitive; global_step should not be different from the step number being logged. Can I read about the motivation for that somewhere?

If I wanted to get the total number of optimization steps, shouldn't that be under a total_optimization_steps attribute?

@rohitgr7
Contributor

cc @carmocca

@PiotrDabkowski
Author

PiotrDabkowski commented Jul 20, 2022

Another complication is how this recent change interacts with max_steps, num_sanity_val_steps, etc.: is "step" referring to the optimization step or to a batch step? I would really love it if we disentangled these two definitions for clarity, e.g. by calling the commonly understood batch step a "step" and using e.g. "total_optimizer_step" for the alternative definition. Currently it is super unclear which one is being used. It is like multiplying the epoch number by the number of optimizers for some reason...

I can see some reasons why the global_step definition was changed, but as a user I know it will be super unclear for anybody encountering it. Suddenly, when switching to a GAN or multiple optimizers, the behavior of global_step will change and people will need to debug why, because it is super unintuitive.

@PiotrDabkowski PiotrDabkowski changed the title Step Mess Step Inconsistencies Jul 20, 2022
@carmocca
Contributor

Hi!

The step name is unfortunately ambiguous and overloaded; however, it's been here since the beginning of the project, so it's hard to let go.

With the 1.6 release we had to make some changes to these implementations. After that, global_step refers to the sum of all optimizer.step() calls: https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/trainer/trainer.py#L2538-L2544

max_steps compares itself with global_step (https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/loops/fit_loop.py#L154), whereas num_sanity_val_steps actually refers to validation batches, so it doesn't use global_step.
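To make the arithmetic concrete, a small sketch under that definition (GANModule is a hypothetical LightningModule with two optimizers that both step on every batch; automatic optimization and no gradient accumulation assumed):

import pytorch_lightning as pl

# max_steps counts optimizer.step() calls, not batches. With two optimizers
# stepping on every batch, global_step advances by 2 per batch, so this run
# stops after roughly 5_000 batches rather than 10_000.
trainer = pl.Trainer(max_steps=10_000)
trainer.fit(GANModule())  # hypothetical two-optimizer module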

I would really love if we disentangled these 2 definitions for clarity

I agree. My opinion here is to avoid "step" for what's really a batch and keep "step" for anything optimizer.step related. Note, however, that the global_ prefix is still ambiguous, as it does not account for a global count when running in a distributed environment.

and use eg "total_optimizer_step" for the alternative definition

Internally we use a definition like this. But as I mentioned, global_step has been the public name since the beginning: https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/loops/epoch/training_epoch_loop.py#L85-L90

A similar attribute is _batches_that_stepped, which counts the number of batches processed where we (at least tried to) step, accounting for gradient accumulation. This attribute is protected at the moment but could be exposed if people find it useful. https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/loops/epoch/training_epoch_loop.py#L247-L249

You also have total_batch_idx in case you want the number of batches ever fetched: https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/loops/epoch/training_epoch_loop.py#L71-L76
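To see the three counters side by side, a minimal callback sketch (the epoch-loop attributes below come from the links above and are internal, not public API, so they may change between versions):

import pytorch_lightning as pl

class StepCountersCallback(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        epoch_loop = trainer.fit_loop.epoch_loop
        print(
            f"global_step={trainer.global_step} "                        # optimizer.step() calls
            f"batches_that_stepped={epoch_loop._batches_that_stepped} "  # batches that stepped
            f"total_batch_idx={epoch_loop.total_batch_idx}"              # batches ever fetched
        )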

@carmocca carmocca added question Further information is requested design Includes a design discussion and removed bug Something isn't working labels Jul 20, 2022
@PiotrDabkowski
Author

PiotrDabkowski commented Jul 23, 2022

Thank you, total_batch_idx is what I will be using instead of global_step then; does it account for gradient accumulation as well? The thing that is also super confusing is the change in semantics of global_step since 1.6. IMO this should not have happened and the "fault tolerance" could have been achieved otherwise, but OK. There is still trainer/global_step on the wandb X axis, for example, which adds to the confusion; it should have been trainer/total_batch_idx:

[screenshot: W&B charts using trainer/global_step as the x-axis]

https://github.com/Lightning-AI/lightning/blob/9f51c07604f52d2e7601471b2550f64dad43aaa4/src/pytorch_lightning/loggers/wandb.py#L365

I guess there is still some cleanup to do after the 1.6 global_step definition change.

@carmocca
Contributor

carmocca commented Jul 23, 2022

does it account for gradient acc as well?

total_batch_idx does not; _batches_that_stepped does.
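A rough illustration of that difference, assuming a single optimizer and ignoring off-by-one details:

import pytorch_lightning as pl

# With gradient accumulation, only every 4th batch triggers optimizer.step().
# Roughly, after N training batches:
#   total_batch_idx        ~ N        counts every fetched batch
#   _batches_that_stepped  ~ N / 4    counts only the batches where a step happened
#   global_step            ~ N / 4    one per optimizer.step() call
trainer = pl.Trainer(accumulate_grad_batches=4)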

@andreapesare

There is still trainer/global_step on the wandb X axis, for example, which adds to the confusion; it should have been trainer/total_batch_idx

+1 on this. Currently, if self.log("step", ...) is not called, the default value for the trainer/global_step metric on W&B is given by this:

https://github.com/Lightning-AI/lightning/blob/759e89df21684277da996dd25781dee7f21e1b1c/src/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L99-L102

and the metric is logged here:

https://github.com/Lightning-AI/lightning/blob/6df6deac8fdf0c44b509d531441b4f7e446de6b4/src/pytorch_lightning/loggers/wandb.py#L418

Now, I found some inconsistencies:

  1. The metric is called trainer/global_step but is actually different from the trainer.global_step property, which is defined as

https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/trainer/trainer.py#L2538-L2544

(and thus counts all optimizer steps)

  2. In the docstring, it's written that the default value for the metric (during training) is self.global_step (which returns self.trainer.global_step):

https://github.com/Lightning-AI/lightning/blob/759e89df21684277da996dd25781dee7f21e1b1c/src/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L85-L86

but it is not.

Would it be possible to fix them? @carmocca

@carmocca
Contributor

The docstring you point out in (2) is outdated. Would you like to send a PR updating it?

@Borda Borda self-assigned this Nov 7, 2022
@gau-nernst
Contributor

Just adding to this about GAN training: many GANs train the discriminator n times (e.g. 5 times) for every generator optimization step. This further messes up global_step in Lightning, so global_step will not even be a multiple of the number of batches anymore.

@justusschock
Member

@gau-nernst this is actually expected behavior. As @carmocca pointed out, the global step is the sum of all optimizer steps. If you want it to be equal to the number of batches, update only one optimizer per batch.

I.e. instead of something like

if batch_idx % 5 == 0:
   # optim1 update (steps only on every 5th batch)

# optim 2 update (steps on every batch, so those batches step twice)

you do something like

if batch_idx % 6 == 0:
    # optim 1 update
else:
    # optim 2 update
# exactly one optimizer steps per batch, so global_step matches the batch count
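A fuller manual-optimization sketch of the second pattern (generator_loss and discriminator_loss are placeholder methods, and configure_optimizers is omitted):

import pytorch_lightning as pl

class GAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False        # we call optimizer.step() ourselves

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()
        if batch_idx % 6 == 0:
            loss = self.generator_loss(batch)      # placeholder
            opt = opt_g
        else:
            loss = self.discriminator_loss(batch)  # placeholder
            opt = opt_d
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()  # exactly one optimizer.step() per batch, so
                    # trainer.global_step stays equal to the batch count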

@yuchenlichuck

I came across the same problem: I am now using my own optimizer, but the global step doesn't increase. It only changes when I call optimizers().step(); however, when I use optimizers().step(), the optimizer isn't correct.

@gau-nernst
Contributor

This is Lightning's design choice; they probably won't fix it. Just write your own training script with either Lightning Fabric or HF Accelerate.

@carmocca
Contributor

Since all of the design differences have been acknowledged already and changing them would require annoying breaking changes, I'll go ahead and close this.

@yuchenlichuck I didn't understand your issue, but note that optimizer.step() needs to be called for the global_step to increase. Otherwise you'll run into #16143

@carmocca carmocca closed this as not planned on Sep 25, 2023