
Step Inconsistencies #13752

Closed
PiotrDabkowski opened this issue Jul 20, 2022 · 18 comments
Labels
design (Includes a design discussion) · progress tracking (internal) (Related to the progress tracking dataclasses) · question (Further information is requested)

Comments

@PiotrDabkowski

PiotrDabkowski commented Jul 20, 2022

🐛 Bug

The "step" definition is super unclear and inconsistent:

  • global_step: it increments by the number of optimizers per batch, e.g. for a GAN it is actually 2 * number_of_batches.
  • step as used by the learning rate scheduler when interval="step": the number of training batches used.
  • step as used in logging: _batches_that_stepped, no idea what this is TBH. This causes issues when restoring and logging to e.g. wandb: the metrics are logged from step 0 rather than resumed. I need to call self.log("step", self.global_step) to fix wandb logging after resume (see the sketch below).
  • step as in the trainer's max_steps, for which global_step seems to be used.

This is super convoluted to me; why can't 'step' always simply be the number of dataset iterations?
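For reference, a minimal sketch of the wandb workaround mentioned in the list above (not an official recommendation; compute_loss is a placeholder, and logging a metric literally named "step" is what makes the logger use it as the x-axis value):

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)            # placeholder loss computation
        self.log("train/loss", loss)
        self.log("step", float(self.global_step))  # realign the logger x-axis after resume
        return loss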

Also, when restoring the training, I get negative values; this is also reproduced in Colab:
[screenshot: Colab output showing negative step values after restoring training]

To Reproduce

https://colab.research.google.com/drive/1PkMF3rOZrPU8r2BqQplfb08U8lV17Y45#scrollTo=AlOOcWzT1yAu

Notice the inconsistent steps during the first training run, and then the completely messed-up steps in the resume_from_checkpoint run: negative iteration speed, and an incorrect _batches_that_stepped that is not restored correctly.
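The linked Colab has the full code; roughly, the reproduction has the following shape (MyModule and the checkpoint path are placeholders, and resume_from_checkpoint is the 1.6-era Trainer argument used here, later replaced by ckpt_path in Trainer.fit):

import pytorch_lightning as pl

model = MyModule()                          # placeholder LightningModule
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)                          # first run: inconsistent step counts

trainer = pl.Trainer(max_epochs=4,
                     resume_from_checkpoint="path/to/last.ckpt")  # placeholder path
trainer.fit(model)                          # resumed run: negative it/s and ETA in tqdm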

[screenshot: tqdm progress bar showing negative iteration speed and ETA after resume]

Expected behavior

Steps are consistent and restored properly (_batches_that_stepped, used with wandb, is not). The validation step and multiple optimizers complicate the definition of step, but whatever definition of step you come up with should be consistent.
Negative iteration speed and ETA after resume_from_checkpoint are fixed.

Thanks!

Environment

  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.23.0
    • pyTorch_debug: False
    • pyTorch_version: 1.11.0+cu113
    • pytorch-lightning: 1.6.4
    • tqdm: 4.64.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.10
    • version: #54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022

cc @tchaton @justusschock @awaelchli @Borda @carmocca @rohitgr7

@PiotrDabkowski PiotrDabkowski added the needs triage Waiting to be triaged by maintainers label Jul 20, 2022
@PiotrDabkowski
Author

Edit: _batches_that_stepped restoration seems to have been fixed in 1.6.5; in 1.6.4 it was still resuming from 0.

@rohitgr7
Contributor

This causes issues when restoring and logging to e.g. wandb: the metrics are logged from step 0 rather than resumed.

This was fixed in the recent release.

@rohitgr7
Contributor

will look into tqdm issue

@rohitgr7 rohitgr7 self-assigned this Jul 20, 2022
@rohitgr7 rohitgr7 added bug Something isn't working progress tracking (internal) Related to the progress tracking dataclasses and removed needs triage Waiting to be triaged by maintainers labels Jul 20, 2022
@PiotrDabkowski
Author

Thanks, and re global_step, why would it be inconsistent with _batches_that_stepped?

@rohitgr7
Contributor

rohitgr7 commented Jul 20, 2022

Thanks, and re global_step, why would it be inconsistent with _batches_that_stepped?

What do you mean? This behavior was changed recently and global_step now reflects the total number of optimization steps.

They will be the same in most general cases when you are using a single optimizer.

@PiotrDabkowski
Author

IMO this is super unintuitive; global_step should not be different from the step number being logged. Can I read about the motivation for that somewhere?

If I wanted to get the total number of optimization steps, shouldn't that be under a total_optimization_steps attribute?

@rohitgr7
Contributor

cc @carmocca

@PiotrDabkowski
Author

PiotrDabkowski commented Jul 20, 2022

Another complication is how this recent change interacts with max_steps, num_sanity_val_steps, etc.: is "step" referring to the optimization step or to a batch step? I would really love it if we disentangled these two definitions for clarity, e.g. by calling the commonly understood batch step a "step" and using e.g. "total_optimizer_step" for the alternative definition. Currently it is super unclear which one is being used. It is like multiplying the epoch number by the number of optimizers for some reason...

I can see some reasons why the global_step definition was changed, but as a user I know it will be super unclear for anybody encountering it. Suddenly, when switching to a GAN or multiple optimizers, the behavior of global_step will change and people will need to debug why, because it is super unintuitive.

@PiotrDabkowski PiotrDabkowski changed the title Step Mess Step Inconsistencies Jul 20, 2022
@carmocca
Contributor

Hi!

The step name is unfortunately ambiguous and overloaded; however, it's been here since the beginning of the project, so it's hard to let go.

With the 1.6 release we had to make some changes to these implementations. After that, global_step refers to the sum of all optimizer.step() calls: https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/trainer/trainer.py#L2538-L2544

max_steps compares itself with global_step (https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/loops/fit_loop.py#L154), whereas num_sanity_val_steps actually refers to validation batches, so it doesn't use global_step.
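To make the arithmetic concrete, a small sketch under that definition (GANModule is a hypothetical LightningModule with two optimizers that both step on every batch; automatic optimization and no gradient accumulation assumed):

import pytorch_lightning as pl

# max_steps counts optimizer.step() calls, not batches. With two optimizers
# stepping on every batch, global_step advances by 2 per batch, so this run
# stops after roughly 5_000 batches rather than 10_000.
trainer = pl.Trainer(max_steps=10_000)
trainer.fit(GANModule())  # hypothetical two-optimizer module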

I would really love if we disentangled these 2 definitions for clarity

I agree. My opinion here is to avoid "step" for what's really a batch and keep "step" for anything optimizer.step related. Note, however, that the global_ prefix is still ambiguous, as it does not account for a global count when running in a distributed environment.

and use eg "total_optimizer_step" for the alternative definition

Internally we use a definition like this. But as I mentioned, global_step has been the public name since the beginning: https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/loops/epoch/training_epoch_loop.py#L85-L90

A similar attribute is _batches_that_stepped, which counts the number of batches processed where we (at least tried to) step, accounting for gradient accumulation. This attribute is protected at the moment but could be exposed if people find it useful. https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/loops/epoch/training_epoch_loop.py#L247-L249

You also have total_batch_idx in case you want the number of batches ever fetched: https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/loops/epoch/training_epoch_loop.py#L71-L76
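To see the three counters side by side, a minimal callback sketch (the epoch-loop attributes below come from the links above and are internal, not public API, so they may change between versions):

import pytorch_lightning as pl

class StepCountersCallback(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        epoch_loop = trainer.fit_loop.epoch_loop
        print(
            f"global_step={trainer.global_step} "                        # optimizer.step() calls
            f"batches_that_stepped={epoch_loop._batches_that_stepped} "  # batches that stepped
            f"total_batch_idx={epoch_loop.total_batch_idx}"              # batches ever fetched
        )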

@carmocca carmocca added question Further information is requested design Includes a design discussion and removed bug Something isn't working labels Jul 20, 2022
@PiotrDabkowski
Author

PiotrDabkowski commented Jul 23, 2022

Thank you, total_batch_idx is what I will be using instead of global_step then; does it account for gradient accumulation as well? The thing that is also super confusing is the change in semantics of global_step since 1.6. IMO this should not have happened and the "fault tolerance" could have been achieved otherwise, but OK. There is still trainer/global_step on the wandb X axis, for example, which adds to the confusion; it should have been trainer/total_batch_idx:

[screenshot: W&B charts using trainer/global_step as the x-axis]

https://github.com/Lightning-AI/lightning/blob/9f51c07604f52d2e7601471b2550f64dad43aaa4/src/pytorch_lightning/loggers/wandb.py#L365

I guess there is still some cleanup to do after the 1.6 global_step definition change.

@carmocca
Contributor

carmocca commented Jul 23, 2022

does it account for gradient acc as well?

total_batch_idx does not; _batches_that_stepped does.
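A rough illustration of that difference, assuming a single optimizer and ignoring off-by-one details:

import pytorch_lightning as pl

# With gradient accumulation, only every 4th batch triggers optimizer.step().
# Roughly, after N training batches:
#   total_batch_idx        ~ N        counts every fetched batch
#   _batches_that_stepped  ~ N / 4    counts only the batches where a step happened
#   global_step            ~ N / 4    one per optimizer.step() call
trainer = pl.Trainer(accumulate_grad_batches=4)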

@andreapesare

There is still trainer/global_step on the wandb X axis, for example, which adds to the confusion; it should have been trainer/total_batch_idx

+1 on this. Currently, if self.log("step", ...) is not called, the default value for the trainer/global_step metric on W&B is given by this:

https://github.com/Lightning-AI/lightning/blob/759e89df21684277da996dd25781dee7f21e1b1c/src/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L99-L102

and the metric is logged here:

https://github.com/Lightning-AI/lightning/blob/6df6deac8fdf0c44b509d531441b4f7e446de6b4/src/pytorch_lightning/loggers/wandb.py#L418

Now, I found some inconsistencies:

  1. The metric is called trainer/global_step but is actually different from the trainer.global_step property, which is defined as

https://github.com/Lightning-AI/lightning/blob/ca1917ec80e6a167111ba441154188aa53c9b79b/src/pytorch_lightning/trainer/trainer.py#L2538-L2544

(and thus counts all optimizer steps)

  2. In the docstring, it's written that the default value for the metric (during training) is self.global_step (which returns self.trainer.global_step):

https://github.com/Lightning-AI/lightning/blob/759e89df21684277da996dd25781dee7f21e1b1c/src/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L85-L86

but it is not.

Would it be possible to fix them? @carmocca

@carmocca
Contributor

The docstring you point out in (2) is outdated. Would you like to send a PR updating it?

@Borda Borda self-assigned this Nov 7, 2022
@gau-nernst
Contributor

Just adding to this about GAN training: many GANs train the discriminator n times (e.g. 5 times) for every generator optimization step. This further messes up global_step in Lightning, so global_step will not even be a multiple of the number of batches anymore.

@justusschock
Member

@gau-nernst this is actually expected behavior. As @carmocca pointed out, the global step is the sum of all optimizer steps. If you want it to be equal to the number of batches, update only one optimizer per batch.

I.e. instead of something like

if batch_idx % 5 == 0:
   # optim1 update (steps only on every 5th batch)

# optim 2 update (steps on every batch, so those batches step twice)

you do something like

if batch_idx % 6 == 0:
    # optim 1 update
else:
    # optim 2 update
# exactly one optimizer steps per batch, so global_step matches the batch count
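A fuller manual-optimization sketch of the second pattern (generator_loss and discriminator_loss are placeholder methods, and configure_optimizers is omitted):

import pytorch_lightning as pl

class GAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False        # we call optimizer.step() ourselves

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()
        if batch_idx % 6 == 0:
            loss = self.generator_loss(batch)      # placeholder
            opt = opt_g
        else:
            loss = self.discriminator_loss(batch)  # placeholder
            opt = opt_d
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()  # exactly one optimizer.step() per batch, so
                    # trainer.global_step stays equal to the batch count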

@yuchenlichuck

I came across the same problem: I am now using my own optimizer, but the global step doesn't increase. It only changes when I call optimizers().step(); however, when I use optimizers().step(), the optimizer isn't correct.

@gau-nernst
Contributor

This is Lightning's design choice; they probably won't fix it. Just write your own training script with either Lightning Fabric or HF Accelerate.

@carmocca
Contributor

Since all of the design differences have been acknowledged already and changing them would require annoying breaking changes, I'll go ahead and close this.

@yuchenlichuck I didn't understand your issue, but note that optimizer.step() needs to be called for the global_step to increase. Otherwise you'll run into #16143

@carmocca carmocca closed this as not planned on Sep 25, 2023