[bug] [docs] Clearer optimizer_step override instructions #4455
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##           master    #4455    +/-   ##
=========================================
  Coverage      92%       92%
=========================================
  Files         116       116
  Lines        8700      8700
=========================================
  Hits         8044      8044
  Misses        656       656
```
@SeanNaren is adding the test for this.
hey @ananyahjha93, I think your bug model script doesn't set TPU cores greater than 1, so TPUAccelerator is never enabled. This is what led to adding on_tpu=false in the accelerator, which isn't correct. Here is the fixed script:

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset
import torch_xla.core.xla_model as xm

import pytorch_lightning as pl
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities import AMPType


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x["x"] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log("fake_test_acc", loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

    def optimizer_step(
        self,
        epoch: int,
        batch_idx: int,
        optimizer,
        optimizer_idx: int,
        optimizer_closure=None,
        on_tpu: bool = False,
        using_native_amp: bool = False,
        using_lbfgs: bool = False,
    ) -> None:
        if on_tpu:
            xm.optimizer_step(optimizer, optimizer_args={"closure": optimizer_closure})
        elif self.trainer.amp_backend == AMPType.NATIVE:
            # native amp does not yet support closures.
            # TODO: pass the closure to the step ASAP
            optimizer_closure()
            self.trainer.scaler.step(optimizer)
        elif self.trainer.amp_backend == AMPType.APEX:
            # apex amp does not yet support closures.
            # TODO: pass the closure to the step ASAP
            optimizer_closure()
            optimizer.step()
        else:
            optimizer.step(optimizer_closure)


def test_x(tmpdir):
    num_samples = 10000

    train = RandomDataset(32, num_samples)
    train = DataLoader(train, batch_size=32)

    val = RandomDataset(32, num_samples)
    val = DataLoader(val, batch_size=32)

    test = RandomDataset(32, num_samples)
    test = DataLoader(test, batch_size=32)

    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=1,
        tpu_cores=1,
    )

    # Train the model ⚡
    trainer.fit(model, train, val)
    trainer.test(test_dataloaders=test)


tmpdir = os.getcwd()
test_x(tmpdir)
```

Also, when overriding `optimizer_step`, we should expect the user to override the exact same method with defaults (as I've done above). I suggest removing the code changes and keeping the documentation fixes in!
@ananyahjha93 What do you think if we check whether `optimizer_step` takes a closure and, if not, manually call it beforehand (like we already do for AMP)?
I think it is a good idea!
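For illustration, one way such a check could look (a hypothetical sketch, not code from this PR; `call_optimizer_step` is an invented helper name):

```python
import inspect


def call_optimizer_step(model, optimizer, optimizer_closure, **step_kwargs):
    # Hypothetical helper: inspect the (possibly overridden) optimizer_step.
    # If its signature does not accept a closure, run the closure manually
    # first so training_step/backward still execute, mirroring the AMP path.
    params = inspect.signature(model.optimizer_step).parameters
    accepts_closure = "optimizer_closure" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_closure:
        model.optimizer_step(
            optimizer=optimizer, optimizer_closure=optimizer_closure, **step_kwargs
        )
    else:
        optimizer_closure()
        model.optimizer_step(optimizer=optimizer, **step_kwargs)
```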
@SeanNaren "I think your bug model script doesn't set TPU cores greater than 1, so TPUAccelerator is never enabled. This is what led to adding on_tpu=false in the accelerator, which isn't correct. Here is the fixed script:" - I don't understand this, when you set …
@tchaton @justusschock I think that might be a good idea; we can have a failing test where `optimizer.step()` without the closure fails to call `training_step`, and then add a fix which makes the test pass.
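For example, a rough sketch of what such a test could look like (hypothetical; it reuses `BoringModel` and `RandomDataset` from the script above and a deliberately broken override):

```python
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer


def test_training_step_runs_when_closure_is_ignored(tmpdir):
    # BoringModel and RandomDataset are the classes defined in the script above.
    class BadClosureModel(BoringModel):
        def on_train_start(self):
            self.training_step_called = False

        def training_step(self, batch, batch_idx):
            self.training_step_called = True
            return super().training_step(batch, batch_idx)

        def optimizer_step(self, *args, optimizer=None, optimizer_closure=None, **kwargs):
            # The closure (which wraps training_step + backward) is never used here.
            optimizer.step()

    model = BadClosureModel()
    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True)
    trainer.fit(model, DataLoader(RandomDataset(32, 64), batch_size=2))
    assert model.training_step_called, "training_step was never executed"
```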
Offline, the bug report model script you sent never enabled the TPUAccelerator.
Ah understood, I thought we would expect the user to override `optimizer_step` with defaults. With more thought, I still don't think we should expect the user to override the function without matching the exact function definition they are overriding, which includes defaults. If this is done correctly, then everything works as expected; we can't guarantee the base class is going to run if the overridden function definition isn't the same. @ananyahjha93 thoughts?
I don't think this is a testable feature. After speaking to Justus, they were talking about custom optimizers, but in the case of this issue we're discussing an overridden `optimizer_step`. EDIT: after speaking to ananya, we should remove the default arguments from the LightningModule and enforce that parameters are set correctly in the accelerators.
PR should be ready to review. There are now no defaults set within the module object, and accelerators are responsible for setting arguments correctly. Functions, when overridden, no longer need to contain defaults.
optimizer=optimizer,
optimizer_idx=opt_idx,
optimizer_closure=lambda_closure,
on_tpu=False,
on_tpu=False, # TPUAccelerator sets this as True
this should be here
optimizer=optimizer,
optimizer_idx=opt_idx,
optimizer_closure=lambda_closure,
on_tpu=False, # TPUAccelerator sets this as True
this should be set to True like before
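To make the reviewers' point concrete, here is a simplified sketch of the accelerator-side call being discussed (not the exact diff under review; `SketchAccelerator` and `get_model()` are assumed names). Each accelerator now passes every argument explicitly, and only the TPU accelerator passes `on_tpu=True`:

```python
import torch


class SketchAccelerator:
    def __init__(self, trainer, on_tpu=False):
        self.trainer = trainer
        self.on_tpu = on_tpu  # a TPU accelerator would set this to True

    def optimizer_step(self, optimizer, batch_idx, opt_idx, lambda_closure):
        # Call the user's (possibly overridden) optimizer_step with every
        # argument spelled out, so the override needs no default values.
        model_ref = self.trainer.get_model()
        model_ref.optimizer_step(
            epoch=self.trainer.current_epoch,
            batch_idx=batch_idx,
            optimizer=optimizer,
            optimizer_idx=opt_idx,
            optimizer_closure=lambda_closure,
            on_tpu=self.on_tpu,
            using_native_amp=False,
            using_lbfgs=isinstance(optimizer, torch.optim.LBFGS),
        )
```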
* fix * flags * remove defaults (cherry picked from commit 01ab2a9)
* fix * flags * remove defaults
What does this PR do?
Need to add a failing test first on master.

Bug: when the user overrides the `optimizer_step()` function and calls `optimizer.step()` without using the `optimizer_closure` parameter, the code crashes for 2 reasons: `train_step_and_backward_closure` is passed as `optimizer_closure`, which results in `training_step` not being called if the user overrides `optimizer_step()` in `LightningModule` and does not use `optimizer.step(closure=optimizer_closure)`.

Updates the docs to indicate that the user must pass the `optimizer_closure` param to `optimizer.step()` when overriding the `optimizer_step` function. This is required since `training_loop.py` defines `train_step_and_backward_closure()` within `run_training_batch()`.
Fixes #4452, #4447.
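For reference, the pattern the updated docs point users to looks roughly like this (a minimal sketch, not the exact snippet added to the docs; `LitModel` is an illustrative name):

```python
import torch
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        out = self.layer(batch)
        return torch.nn.functional.mse_loss(out, torch.ones_like(out))

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu, using_native_amp, using_lbfgs):
        # The closure wraps training_step + backward, so it must be handed to
        # optimizer.step(); calling optimizer.step() without it silently skips
        # the training step.
        optimizer.step(closure=optimizer_closure)
```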
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃