
[bug] [docs] Clearer optimizer_step override instructions #4455

Merged
merged 4 commits into from
Nov 2, 2020

Conversation

ananyahjha93
Contributor

@ananyahjha93 ananyahjha93 commented Oct 31, 2020

What does this PR do?

A failing test still needs to be added on master first.

bug: when the user overrides the optimizer_step() function and calls optimizer.step() without using the optimizer_closure parameter, the code crashes for two reasons:

  • accelerator.py's call to optimizer_step is missing the on_tpu=False parameter.
  • training_loop.py passes train_step_and_backward_closure as optimizer_closure, which results in training_step never being called if the user overrides optimizer_step() in LightningModule and does not call optimizer.step(closure=optimizer_closure).

Updates the docs to state that the user must pass the optimizer_closure param to optimizer.step() when overriding the optimizer_step function. This is required because training_loop.py defines train_step_and_backward_closure() within run_training_batch().
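To illustrate why the closure matters, here is a minimal pure-Python sketch (stand-in names, not Lightning's actual internals): the training loop wraps training_step plus backward in a closure and hands it to optimizer_step, so if the override never invokes that closure, training_step silently never runs.

```python
calls = []

def train_step_and_backward_closure():
    # stands in for the closure Lightning builds around training_step() + backward()
    calls.append("training_step")
    return 0.0  # pretend loss

def broken_optimizer_step(optimizer_closure=None):
    # user calls optimizer.step() without the closure: training_step is skipped
    pass

def correct_optimizer_step(optimizer_closure=None):
    # the closure must be invoked so training_step actually executes
    optimizer_closure()

broken_optimizer_step(optimizer_closure=train_step_and_backward_closure)
assert calls == []  # the bug: training_step never ran

correct_optimizer_step(optimizer_closure=train_step_and_backward_closure)
assert calls == ["training_step"]
```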

Fixes #4452, #4447.

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Oct 31, 2020

Codecov Report

Merging #4455 into master will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #4455   +/-   ##
======================================
  Coverage      92%     92%           
======================================
  Files         116     116           
  Lines        8700    8700           
======================================
  Hits         8044    8044           
  Misses        656     656           

@ananyahjha93
Contributor Author

@SeanNaren is adding the test for this.

@SeanNaren SeanNaren changed the title optimizer_step override fix [bug] optimizer_step override fix Oct 31, 2020
@SeanNaren
Contributor

SeanNaren commented Oct 31, 2020

hey @ananyahjha93, I think your bug model script doesn't set TPU cores greater than 1, so TPUAccelerator is never enabled. This is what led to adding on_tpu=False in the accelerator, which isn't correct. Here is the fixed script:

import os

from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl
from pytorch_lightning.utilities import AMPType

import torch
from pytorch_lightning import LightningModule
import torch_xla.core.xla_model as xm


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x['x'] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('fake_test_acc', loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

    def optimizer_step(
            self,
            epoch: int,
            batch_idx: int,
            optimizer,
            optimizer_idx: int,
            optimizer_closure=None,
            on_tpu: bool = False,
            using_native_amp: bool = False,
            using_lbfgs: bool = False
    ) -> None:
        if on_tpu:
            xm.optimizer_step(optimizer, optimizer_args={'closure': optimizer_closure})
        elif self.trainer.amp_backend == AMPType.NATIVE:
            # native amp does not yet support closures.
            # TODO: pass the closure to the step ASAP
            optimizer_closure()
            self.trainer.scaler.step(optimizer)
        elif self.trainer.amp_backend == AMPType.APEX:
            # apex amp does not yet support closures.
            # TODO: pass the closure to the step ASAP
            optimizer_closure()
            optimizer.step()
        else:
            optimizer.step(optimizer_closure)


def test_x(tmpdir):
    num_samples = 10000

    train = RandomDataset(32, num_samples)
    train = DataLoader(train, batch_size=32)

    val = RandomDataset(32, num_samples)
    val = DataLoader(val, batch_size=32)

    test = RandomDataset(32, num_samples)
    test = DataLoader(test, batch_size=32)

    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=1,
        tpu_cores=1
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)


tmpdir = os.getcwd()
test_x(tmpdir)

Also, when overriding optimizer_step, we should expect the user to override the method with the exact same signature, including defaults (as I've done above).

I suggest removing the code changes and keeping documentation fixes in!

@SeanNaren SeanNaren added docs Documentation related working as intended Working as intended and removed bug Something isn't working labels Oct 31, 2020
@SeanNaren SeanNaren changed the title [bug] optimizer_step override fix [docs] Clearer optimizer_step override instructions Oct 31, 2020
@justusschock
Member

justusschock commented Nov 2, 2020

@ananyahjha93 What do you think if we check if optimizer_step takes a closure and if not manually call it beforehand (like we already do for AMP)?

@tchaton
Contributor

tchaton commented Nov 2, 2020

@ananyahjha93 What do you think if we check if optimizer_step takes a closure and if not manually call it beforehand (like we already do for AMP)?

I think it is a good idea !
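The proposed check could be sketched roughly like this (a hypothetical dispatch helper using inspect, invented names, not the actual Lightning implementation): inspect the user's override, and if it doesn't accept an optimizer_closure argument, run the closure manually beforehand, mirroring what is already done for AMP.

```python
import inspect

ran = []

def call_optimizer_step(step_fn, closure, **hook_kwargs):
    # Hypothetical dispatch: if the override accepts `optimizer_closure`,
    # pass the closure through; otherwise run it manually first so that
    # training_step + backward still execute.
    params = inspect.signature(step_fn).parameters
    if "optimizer_closure" in params:
        step_fn(optimizer_closure=closure, **hook_kwargs)
    else:
        closure()
        step_fn(**hook_kwargs)

# demo: an override that never declared the closure parameter
def user_step(epoch, batch_idx):
    ran.append("step")

call_optimizer_step(user_step, closure=lambda: ran.append("closure"),
                    epoch=0, batch_idx=0)
# the closure runs first, then the user's step
assert ran == ["closure", "step"]
```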

@ananyahjha93
Contributor Author

@SeanNaren "I think your bug model script doesn't set TPU cores greater than 1, so TPUAccelerator is never enabled. This is what led to adding on_tpu=false in the accelerator, which isn't correct. Here is the fixed script:" - I don't understand this.

When you set tpu_cores=1 in the bug script, the TPU accelerator is enabled and calls optimizer_step with on_tpu=True. But without the default on_tpu: bool = False in the definition of optimizer_step, it fails, since accelerator.py doesn't pass on_tpu=False to optimizer_step. I wouldn't rely on the user getting the exact definition of optimizer_step (with defaults) right when overriding the method.

@ananyahjha93
Contributor Author

@tchaton @justusschock I think that might be a good idea, we can have a failing test where optimizer.step() without closure fails to call training_step and then add a fix which makes the test pass.

@SeanNaren
Contributor

SeanNaren commented Nov 2, 2020

"I think your bug model script doesn't set TPU cores greater than 1, so TPUAccelerator is never enabled. This is what led to adding on_tpu=false in the accelerator, which isn't correct. Here is the fixed script:" - don't understand this

The bug-report model script you sent me offline never enabled tpu_cores > 1.

I wouldn't rely on the user getting the exact definition of optimizer_step right (with defaults), when overriding the method.

Ah, understood. I had thought we would expect the user to override optimizer_step with the defaults in place.

With more thought, I still think the user should be expected to reproduce the exact signature of the function they are overriding, including defaults. If that is done correctly, everything works as expected; we can't guarantee the base class behaviour if the overridden function signature differs. @ananyahjha93 thoughts?

we can have a failing test where optimizer.step() without closure fails to call training_step and then add a fix which makes the test pass.

I don't think this is a testable feature. After speaking to Justus, it turns out they were talking about custom optimizers, whereas this issue concerns an overridden optimizer_step. I think in this case we just have to make the doc change to ensure that the user passes in the closure as expected.

EDIT: after speaking to ananya, we should remove the default arguments from the LightningModule hook and enforce that parameters are set correctly in the accelerators.

@SeanNaren SeanNaren added bug Something isn't working and removed working as intended Working as intended labels Nov 2, 2020
@SeanNaren SeanNaren changed the title [docs] Clearer optimizer_step override instructions [bug] [docs] Clearer optimizer_step override instructions Nov 2, 2020
@ananyahjha93 ananyahjha93 changed the title [bug] [docs] Clearer optimizer_step override instructions [bug] Clearer optimizer_step override instructions Nov 2, 2020
@ananyahjha93 ananyahjha93 changed the title [bug] Clearer optimizer_step override instructions [bug] [docs] Clearer optimizer_step override instructions Nov 2, 2020
@SeanNaren SeanNaren marked this pull request as ready for review November 2, 2020 18:32
@SeanNaren
Contributor

SeanNaren commented Nov 2, 2020

PR should be ready for review. There are now no defaults set within the module object, and the accelerators are responsible for setting the arguments correctly.

Overridden functions no longer need to include defaults.
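A rough sketch of that design (invented toy code, not the actual accelerator implementation): because the hook signature has no defaults, an accelerator that forgets to pass a flag fails immediately with a TypeError rather than silently proceeding with a wrong default.

```python
# Hypothetical hook with no defaults: every caller must supply every flag.
def optimizer_step(epoch, batch_idx, optimizer, optimizer_idx,
                   optimizer_closure, on_tpu, using_native_amp, using_lbfgs):
    optimizer_closure()
    return on_tpu

# an accelerator now has to spell out every argument explicitly
result = optimizer_step(epoch=0, batch_idx=0, optimizer=None, optimizer_idx=0,
                        optimizer_closure=lambda: None, on_tpu=True,
                        using_native_amp=False, using_lbfgs=False)

# forgetting a flag is caught immediately instead of defaulting wrongly
try:
    optimizer_step(epoch=0, batch_idx=0, optimizer=None, optimizer_idx=0,
                   optimizer_closure=lambda: None, using_native_amp=False,
                   using_lbfgs=False)  # on_tpu missing
    caught = False
except TypeError:
    caught = True
```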

optimizer=optimizer,
optimizer_idx=opt_idx,
optimizer_closure=lambda_closure,
on_tpu=False,
Review comment (ananyahjha93):

on_tpu=False, # TPUAccelerator sets this as True

this should be here

optimizer=optimizer,
optimizer_idx=opt_idx,
optimizer_closure=lambda_closure,
on_tpu=False, # TPUAccelerator sets this as True
Review comment (ananyahjha93):

this should be set to True like before

@ananyahjha93 ananyahjha93 force-pushed the optimizer_step/training_step branch from 840734e to e5d1391 Compare November 2, 2020 20:34
@ananyahjha93 ananyahjha93 force-pushed the optimizer_step/training_step branch from fef10db to 02ca22e Compare November 2, 2020 20:35
@ananyahjha93 ananyahjha93 force-pushed the optimizer_step/training_step branch from d8202ba to 68ffdb0 Compare November 2, 2020 21:15
@SeanNaren SeanNaren merged commit 01ab2a9 into master Nov 2, 2020
@SeanNaren SeanNaren deleted the optimizer_step/training_step branch November 2, 2020 22:13
@Borda Borda added this to the 1.0.x milestone Nov 3, 2020
Borda pushed a commit that referenced this pull request Nov 3, 2020
* fix

* flags

* remove defaults

(cherry picked from commit 01ab2a9)
Borda pushed a commit that referenced this pull request Nov 4, 2020
* fix

* flags

* remove defaults

(cherry picked from commit 01ab2a9)
rohitgr7 pushed a commit that referenced this pull request Nov 21, 2020
Labels: bug (Something isn't working), docs (Documentation related)

Successfully merging this pull request may close these issues:
Training_step outputs not propagated

8 participants