[bug] [docs] Clearer optimizer_step override instructions #4455
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##           master    #4455    +/-   ##
=========================================
  Coverage      92%       92%
=========================================
  Files         116       116
  Lines        8700      8700
=========================================
  Hits         8044      8044
  Misses        656       656
```
@SeanNaren is adding the test for this.
hey @ananyahjha93, I think your bug model script doesn't set TPU cores greater than 1, so TPUAccelerator is never enabled. This is what led to adding on_tpu=false in the accelerator, which isn't correct. Here is the fixed script:

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset
import torch_xla.core.xla_model as xm

import pytorch_lightning as pl
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities import AMPType


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x["x"] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log("fake_test_acc", loss)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

    def optimizer_step(
        self,
        epoch: int,
        batch_idx: int,
        optimizer,
        optimizer_idx: int,
        optimizer_closure=None,
        on_tpu: bool = False,
        using_native_amp: bool = False,
        using_lbfgs: bool = False,
    ) -> None:
        if on_tpu:
            xm.optimizer_step(optimizer, optimizer_args={"closure": optimizer_closure})
        elif self.trainer.amp_backend == AMPType.NATIVE:
            # native amp does not yet support closures.
            # TODO: pass the closure to the step ASAP
            optimizer_closure()
            self.trainer.scaler.step(optimizer)
        elif self.trainer.amp_backend == AMPType.APEX:
            # apex amp does not yet support closures.
            # TODO: pass the closure to the step ASAP
            optimizer_closure()
            optimizer.step()
        else:
            optimizer.step(optimizer_closure)


def test_x(tmpdir):
    num_samples = 10000

    train = RandomDataset(32, num_samples)
    train = DataLoader(train, batch_size=32)

    val = RandomDataset(32, num_samples)
    val = DataLoader(val, batch_size=32)

    test = RandomDataset(32, num_samples)
    test = DataLoader(test, batch_size=32)

    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=1,
        tpu_cores=1,
    )

    # Train the model ⚡
    trainer.fit(model, train, val)
    trainer.test(test_dataloaders=test)


tmpdir = os.getcwd()
test_x(tmpdir)
```

Also, when overriding `optimizer_step`, we should expect the user to override the exact same method with defaults (as I've done above). I suggest removing the code changes and keeping the documentation fixes in!
@ananyahjha93 What do you think if we check whether `optimizer_step` takes a closure and, if not, manually call it beforehand (like we already do for AMP)?
I think it is a good idea!
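For illustration, one way such a check could look (a hypothetical sketch, not code from this PR; `call_optimizer_step` is an invented helper name):

```python
import inspect


def call_optimizer_step(model, optimizer, optimizer_closure, **step_kwargs):
    # Hypothetical helper: inspect the (possibly overridden) optimizer_step.
    # If its signature does not accept a closure, run the closure manually
    # first so training_step/backward still execute, mirroring the AMP path.
    params = inspect.signature(model.optimizer_step).parameters
    accepts_closure = "optimizer_closure" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_closure:
        model.optimizer_step(
            optimizer=optimizer, optimizer_closure=optimizer_closure, **step_kwargs
        )
    else:
        optimizer_closure()
        model.optimizer_step(optimizer=optimizer, **step_kwargs)
```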
@SeanNaren "I think your bug model script doesn't set TPU cores greater than 1, so TPUAccelerator is never enabled. This is what led to adding on_tpu=false in the accelerator, which isn't correct. Here is the fixed script:" - I don't understand this, when you set …
@tchaton @justusschock I think that might be a good idea; we can have a failing test where `optimizer.step()` without the closure fails to call `training_step`, and then add a fix which makes the test pass.
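For example, a rough sketch of what such a test could look like (hypothetical; it reuses `BoringModel` and `RandomDataset` from the script above and a deliberately broken override):

```python
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer


def test_training_step_runs_when_closure_is_ignored(tmpdir):
    # BoringModel and RandomDataset are the classes defined in the script above.
    class BadClosureModel(BoringModel):
        def on_train_start(self):
            self.training_step_called = False

        def training_step(self, batch, batch_idx):
            self.training_step_called = True
            return super().training_step(batch, batch_idx)

        def optimizer_step(self, *args, optimizer=None, optimizer_closure=None, **kwargs):
            # The closure (which wraps training_step + backward) is never used here.
            optimizer.step()

    model = BadClosureModel()
    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True)
    trainer.fit(model, DataLoader(RandomDataset(32, 64), batch_size=2))
    assert model.training_step_called, "training_step was never executed"
```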
Offline, the bug report model script you sent never enabled the TPUAccelerator.
Ah understood, I thought we would expect the user to override `optimizer_step` with defaults. With more thought, I still don't think we should expect the user to override the function without matching the exact function definition they are overriding, which includes defaults. If this is done correctly, then everything works as expected; we can't guarantee the base class is going to run if the overridden function definition isn't the same. @ananyahjha93 thoughts?
I don't think this is a testable feature. After speaking to Justus, they were talking about custom optimizers, but in the case of this issue we're discussing an overridden `optimizer_step`. EDIT: after speaking to ananya, we should remove the default arguments from the LightningModule and enforce that parameters are set correctly in the accelerators.
PR should be ready to review. There are now no defaults set within the module object, and accelerators are responsible for setting arguments correctly. Functions, when overridden, no longer need to contain defaults.
optimizer=optimizer,
optimizer_idx=opt_idx,
optimizer_closure=lambda_closure,
on_tpu=False,
on_tpu=False, # TPUAccelerator sets this as True
this should be here
optimizer=optimizer,
optimizer_idx=opt_idx,
optimizer_closure=lambda_closure,
on_tpu=False, # TPUAccelerator sets this as True
this should be set to True like before
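To make the reviewers' point concrete, here is a simplified sketch of the accelerator-side call being discussed (not the exact diff under review; `SketchAccelerator` and `get_model()` are assumed names). Each accelerator now passes every argument explicitly, and only the TPU accelerator passes `on_tpu=True`:

```python
import torch


class SketchAccelerator:
    def __init__(self, trainer, on_tpu=False):
        self.trainer = trainer
        self.on_tpu = on_tpu  # a TPU accelerator would set this to True

    def optimizer_step(self, optimizer, batch_idx, opt_idx, lambda_closure):
        # Call the user's (possibly overridden) optimizer_step with every
        # argument spelled out, so the override needs no default values.
        model_ref = self.trainer.get_model()
        model_ref.optimizer_step(
            epoch=self.trainer.current_epoch,
            batch_idx=batch_idx,
            optimizer=optimizer,
            optimizer_idx=opt_idx,
            optimizer_closure=lambda_closure,
            on_tpu=self.on_tpu,
            using_native_amp=False,
            using_lbfgs=isinstance(optimizer, torch.optim.LBFGS),
        )
```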
* fix * flags * remove defaults (cherry picked from commit 01ab2a9)
* fix * flags * remove defaults
What does this PR do?
Need to add a failing test first on master.

Bug: when the user overrides the `optimizer_step()` function and calls `optimizer.step()` without using the `optimizer_closure` parameter, the code crashes for 2 reasons: `train_step_and_backward_closure` is passed as `optimizer_closure`, which results in `training_step` not being called if the user overrides `optimizer_step()` in `LightningModule` and does not use `optimizer.step(closure=optimizer_closure)`.

Updates the docs to indicate that the user must pass the `optimizer_closure` param to `optimizer.step()` when overriding the `optimizer_step` function. This is required since `training_loop.py` defines `train_step_and_backward_closure()` within `run_training_batch()`.
Fixes #4452, #4447.
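For reference, the pattern the updated docs point users to looks roughly like this (a minimal sketch, not the exact snippet added to the docs; `LitModel` is an illustrative name):

```python
import torch
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        out = self.layer(batch)
        return torch.nn.functional.mse_loss(out, torch.ones_like(out))

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu, using_native_amp, using_lbfgs):
        # The closure wraps training_step + backward, so it must be handed to
        # optimizer.step(); calling optimizer.step() without it silently skips
        # the training step.
        optimizer.step(closure=optimizer_closure)
```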
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃