Error handling for accelerator="mps" and strategy="ddp" #16148

Closed
awaelchli opened this issue Dec 21, 2022 · 3 comments · Fixed by #16153 or #16455
Labels: accelerator: mps (Apple Silicon GPU) · bug (Something isn't working) · good first issue (Good for newcomers)

awaelchli (Contributor) commented on Dec 21, 2022

Bug description

The MPS backend does not support torch.distributed. We should fail early in the AcceleratorConnector and produce a user-friendly error message.
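
As a rough illustration of what failing early could look like (a sketch only; the helper name, exception message, and exact placement in the connector are assumptions, not the actual Lightning code):

from pytorch_lightning.accelerators import MPSAccelerator
from pytorch_lightning.strategies import DDPStrategy
from pytorch_lightning.utilities.exceptions import MisconfigurationException


def _validate_mps_strategy(accelerator, strategy) -> None:
    # Hypothetical check: MPS has no torch.distributed backend, so DDP-style
    # strategies cannot work on it. (The spawn/fork variants would need the
    # same treatment.)
    if isinstance(accelerator, MPSAccelerator) and isinstance(strategy, DDPStrategy):
        raise MisconfigurationException(
            f"`{type(strategy).__name__}` is not supported on the MPS accelerator. "
            "Run on a single device instead, e.g. `Trainer(accelerator='mps', devices=1)`."
        )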

How to reproduce the bug

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    model = BoringModel()
    trainer = Trainer(
        accelerator="mps",
        devices=1,
        strategy="ddp"
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()

Error messages and logs

Traceback (most recent call last):
  File "/Users/adrian/repositories/lightning/examples/pl_bug_report/bug_report_model.py", line 49, in <module>
    run()
  File "/Users/adrian/repositories/lightning/examples/pl_bug_report/bug_report_model.py", line 45, in run
    trainer.fit(model, train_dataloaders=train_data)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/trainer/trainer.py", line 598, in fit
    call._call_and_handle_interrupt(
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/trainer/trainer.py", line 640, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/trainer/trainer.py", line 1074, in _run
    self.strategy.setup(self)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/ddp.py", line 159, in setup
    self._share_information_to_prevent_deadlock()
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/ddp.py", line 405, in _share_information_to_prevent_deadlock
    self._share_pids()
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/ddp.py", line 423, in _share_pids
    pids = self.all_gather(torch.tensor(os.getpid(), device=self.root_device))
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/parallel.py", line 90, in all_gather
    return _all_gather_ddp_if_available(tensor, group=group, sync_grads=sync_grads)
  File "/Users/adrian/repositories/lightning/src/lightning_lite/utilities/distributed.py", line 202, in _all_gather_ddp_if_available
    return _AllGather.apply(tensor, group)
  File "/Users/adrian/repositories/lightning/src/lightning_lite/utilities/distributed.py", line 170, in forward
    torch.distributed.all_gather(gathered_tensor, tensor, group=group)
  File "/Users/adrian/miniconda3/envs/lightning/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2277, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: ProcessGroupGloo::allgather: unsupported device type mps

This applies to the entire DDPStrategy family (ddp, ddp_fork, ddp_spawn, etc.).
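
The root cause is in torch.distributed itself: the Gloo process group rejects MPS tensors, as this minimal standalone sketch shows (run on Apple Silicon; the address/port values are just placeholders):

import os

import torch
import torch.distributed as dist

# Single-process "world", just enough to initialize a Gloo process group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

tensor = torch.tensor(1.0, device="mps")
gathered = [torch.zeros_like(tensor)]
# Raises: RuntimeError: ProcessGroupGloo::allgather: unsupported device type mps
dist.all_gather(gathered, tensor)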

Environment

- Lightning Component: Trainer, LightningLite
- PyTorch Lightning Version: 1.9.0dev
- Lightning App Version: n/a
- PyTorch Version: 1.13
- Python version: 3.10
- OS: macOS
- CUDA/cuDNN version: n/a
- GPU models and configuration: n/a
- How you installed Lightning: pip
- Running environment of LightningApp: n/a

More info

No response

cc @Borda @justusschock

awaelchli added the bug and accelerator: mps labels and removed the needs triage label on Dec 21, 2022
awaelchli added this to the v1.8.x milestone on Dec 21, 2022
awaelchli added the good first issue label on Dec 21, 2022
shenoynikhil (Contributor) commented

I'd like to take this up!

awaelchli (Contributor, Author) commented

Perfect! Ping me here or on the PR if you have any questions.

Borda modified the milestone: v1.8.x → v1.9 on Jan 6, 2023
Borda pushed a commit that referenced this issue Jan 12, 2023
…6153)

Co-authored-by: Justus Schock <[email protected]>
Co-authored-by: Nikhil Shenoy <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: awaelchli <[email protected]>
Fixes #16148
awaelchli (Contributor, Author) commented

Keeping this open for the pending changes to Fabric.

@shenoynikhil if you are still interested in doing the follow-up:

The equivalent logic can be added to the connector in Fabric: src/lightning_fabric/connector.py. It has a very similar structure to the accelerator_connector.py you already modified in the PR, so it should be easy to see where to insert the check. The test can also be carried over almost 1:1 into tests/tests_fabric/test_connector.py (a rough sketch follows below).
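
A rough sketch of what such a test might look like (the exception type, message, and test name are assumptions mirroring the Trainer-side change, not actual repository code; it also assumes a machine where MPS is available, otherwise the availability check would need to be mocked):

import pytest

from lightning_fabric import Fabric
from lightning_fabric.accelerators import MPSAccelerator


@pytest.mark.skipif(not MPSAccelerator.is_available(), reason="requires an MPS device")
@pytest.mark.parametrize("strategy", ["ddp", "ddp_spawn", "ddp_fork"])
def test_ddp_strategy_raises_on_mps(strategy):
    # The connector should reject this combination at construction time,
    # instead of failing later inside all_gather.
    with pytest.raises(ValueError, match="not supported"):
        Fabric(accelerator="mps", devices=1, strategy=strategy)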

Thanks for your help so far, cheers!
