Error handling for accelerator="mps" and strategy="ddp" #16148

Closed
awaelchli opened this issue Dec 21, 2022 · 3 comments · Fixed by #16153 or #16455
Labels: accelerator: mps (Apple Silicon GPU) · bug (Something isn't working) · good first issue (Good for newcomers)

awaelchli (Contributor) commented on Dec 21, 2022

Bug description

The MPS backend does not support torch.distributed. We should fail early in the AcceleratorConnector and produce a user-friendly error message.
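
As a rough illustration of what failing early could look like (a sketch only; the helper name, exception message, and exact placement in the connector are assumptions, not the actual Lightning code):

from pytorch_lightning.accelerators import MPSAccelerator
from pytorch_lightning.strategies import DDPStrategy
from pytorch_lightning.utilities.exceptions import MisconfigurationException


def _validate_mps_strategy(accelerator, strategy) -> None:
    # Hypothetical check: MPS has no torch.distributed backend, so DDP-style
    # strategies cannot work on it. (The spawn/fork variants would need the
    # same treatment.)
    if isinstance(accelerator, MPSAccelerator) and isinstance(strategy, DDPStrategy):
        raise MisconfigurationException(
            f"`{type(strategy).__name__}` is not supported on the MPS accelerator. "
            "Run on a single device instead, e.g. `Trainer(accelerator='mps', devices=1)`."
        )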

How to reproduce the bug

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    model = BoringModel()
    trainer = Trainer(
        accelerator="mps",
        devices=1,
        strategy="ddp"
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()

Error messages and logs

Traceback (most recent call last):
  File "/Users/adrian/repositories/lightning/examples/pl_bug_report/bug_report_model.py", line 49, in <module>
    run()
  File "/Users/adrian/repositories/lightning/examples/pl_bug_report/bug_report_model.py", line 45, in run
    trainer.fit(model, train_dataloaders=train_data)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/trainer/trainer.py", line 598, in fit
    call._call_and_handle_interrupt(
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/trainer/trainer.py", line 640, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/trainer/trainer.py", line 1074, in _run
    self.strategy.setup(self)
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/ddp.py", line 159, in setup
    self._share_information_to_prevent_deadlock()
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/ddp.py", line 405, in _share_information_to_prevent_deadlock
    self._share_pids()
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/ddp.py", line 423, in _share_pids
    pids = self.all_gather(torch.tensor(os.getpid(), device=self.root_device))
  File "/Users/adrian/repositories/lightning/src/pytorch_lightning/strategies/parallel.py", line 90, in all_gather
    return _all_gather_ddp_if_available(tensor, group=group, sync_grads=sync_grads)
  File "/Users/adrian/repositories/lightning/src/lightning_lite/utilities/distributed.py", line 202, in _all_gather_ddp_if_available
    return _AllGather.apply(tensor, group)
  File "/Users/adrian/repositories/lightning/src/lightning_lite/utilities/distributed.py", line 170, in forward
    torch.distributed.all_gather(gathered_tensor, tensor, group=group)
  File "/Users/adrian/miniconda3/envs/lightning/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2277, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: ProcessGroupGloo::allgather: unsupported device type mps

This applies to the entire DDPStrategy family (ddp, ddp_fork, ddp_spawn, etc.).
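
The root cause is in torch.distributed itself: the Gloo process group rejects MPS tensors, as this minimal standalone sketch shows (run on Apple Silicon; the address/port values are just placeholders):

import os

import torch
import torch.distributed as dist

# Single-process "world", just enough to initialize a Gloo process group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

tensor = torch.tensor(1.0, device="mps")
gathered = [torch.zeros_like(tensor)]
# Raises: RuntimeError: ProcessGroupGloo::allgather: unsupported device type mps
dist.all_gather(gathered, tensor)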

Environment

- Lightning Component: Trainer, LightningLite
- PyTorch Lightning Version: 1.9.0dev
- Lightning App Version: n/a
- PyTorch Version: 1.13
- Python version: 3.10
- OS: macOS
- CUDA/cuDNN version: n/a
- GPU models and configuration: n/a
- How you installed Lightning: pip
- Running environment of LightningApp: n/a

More info

No response

cc @Borda @justusschock

awaelchli added the bug and accelerator: mps labels and removed the needs triage label on Dec 21, 2022
awaelchli added this to the v1.8.x milestone on Dec 21, 2022
awaelchli added the good first issue label on Dec 21, 2022
shenoynikhil (Contributor) commented

I'd like to take this up!

awaelchli (Contributor, Author) commented

Perfect! Ping me here or on the PR if you have any questions.

Borda modified the milestone: v1.8.x → v1.9 on Jan 6, 2023
Borda pushed a commit that referenced this issue Jan 12, 2023
…6153)

Co-authored-by: Justus Schock <[email protected]>
Co-authored-by: Nikhil Shenoy <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: awaelchli <[email protected]>
Fixes #16148
awaelchli (Contributor, Author) commented

Keeping this open for the pending changes to Fabric.

@shenoynikhil if you are still interested in doing the follow-up:

The equivalent logic can be added to the connector in Fabric: src/lightning_fabric/connector.py. It has a very similar structure to the accelerator_connector.py you already modified in the PR, so it should be easy to see where to insert the check. The test can also be carried over almost 1:1 into tests/tests_fabric/test_connector.py (a rough sketch follows below).
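
A rough sketch of what such a test might look like (the exception type, message, and test name are assumptions mirroring the Trainer-side change, not actual repository code; it also assumes a machine where MPS is available, otherwise the availability check would need to be mocked):

import pytest

from lightning_fabric import Fabric
from lightning_fabric.accelerators import MPSAccelerator


@pytest.mark.skipif(not MPSAccelerator.is_available(), reason="requires an MPS device")
@pytest.mark.parametrize("strategy", ["ddp", "ddp_spawn", "ddp_fork"])
def test_ddp_strategy_raises_on_mps(strategy):
    # The connector should reject this combination at construction time,
    # instead of failing later inside all_gather.
    with pytest.raises(ValueError, match="not supported"):
        Fabric(accelerator="mps", devices=1, strategy=strategy)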

Thanks for your help so far, cheers!
