Module 'torch.distributed' has no attribute 'ProcessGroup' when importing PyTorch Lightning #10348
Comments
Hello @Riccorl, I'm not sure why you get that error.
This is the stack trace:
I installed PyTorch like this:
But I guess that the problem is the ARM build (I'm on an M1 CPU).
We can fix this easily, as the error comes from a typing annotation, but we'll also have to add an M1 CI job when one becomes available.
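For context, the failure mode is roughly the sketch below: an annotation that references torch.distributed.ProcessGroup is evaluated at import time, which raises AttributeError on torch builds compiled without distributed support. The function name here is illustrative, not the actual Lightning code; guarding the import on torch.distributed.is_available() is one common fix.

```python
import torch

if torch.distributed.is_available():
    from torch.distributed import ProcessGroup
else:
    # Fallback so the annotation below still resolves on torch builds
    # compiled without distributed support, instead of failing at import time.
    ProcessGroup = None

def barrier(group: "ProcessGroup" = None) -> None:
    # Illustrative signature; only the annotation is relevant to this issue.
    ...
```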
The current init_dist_connection() will do nothing if torch.distributed.is_available() is False. Wrapping the model with DDP will then fail later.
@four4fish Yes, you are right. I tested it and we get this message:
I agree; for users like @Riccorl we could inform them directly that DDP is not available when they request it. Btw, doesn't your PR directly solve the import problem of this issue? I couldn't find other places.
@Riccorl I had a similar issue on an M1 MacBook; it only happens with pytorch=1.10.
@awaelchli I think after the import PR, importing Lightning didn't fail. But when the Trainer calls DDP's setup_distributed(), which calls init_dist_connection(), the availability check happens there. I was proposing: should we throw an exception in init_dist_connection() if torch.distributed is not available?
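A minimal sketch of that proposal (the real init_dist_connection() has a different signature; this only shows where the check would live):

```python
import torch

def init_dist_connection(*args, **kwargs) -> None:
    # Proposed: fail fast with an actionable message instead of silently returning.
    if not torch.distributed.is_available():
        raise RuntimeError(
            "torch.distributed is not available, so DDP cannot be used. "
            "Choose a non-distributed strategy, or use a torch build with distributed support."
        )
    # ... existing process-group initialization continues here ...
```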
I encountered this same issue. I'm building PyTorch Lightning 1.5.0 and PyTorch 1.10.0 from source using the Spack package manager on macOS 10.15.7. Unfortunately, PyTorch distributed doesn't seem to build for me on macOS: pytorch/pytorch#68002. It sounds like requiring distributed support was an accident and will be removed in future releases. Let me know which PR solves this and I'll add a patch to the 1.5.0 release in Spack.
@four4fish your PR (#10418) says "partially fixes". Do we need to re-open this? What's left for us to do here?
I just tried again with PyTorch Lightning 1.5.2 and I'm still seeing numerous issues if PyTorch isn't installed with distributed support.
@adamjstewart I also tested this with PL 1.5.2 and I had no issues. Can you give us your torch version and a reproducible script?
@justusschock sure, my environment looks like:
In order to reproduce this issue, PyTorch must be installed without distributed support, which is commonly the case on macOS:

$ python
>>> import torch
>>> torch.distributed.is_available()
False

Then, the issue (which now looks different than it did in 1.5.0) can be reproduced like so:

$ python
>>> from pytorch_lightning.core.lightning import LightningModule
>>> LightningModule()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 122, in __init__
self._register_sharded_tensor_state_dict_hooks_if_available()
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 2065, in _register_sharded_tensor_state_dict_hooks_if_available
from torch.distributed._sharded_tensor import pre_load_state_dict_hook, state_dict_hook
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharded_tensor/__init__.py", line 5, in <module>
from torch.distributed._sharding_spec import (
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/__init__.py", line 1, in <module>
from .api import (
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/api.py", line 21, in <module>
class DevicePlacementSpec(PlacementSpec):
File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/api.py", line 29, in DevicePlacementSpec
device: torch.distributed._remote_device
AttributeError: module 'torch.distributed' has no attribute '_remote_device'
That error arises due to the automatic registration support for sharded tensors here: https://github.com/PyTorchLightning/pytorch-lightning/blob/2c7c4aab8087d4c1c99c57c7acc66ef9a8e815d4/pytorch_lightning/core/lightning.py#L1988-L1994. We should check whether torch.distributed is available before importing it in that function's implementation.
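A hedged sketch of that check (the method name comes from the traceback above; the hook-registration calls are assumptions about the surrounding code, not a verified patch):

```python
import torch

def _register_sharded_tensor_state_dict_hooks_if_available(self) -> None:
    # Return early on torch builds without distributed support, so the
    # guarded import below is never reached and cannot raise.
    if not torch.distributed.is_available():
        return
    from torch.distributed._sharded_tensor import pre_load_state_dict_hook, state_dict_hook
    self._register_state_dict_hook(state_dict_hook)
    self._register_load_state_dict_pre_hook(pre_load_state_dict_hook, True)
```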
Just wanted to follow up on this and say that all issues I was encountering with non-distributed PyTorch seem to be fixed in 1.5.3. Thanks @ananthsub, @four4fish, and everyone else involved in fixing these!
@adamjstewart I'm using a Mac with M1 and version 1.5.3, and I still get the error ImportError: cannot import name 'ProcessGroup' from 'torch.distributed' when trying to import pytorch_lightning. Have you done anything else to solve this?
Hmm, 1.5.3 just worked for me, no hacks required. Are you sure you're using 1.5.3? You might be hitting a different part of the code than me. Can you share the full stack trace?
Yes, I'm sure I'm using 1.5.3. This is the stack trace when trying to import PyTorch Lightning:
@AdirRahamim that's caused by the same problem described in this issue, but for another dependency in your environment. You can raise this issue on their repository. You can also uninstall the dependency, assuming you are not using it; uninstalling it means it will not get imported, so you won't get the failure.
@carmocca Thanks! Indeed, uninstalling the package solved the problem.
I'm still experiencing this issue on PyTorch Lightning v1.6.0 and PyTorch v1.11.0. Furthermore, …
@schiegl can you share the full error stack trace?
@carmocca This is the stack trace I get when I import PyTorch Lightning with the following conda environment:

name: pl_error
channels:
  - defaults
  - pytorch
  - conda-forge
dependencies:
  - python=3.9
  - numpy=1.21.2
  - pytorch=1.11
  - pytorch-lightning=1.6

Import error:
🐛 Bug

When importing PyTorch Lightning, it throws an AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'. I guess it comes from the fact that I am on macOS (M1) and PyTorch does not provide torch.distributed with its pre-built package. Indeed, torch.distributed.is_available() is False.

To Reproduce
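The original reproduction snippet was lost in this copy; judging from the discussion above, it presumably amounts to importing the package on such a build:

```python
# Assumes a torch build where torch.distributed.is_available() is False
# (e.g. the macOS M1 wheels for torch 1.10).
import pytorch_lightning  # AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'
```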
Environment
conda