
Module 'torch.distributed' has no attribute 'ProcessGroup' when importing PyTorch Lightning #10348

Closed
Riccorl opened this issue Nov 4, 2021 · 24 comments · Fixed by #10359, #10418 or #10621

Labels: bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)
Milestone: 1.5.x

Comments

@Riccorl commented Nov 4, 2021

🐛 Bug

Importing PyTorch Lightning throws AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'. I suspect this is because I am on macOS (M1) and the pre-built PyTorch package does not ship torch.distributed; indeed, torch.distributed.is_available() is False.
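
For anyone hitting this, a quick way to confirm the underlying cause is to check whether your PyTorch build ships distributed support at all (is_available() is the documented check):

import torch

# On PyTorch builds compiled without distributed support (common in some
# macOS/ARM packages at the time), this prints False and parts of the
# torch.distributed namespace are missing.
print(torch.distributed.is_available())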

To Reproduce

import pytorch_lightning

Environment

  • PyTorch Lightning Version: 1.5.0
  • PyTorch Version: 1.10
  • Python version: 3.9
  • OS: macOS
  • How you installed PyTorch: conda
  • Any other relevant information:
Riccorl added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Nov 4, 2021
@awaelchli (Contributor) commented Nov 4, 2021

Hello @Riccorl
Thanks for reporting. Can you please show us the full error so we can check in which module it occurs?

@awaelchli (Contributor)

I'm not sure why you get torch.distributed.is_available() = False on macOS; it should be True. It is for me.

@Riccorl (Author) commented Nov 4, 2021

> Hello @Riccorl Thanks for reporting. Can you please show us the full error so we can check in which module it occurs?

This is the stack trace

Traceback (most recent call last):
  File "/Users/ric/Documents/PhD/Projects/invero-xl/invero_xl/train.py", line 7, in <module>
    import pytorch_lightning as pl
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 26, in <module>
    from pytorch_lightning.callbacks.pruning import ModelPruning
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/callbacks/pruning.py", line 31, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 39, in <module>
    from pytorch_lightning.trainer.connectors.logger_connector.fx_validator import _FxValidator
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/trainer/__init__.py", line 16, in <module>
    from pytorch_lightning.trainer.trainer import Trainer
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 30, in <module>
    from pytorch_lightning.accelerators import Accelerator, IPUAccelerator
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/accelerators/__init__.py", line 13, in <module>
    from pytorch_lightning.accelerators.accelerator import Accelerator  # noqa: F401
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 26, in <module>
    from pytorch_lightning.plugins.precision import ApexMixedPrecisionPlugin, NativeMixedPrecisionPlugin, PrecisionPlugin
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/__init__.py", line 8, in <module>
    from pytorch_lightning.plugins.plugins_registry import (  # noqa: F401
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/plugins_registry.py", line 20, in <module>
    from pytorch_lightning.plugins.training_type.training_type_plugin import TrainingTypePlugin
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/__init__.py", line 1, in <module>
    from pytorch_lightning.plugins.training_type.ddp import DDPPlugin  # noqa: F401
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 68, in <module>
    from torch.distributed.optim import DistributedOptimizer
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/optim/__init__.py", line 37, in <module>
    from .post_localSGD_optimizer import PostLocalSGDOptimizer
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/optim/post_localSGD_optimizer.py", line 2, in <module>
    import torch.distributed.algorithms.model_averaging.averagers as averagers
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/averagers.py", line 5, in <module>
    import torch.distributed.algorithms.model_averaging.utils as utils
  File "/Users/ric/mambaforge/envs/srl/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/utils.py", line 10, in <module>
    params: Iterator[torch.nn.Parameter], process_group: dist.ProcessGroup
AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'

> I'm not sure why you get torch.distributed.is_available() = False on macOS; it should be True. It is for me.

I installed PyTorch like this:

conda install pytorch -c pytorch

But I guess the problem is the ARM build (I'm on an M1 CPU).

awaelchli added the priority: 0 (High priority task) label on Nov 4, 2021
awaelchli added this to the 1.5.x milestone on Nov 4, 2021
@carmocca (Contributor) commented Nov 4, 2021

We can fix this easily, as the error comes from a typing annotation, but we'll also have to add an M1 CI job when one becomes available.
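
For context, the failing line in the trace above is a module-level function signature in torch (process_group: dist.ProcessGroup), which Python evaluates at import time. A minimal sketch of the general fix pattern, with an illustrative function name rather than the actual patch: quote the annotation and import the type only for static type checkers.

from typing import TYPE_CHECKING, Iterator

import torch

if TYPE_CHECKING:
    # Only evaluated by type checkers, never at runtime, so this import
    # cannot fail on builds without distributed support.
    from torch.distributed import ProcessGroup


def average_parameters(params: Iterator[torch.nn.Parameter], process_group: "ProcessGroup") -> None:
    # The string annotation is never evaluated when the module is imported.
    ...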

@four4fish (Contributor) commented Nov 4, 2021

Currently, init_dist_connection() does nothing if torch.distributed.is_available() is False. To wrap a model with DistributedDataParallel(), something like torch.distributed.init_process_group(backend='nccl', world_size=N, init_method='...') is required, right?
Should we raise an exception in setup_distributed() in DDP if torch.distributed.is_available() is False?
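
A minimal sketch of the guard being proposed; the function name mirrors Lightning's init_dist_connection(), but the body here is illustrative rather than the actual implementation:

import torch.distributed as dist


def init_dist_connection(backend: str = "nccl") -> None:
    # Fail fast with an actionable message instead of letting DDP hit a
    # vague "Default process group has not been initialized" error later.
    if not dist.is_available():
        raise RuntimeError(
            "torch.distributed is not available in this PyTorch build; "
            "DDP-based strategies cannot be used."
        )
    if not dist.is_initialized():
        # env:// reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment
        dist.init_process_group(backend=backend, init_method="env://")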

@awaelchli (Contributor)

@four4fish Yes you are right. I tested it and we get this message:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

I agree, for users like @Riccorl we could inform them directly that DDP is not available when torch.distributed.is_available() is False.

Btw, doesn't your PR directly solve the import problem of this issue? I couldn't find other places where it fails.

@anuragsingh31

@Riccorl I had a similar issue on an M1 MacBook; it only happens with pytorch=1.10.
Downgrading torch to '1.9.1.post3' resolved the issue for me.

@four4fish (Contributor)

@awaelchli I think after the import PR, importing Lightning no longer fails. But when the Trainer calls DDP's setup_distributed(), which calls init_dist_connection(), it checks torch.distributed.is_available() before creating the process group. Because torch.distributed.is_available() is False, no process group is created, and DDP fails later. Where exactly does this runtime error happen? When wrapping the model?

I was proposing: should we throw an exception in init_dist_connection() if torch.distributed.is_available() is False?

@adamjstewart (Contributor)

I encountered this same issue. I'm building PyTorch Lightning 1.5.0 and PyTorch 1.10.0 from source using the Spack package manager on macOS 10.15.7. Unfortunately, PyTorch distributed doesn't seem to build for me on macOS: pytorch/pytorch#68002

It sounds like requiring distributed support was an accident and will be removed in future releases. Let me know which PR solves this and I'll add a patch to the 1.5.0 release in Spack.

@carmocca (Contributor) commented Nov 9, 2021

@four4fish your PR (#10418) says "partially fixes".

Do we need to re-open this? What's left for us to do here?

@adamjstewart (Contributor)

I just tried again with PyTorch Lightning 1.5.2 and I'm still seeing numerous issues if PyTorch isn't installed with distributed support.

awaelchli reopened this on Nov 18, 2021
@justusschock (Member) commented Nov 18, 2021

@adamjstewart I also tested this with PL 1.5.2 and I had no issues. Can you give us your torch version and a reproducible script?

@adamjstewart (Contributor)

@justusschock sure, my environment looks like:

  • PyTorch Lightning Version: 1.5.2
  • PyTorch Version: 1.10.0
  • Python version: 3.8.12
  • OS: macOS
  • How you installed PyTorch: spack

In order to reproduce this issue, PyTorch must be installed without distributed support:

$ python
>>> import torch
>>> torch.distributed.is_available()
False

This is commonly the case on macOS. Then, the issue (which now looks different than it did in 1.5.0) can be reproduced like so:

$ python
>>> from pytorch_lightning.core.lightning import LightningModule
>>> LightningModule()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 122, in __init__
    self._register_sharded_tensor_state_dict_hooks_if_available()
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 2065, in _register_sharded_tensor_state_dict_hooks_if_available
    from torch.distributed._sharded_tensor import pre_load_state_dict_hook, state_dict_hook
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharded_tensor/__init__.py", line 5, in <module>
    from torch.distributed._sharding_spec import (
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/__init__.py", line 1, in <module>
    from .api import (
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/api.py", line 21, in <module>
    class DevicePlacementSpec(PlacementSpec):
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/distributed/_sharding_spec/api.py", line 29, in DevicePlacementSpec
    device: torch.distributed._remote_device
AttributeError: module 'torch.distributed' has no attribute '_remote_device'

@ananthsub (Contributor)

That error arises due to the automatic registration support for sharded tensors here: https://github.com/PyTorchLightning/pytorch-lightning/blob/2c7c4aab8087d4c1c99c57c7acc66ef9a8e815d4/pytorch_lightning/core/lightning.py#L1988-L1994

We should check whether torch.distributed is available before importing anything in that function's implementation.
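
Something along these lines, as a sketch (the method name comes from the traceback above; the body is illustrative, not the merged fix):

import torch.distributed as dist


def _register_sharded_tensor_state_dict_hooks_if_available(self) -> None:
    # Return before touching torch.distributed internals on builds
    # compiled without distributed support.
    if not dist.is_available():
        return
    from torch.distributed._sharded_tensor import pre_load_state_dict_hook, state_dict_hook
    # ... register state_dict_hook / pre_load_state_dict_hook as before ...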

@adamjstewart (Contributor)

Just wanted to follow up on this and say that all issues I was encountering with non-distributed PyTorch seem to be fixed in 1.5.3. Thanks @ananthsub @four4fish and everyone else involved in fixing these!

@AdirRahamim

@adamjstewart I'm using a Mac with an M1 and version 1.5.3 and still get the error ImportError: cannot import name 'ProcessGroup' from 'torch.distributed' when trying to import pytorch_lightning. Have you done anything else to solve this?

@adamjstewart (Contributor)

Hmm, 1.5.3 just worked for me, no hacks required. Are you sure you're using 1.5.3? You might be hitting a different part of the code than me. Can you share the full stack trace?

@AdirRahamim commented Nov 28, 2021

Yes, I'm sure I'm using 1.5.3. This is the stack trace from trying to import PyTorch Lightning:

Traceback (most recent call last):
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<stdin>", line 1, in <module>
    import pytorch_lightning as pl
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
    from pytorch_lightning.callbacks.base import Callback
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/callbacks/base.py", line 26, in <module>
    from pytorch_lightning.utilities.types import STEP_OUTPUT
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device  # noqa: F401
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 26, in <module>
    from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_AVAILABLE
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 82, in <module>
    _FAIRSCALE_AVAILABLE = not _IS_WINDOWS and _module_available("fairscale.nn")
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 38, in _module_available
    return find_spec(module_path) is not None
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/importlib/util.py", line 94, in find_spec
    parent = __import__(parent_name, fromlist=['path'])
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/__init__.py", line 15, in <module>
    from . import nn
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/nn/__init__.py", line 9, in <module>
    from .data_parallel import FullyShardedDataParallel, ShardedDataParallel
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/nn/data_parallel/__init__.py", line 8, in <module>
    from .fully_sharded_data_parallel import FullyShardedDataParallel, TrainingState, auto_wrap_bn
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 34, in <module>
    from torch.distributed import ProcessGroup
ImportError: cannot import name 'ProcessGroup' from 'torch.distributed' (/Users/adir.rahamim/miniforge3/envs/cca/lib/python3.8/site-packages/torch/distributed/__init__.py)

@carmocca (Contributor)
@AdirRahamim that's caused by the same problem described in this issue, but in the fairscale repository: https://github.com/facebookresearch/fairscale

You can raise this issue on their repository. You can also uninstall the dependency, assuming you are not using it; once it is uninstalled it will not get imported, so you won't hit the failure.

pip uninstall fairscale
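
For background, the trace shows that Lightning's _module_available("fairscale.nn") probe ends up importing fairscale itself, which is what crashes. A defensive probe that swallows import-time failures looks roughly like this (a minimal sketch with a hypothetical name, not Lightning's actual implementation):

import importlib


def safe_module_available(module_path: str) -> bool:
    # Importing inside try/except catches packages that are installed but
    # crash at import time, e.g. fairscale on a torch build without
    # torch.distributed support.
    try:
        importlib.import_module(module_path)
    except (ImportError, AttributeError):
        return False
    return True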

@AdirRahamim

@carmocca Thanks! Indeed, uninstalling the package solved the problem.

@schiegl commented Apr 7, 2022

I'm still experiencing this issue on PyTorch Lightning v1.6.0 and PyTorch v1.11.0, and torch.distributed.is_available() evaluates to False. Could this have something to do with the fact that I installed the dependencies with miniforge, and therefore from conda-forge?

@carmocca (Contributor)

@schiegl can you share the full error stacktrace?

@schiegl commented Apr 11, 2022

@carmocca This is the stack trace I get when I import PyTorch Lightning with the following environment.yml:

name: pl_error
channels:
  - defaults
  - pytorch
  - conda-forge

dependencies:
  - python=3.9
  - numpy=1.21.2
  - pytorch=1.11
  - pytorch-lightning=1.6

Import error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 30, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/callbacks/__init__.py", line 26, in <module>
    from pytorch_lightning.callbacks.pruning import ModelPruning
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/callbacks/pruning.py", line 31, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/core/__init__.py", line 16, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 41, in <module>
    from pytorch_lightning.trainer.connectors.data_connector import _DataHookSelector
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/trainer/__init__.py", line 16, in <module>
    from pytorch_lightning.trainer.trainer import Trainer
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 34, in <module>
    from pytorch_lightning.accelerators import Accelerator, GPUAccelerator, HPUAccelerator, IPUAccelerator, TPUAccelerator
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/accelerators/__init__.py", line 14, in <module>
    from pytorch_lightning.accelerators.cpu import CPUAccelerator  # noqa: F401
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/accelerators/cpu.py", line 19, in <module>
    from pytorch_lightning.utilities import device_parser
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py", line 18, in <module>
    from pytorch_lightning.plugins.environments import TorchElasticEnvironment
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/__init__.py", line 20, in <module>
    from pytorch_lightning.plugins.training_type.ddp import DDPPlugin
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/__init__.py", line 1, in <module>
    from pytorch_lightning.plugins.training_type.ddp import DDPPlugin  # noqa: F401
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 14, in <module>
    from pytorch_lightning.strategies import DDPStrategy
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/__init__.py", line 14, in <module>
    from pytorch_lightning.strategies.bagua import BaguaStrategy  # noqa: F401
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/bagua.py", line 17, in <module>
    from pytorch_lightning.strategies.ddp import DDPStrategy
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 66, in <module>
    from torch.distributed.algorithms.model_averaging.averagers import ModelAverager
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/averagers.py", line 5, in <module>
    import torch.distributed.algorithms.model_averaging.utils as utils
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pl_error/lib/python3.9/site-packages/torch/distributed/algorithms/model_averaging/utils.py", line 10, in <module>
    params: Iterator[torch.nn.Parameter], process_group: dist.ProcessGroup
AttributeError: module 'torch.distributed' has no attribute 'ProcessGroup'

@JasonTam

@schiegl @carmocca
FWIW, I was also facing this issue on 1.6.0; downgrading to 1.5.3 fixed it for me, though.
