I'm trying to use HivemindStrategy to train a ResNet model on CIFAR-10 across two machines (one with a GPU, one without).
I start training on the CPU-only machine first, and it runs without a problem. Then I copy the initial_peers value to the GPU machine and start training there, but it fails.
The first machine (without a GPU) proceeds with training normally; here's a sample of its output:
Other machines can connect by running the same command:
INITIAL_PEERS=/ip4/135.181.202.15/tcp/34483/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ,/ip4/135.181.202.15/udp/51862/quic/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ python ...
or passing the peers to the strategy:
HivemindStrategy(initial_peers='/ip4/135.181.202.15/tcp/34483/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ,/ip4/135.181.202.15/udp/51862/quic/p2p/12D3KooWKbia9ZD4ayLseSkK8QfxaSqYUwSC6fYibfP8crGRBEaJ')
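For reference, here is a minimal sketch of how that comma-separated peer string could be turned into a clean list before handing it to the strategy. The helper name `parse_initial_peers` is my own invention for illustration, not something hivemind or Lightning provide; the assumption is only that `initial_peers` accepts an iterable of multiaddr strings.

```python
# Hypothetical helper: split a comma-separated INITIAL_PEERS value into a
# list of multiaddrs, dropping empty entries (an unset variable or a
# trailing comma would otherwise produce an empty string).
import os

def parse_initial_peers(raw: str) -> list[str]:
    return [p.strip() for p in raw.split(",") if p.strip()]

peers = parse_initial_peers(os.environ.get("INITIAL_PEERS", ""))
# e.g. HivemindStrategy(target_batch_size=2048, initial_peers=peers)
```

Filtering out empty entries up front is cheap insurance against passing "" down to the p2p daemon.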
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Files already downloaded and verified
Files already downloaded and verified
| Name | Type | Params
---------------------------------
0 | model | ResNet | 11.2 M
---------------------------------
11.2 M Trainable params
0 Non-trainable params
11.2 M Total params
44.696 Total estimated model params size (MB)
Epoch 0: 0%| | 0/3125 [00:00<?, ?it/s]Found per machine batch size automatically from the batch: 16
Epoch 0: 2%|██▏ | 48/3125 [00:11<12:03, 4.25it/s, loss=2.37, v_num=7]
The other machine, however, fails:
/opt/conda/lib/python3.10/site-packages/pl_bolts/callbacks/data_monitor.py:20: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
warn_missing_pkg("wandb")
/opt/conda/lib/python3.10/site-packages/pl_bolts/utils/semi_supervised.py:15: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
warn_missing_pkg("sklearn", pypi_name="scikit-learn")
/opt/conda/lib/python3.10/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:35: UnderReviewWarning: The feature generate_power_seq is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
"lr_options": generate_power_seq(LEARNING_RATE_CIFAR, 11),
/opt/conda/lib/python3.10/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:93: UnderReviewWarning: The feature FeatureMapContrastiveTask is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
contrastive_task: Union[FeatureMapContrastiveTask] = FeatureMapContrastiveTask("01, 02, 11"),
/opt/conda/lib/python3.10/site-packages/pl_bolts/losses/self_supervised_learning.py:234: UnderReviewWarning: The feature AmdimNCELoss is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
self.nce_loss = AmdimNCELoss(tclip)
/opt/conda/lib/python3.10/site-packages/pl_bolts/datamodules/experience_source.py:18: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
warn_missing_pkg("gym")
/opt/conda/lib/python3.10/site-packages/pl_bolts/datamodules/sklearn_datamodule.py:15: UnderReviewWarning: The feature warn_missing_pkg is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
warn_missing_pkg("sklearn")
Global seed set to 7
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
warnings.warn(msg)
Traceback (most recent call last):
File "/workspace/cifar10.py", line 122, in <module>
strategy=HivemindStrategy(target_batch_size=2048,
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/hivemind.py", line 142, in __init__
self.dht = hivemind.DHT(
File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/dht.py", line 88, in __init__
self.run_in_background(await_ready=await_ready)
File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/dht.py", line 148, in run_in_background
self.wait_until_ready(timeout)
File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/dht.py", line 151, in wait_until_ready
self._ready.result(timeout=timeout)
File "/opt/conda/lib/python3.10/site-packages/hivemind/utils/mpfuture.py", line 258, in result
return super().result(timeout)
File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2022/12/22 20:13:57 failed to parse multiaddr "": empty multiaddr
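The "empty multiaddr" message suggests the daemon was handed an empty string as one of its peer addresses. A plain-Python illustration (no hivemind needed, and only a guess at the root cause) of how such an empty entry can appear when a comma-separated peer list is split naively:

```python
# A trailing comma (or an unset/blank INITIAL_PEERS variable) yields an
# empty entry when the string is split on commas.
raw = "/ip4/1.2.3.4/tcp/1/p2p/abc,"   # note the trailing comma
naive = raw.split(",")
print(naive)                           # last element is "" - the "empty multiaddr"

cleaned = [p for p in raw.split(",") if p]
print(cleaned)                         # empty entries filtered out
```

If this is the cause, checking how the peer string is assembled and passed on the failing machine would be the first thing to verify.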