I am using the code from the multinode.py file (from this DDP tutorial series: https://www.youtube.com/watch?v=KaAJtI1T2x4) with the following Slurm script:

```
#SBATCH -N 2
#SBATCH --gres=gpu:volta:1
#SBATCH -c 10
source /etc/profile.d/modules.sh
module load anaconda/2023a
module load cuda/11.6
module load nccl/2.11.4-cuda11.6
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
echo Node IP: $head_node_ip
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
srun torchrun \
    --nnodes 2 \
    --nproc_per_node 1 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29503 \
    multi_tutorial.py 50 10
```
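To figure out where Slurm is actually placing the two tasks, I am planning to add a quick check just above the torchrun line; something like this (a debugging sketch only, not part of the tutorial script):

```
# Debugging sketch: print which node each srun task lands on
# and which GPU it is allowed to see, before launching torchrun.
srun bash -c 'echo "task ${SLURM_PROCID} on $(hostname): CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"'
```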
However, when I submit the script above, it gives the following error:
```
Node IP: 172.31.130.84
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : /home/gridsan/rmehta/potential_function/multi_tutorial.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 1
run_id : 7644
rdzv_backend : c10d
rdzv_endpoint : 172.31.130.84:29503
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : /home/gridsan/rmehta/potential_function/multi_tutorial.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 1
run_id : 7644
rdzv_backend : c10d
rdzv_endpoint : 172.31.130.84:29503
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /state/partition1/slurm_tmp/26645921.1.1/torchelastic_00qy3rwa/7644__t3nqnre
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.9
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:426] [c10d] The server socket has failed to listen on [::]:29503 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:29503 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /state/partition1/slurm_tmp/26645921.1.0/torchelastic_sxs5m6o3/7644_v74ll5_1
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.9
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=d-9-11-1.supercloud.mit.edu
master_port=58139
group_rank=0
group_world_size=2
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[2]
global_world_sizes=[2]
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=d-9-11-1.supercloud.mit.edu
master_port=58139
group_rank=1
group_world_size=2
local_ranks=[0]
role_ranks=[1]
global_ranks=[1]
role_world_sizes=[2]
global_world_sizes=[2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /state/partition1/slurm_tmp/26645921.1.1/torchelastic_00qy3rwa/7644__t3nqnre/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /state/partition1/slurm_tmp/26645921.1.0/torchelastic_sxs5m6o3/7644_v74ll5_1/attempt_0/0/error.json
d-9-11-1:2870757:2870757 [0] NCCL INFO Bootstrap : Using ens2f0:172.31.130.84<0>
d-9-11-1:2870757:2870757 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
d-9-11-1:2870757:2870757 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.14.3+cuda11.6
d-9-11-1:2870756:2870756 [0] NCCL INFO cudaDriverVersion 12020
d-9-11-1:2870756:2870756 [0] NCCL INFO Bootstrap : Using ens2f0:172.31.130.84<0>
d-9-11-1:2870756:2870756 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
d-9-11-1:2870756:2870934 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB ens2f0:172.31.130.84<0>
d-9-11-1:2870756:2870934 [0] NCCL INFO Using network IB
d-9-11-1:2870757:2870933 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB ens2f0:172.31.130.84<0>
d-9-11-1:2870757:2870933 [0] NCCL INFO Using network IB
d-9-11-1:2870757:2870933 [0] init.cc:525 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 86000
d-9-11-1:2870757:2870933 [0] NCCL INFO init.cc:1089 -> 5
d-9-11-1:2870757:2870933 [0] NCCL INFO group.cc:64 -> 5 [Async thread]
d-9-11-1:2870756:2870934 [0] init.cc:525 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 86000
d-9-11-1:2870756:2870934 [0] NCCL INFO init.cc:1089 -> 5
d-9-11-1:2870756:2870934 [0] NCCL INFO group.cc:64 -> 5 [Async thread]
d-9-11-1:2870756:2870756 [0] NCCL INFO group.cc:421 -> 3
d-9-11-1:2870756:2870756 [0] NCCL INFO group.cc:106 -> 3
d-9-11-1:2870757:2870757 [0] NCCL INFO group.cc:421 -> 3
d-9-11-1:2870757:2870757 [0] NCCL INFO group.cc:106 -> 3
d-9-11-1:2870757:2870757 [0] NCCL INFO comm 0x560c6a8aafd0 rank 0 nranks 2 cudaDev 0 busId 86000 - Abort COMPLETE
d-9-11-1:2870756:2870756 [0] NCCL INFO comm 0x55592676f080 rank 1 nranks 2 cudaDev 0 busId 86000 - Abort COMPLETE
Traceback (most recent call last):
File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 113, in <module>
Traceback (most recent call last):
File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 113, in <module>
main(args.save_every, args.total_epochs, args.batch_size)
File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 100, in main
trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 39, in __init__
self.model = DDP(self.model, device_ids=[self.local_rank])
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
main(args.save_every, args.total_epochs, args.batch_size)
File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 100, in main
trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 39, in __init__
self.model = DDP(self.model, device_ids=[self.local_rank])
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
_verify_param_shape_across_processes(self.process_group, parameters)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 86000
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 86000
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2870757) of binary: /state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/python3.9
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2870756) of binary: /state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/python3.9
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00014710426330566406 seconds
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004813671112060547 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 0 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/torchrun", line 8, in <module>
Traceback (most recent call last):
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
sys.exit(main())
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
return f(*args, **kwargs)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
run(args)
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
elastic_launch(
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
return launch_agent(self._config, self._entrypoint, list(args))
File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/gridsan/rmehta/potential_function/multi_tutorial.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-24_11:39:43
host : d-9-11-1.supercloud.mit.edu
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2870757)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/gridsan/rmehta/potential_function/multi_tutorial.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-24_11:39:43
host : d-9-11-1.supercloud.mit.edu
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 2870756)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: d-9-11-1: tasks 0-1: Exited with exit code 1
```
I am unsure whether the root problem is the server socket failing to bind/listen on the port, with the duplicate-GPU error as a downstream consequence, or whether these are two separate errors. I have tried many different ports, but they all fail the same way ("Address already in use"). Again, my code is identical to the multinode.py example. I would appreciate any help getting to the bottom of this; the only other idea I have so far is sketched below. Thank you.
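Looking at the log again, both ranks ran on d-9-11-1 and tried to use the same GPU (busId 86000), so one thing I am considering is forcing Slurm to place exactly one task per node. Roughly this (an untested sketch; the module loads and head_node_ip detection stay exactly as in the script above):

```
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:volta:1
#SBATCH -c 10

# ... module loads and head_node_ip detection unchanged ...

# Explicitly ask srun for one task on each of the two nodes
srun --nodes=2 --ntasks-per-node=1 torchrun \
    --nnodes 2 \
    --nproc_per_node 1 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29503 \
    multi_tutorial.py 50 10
```

I do not know whether that would also explain the "Address already in use" messages, or if those need a separate fix.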