Reminder
System Info
LLaMA-Factory v0.8.3
Reproduction
FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
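Before digging into the logs below, a quick TCP reachability probe from the worker node to MASTER_ADDR:MASTER_PORT can rule out firewall or routing problems. A minimal sketch (the address and port are the placeholder values from the commands above; run it on the worker while the master command is already started, since the port is only bound once torchrun is listening):

import socket

MASTER_ADDR = "192.168.0.1"   # placeholder from the commands above; use the real master IP
MASTER_PORT = 29500

# Open a plain TCP connection to the rendezvous endpoint and close it again.
# If this raises ConnectionRefusedError or times out, the two nodes cannot
# reach each other on this port and no torchrun setting will help.
with socket.create_connection((MASTER_ADDR, MASTER_PORT), timeout=5) as sock:
    print("reachable:", sock.getpeername())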
Running these raises the following errors.
Error on the master node:
(llamafactory) linewell@aifs3-master-1:/datas/work/hecheng/LLaMA-Factory1/LLaMA-Factory$ FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.175.4 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[2024-09-16 17:38:36,034] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
09/16/2024 17:38:40 - INFO - llamafactory.cli - Initializing distributed tasks at: 192.168.175.4:29500
W0916 17:38:41.930000 139955215721408 torch/distributed/run.py:757]
W0916 17:38:41.930000 139955215721408 torch/distributed/run.py:757] *****************************************
W0916 17:38:41.930000 139955215721408 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0916 17:38:41.930000 139955215721408 torch/distributed/run.py:757] *****************************************
Traceback (most recent call last):
File "/home/linewell/miniforge3/envs/llamafactory/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 870, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
self._rendezvous(worker_group)
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 551, in _rendezvous
workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 638, in _assign_worker_ranks
role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 675, in _share_and_gather
role_infos_bytes = store_util.synchronize(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
agent_data = get_all(store, rank, key_prefix, world_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llamafactory/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
data = store.get(f"{prefix}{idx}")
^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistStoreError: Socket Timeout
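The Socket Timeout above is raised during the elastic agent's rendezvous: the TCPStore on the master never saw the worker's keys within the store timeout. The same handshake can be exercised in isolation with torch.distributed.TCPStore; a minimal sketch, assuming a spare port (29600 here is arbitrary) that is reachable between the nodes:

from datetime import timedelta
from torch.distributed import TCPStore

IS_MASTER = True                      # set to False when running on the worker node
HOST, PORT = "192.168.0.1", 29600     # placeholder master IP and an arbitrary free port

# With world_size=2 the master-side constructor blocks until the worker connects,
# so simply getting past this line already proves the TCP path works.
store = TCPStore(HOST, PORT, world_size=2, is_master=IS_MASTER,
                 timeout=timedelta(seconds=60))

if IS_MASTER:
    store.set("ping", "from-master")
    print("worker answered:", store.get("ack"))   # get() blocks until the worker sets the key
else:
    print("master said:", store.get("ping"))
    store.set("ack", "from-worker")

If this already times out, the problem is at the network level (firewall, wrong interface, NAT) rather than in LLaMA-Factory itself.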
Error on the worker node:
(llama-factory) linewell@aifs3-worker-1:/datas/work/hecheng/LLaMA-Factory$ FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR=192.168.175.4 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[2024-09-16 17:39:19,896] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
09/16/2024 17:39:23 - INFO - llamafactory.cli - Initializing distributed tasks at: 192.168.175.4:29500
W0916 17:39:24.555000 139914969891776 torch/distributed/run.py:779]
W0916 17:39:24.555000 139914969891776 torch/distributed/run.py:779] *****************************************
W0916 17:39:24.555000 139914969891776 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0916 17:39:24.555000 139914969891776 torch/distributed/run.py:779] *****************************************
[rank1]:[E916 17:44:13.419203922 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=616, OpType=ALLREDUCE, NumelIn=17891328, NumelOut=17891328, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
[rank1]:[E916 17:44:13.423659979 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 616, last enqueued NCCL work: 616, last completed NCCL work: 615.
[rank1]: Traceback (most recent call last):
[rank1]: File "/datas/work/hecheng/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank1]: launch()
[rank1]: File "/datas/work/hecheng/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/datas/work/hecheng/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/datas/work/hecheng/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
[rank1]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/accelerate/accelerator.py", line 2188, in backward
[rank1]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
[rank1]: self.engine.backward(loss, **kwargs)
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2046, in backward
[rank1]: self.allreduce_gradients()
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1965, in allreduce_gradients
[rank1]: self.optimizer.overlapping_partition_gradients_reduce_epilogue()
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 868, in overlapping_partition_gradients_reduce_epilogue
[rank1]: self.independent_gradient_partition_epilogue()
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 769, in independent_gradient_partition_epilogue
[rank1]: get_accelerator().synchronize()
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/deepspeed/accelerator/cuda_accelerator.py", line 79, in synchronize
[rank1]: return torch.cuda.synchronize(device_index)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/cuda/init.py", line 892, in synchronize
[rank1]: return torch._C._cuda_synchronize()
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: KeyboardInterrupt
[rank1]:[E916 17:44:14.559958087 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 1] Timeout at NCCL work: 616, last enqueued NCCL work: 616, last completed NCCL work: 615.
[rank1]:[E916 17:44:14.560003542 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E916 17:44:14.560015443 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E916 17:44:14.562376704 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=616, OpType=ALLREDUCE, NumelIn=17891328, NumelOut=17891328, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fab52ee9f86 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fab541e68f2 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fab541ed333 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fab541ef71c in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xbd6df (0x7faba19396df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x76db (0x7faba39776db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7faba2efb61f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=616, OpType=ALLREDUCE, NumelIn=17891328, NumelOut=17891328, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fab52ee9f86 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fab541e68f2 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fab541ed333 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fab541ef71c in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xbd6df (0x7faba19396df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x76db (0x7faba39776db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7faba2efb61f in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fab52ee9f86 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x7fab53e78a84 in /home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xbd6df (0x7faba19396df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x76db (0x7faba39776db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x3f (0x7faba2efb61f in /lib/x86_64-linux-gnu/libc.so.6)
Traceback (most recent call last):
File "/home/linewell/miniforge3/envs/llama-factory/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 829, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 652, in _initialize_workers
self._rendezvous(worker_group)
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 502, in _rendezvous
workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/linewell/miniforge3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 616, in _assign_worker_ranks
store.get(f"{ASSIGNED_RANKS_PREFIX}{group_rank}")
torch.distributed.DistNetworkError: Connection reset by peer
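For what it's worth, the worker side did get through rendezvous and completed 615 NCCL operations; it is allreduce number 616 that stalled for the full 600 s watchdog timeout, which usually points at cross-node NCCL traffic (wrong network interface picked, InfiniBand vs. Ethernet issues) rather than the training code. A standalone smoke test, a minimal sketch to be launched on both nodes with the same torchrun rendezvous settings as above; if it hangs, NCCL_DEBUG=INFO and NCCL_SOCKET_IFNAME are the usual knobs to inspect:

# nccl_check.py — launch on both nodes, e.g.:
#   torchrun --nnodes=2 --node_rank=<0|1> --nproc_per_node=<gpus per node> \
#            --master_addr=192.168.0.1 --master_port=29500 nccl_check.py
import os
from datetime import timedelta

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

# The default NCCL process-group timeout is 10 minutes (the 600000 ms seen above);
# it can be raised here (transformers also exposes a ddp_timeout training argument).
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))

x = torch.ones(1, device="cuda") * dist.get_rank()
dist.all_reduce(x)   # sums the ranks; hangs or times out if NCCL cannot cross nodes
print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> {x.item()}")

dist.destroy_process_group()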
Expected behavior
Training should run normally across both nodes.
Others
Runs normally.