distributed training: using GPU 0 to perform barrier as devices used by this process are currently unknown. #5769

Open · hiennguyennq opened this issue Oct 21, 2024 · 0 comments
Labels: pending (This problem is yet to be addressed)

hiennguyennq commented Oct 21, 2024

I fine-tuned with the following config and hit a bug related to NCCL.

```yaml
### model
model_name_or_path: Qwen/Qwen2-1.5B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: identity,alpaca_en_demo
template: qwen
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
```

NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR=192.168.195.236 MASTER_PORT=29500 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml
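This is the command on the RANK=1 worker. With NNODES=2, the master node at 192.168.195.236 is assumed to run the identical command with RANK=0 (the standard LLaMA-Factory multi-node pattern; that side of the run is not shown in this report):

```sh
NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.195.236 MASTER_PORT=29500 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml
```

The [rank2]/[rank3] messages in the log below come from the two GPUs on this worker node.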
[2024-10-21 18:28:11,947] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
10/21/2024 18:28:15 - INFO - llamafactory.cli - Initializing distributed tasks at: 192.168.195.236:29500
W1021 18:28:16.888000 16321 site-packages/torch/distributed/run.py:793] 
W1021 18:28:16.888000 16321 site-packages/torch/distributed/run.py:793] *****************************************
W1021 18:28:16.888000 16321 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 18:28:16.888000 16321 site-packages/torch/distributed/run.py:793] *****************************************
[2024-10-21 18:28:21,690] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:28:21,709] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:28:23,555] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-21 18:28:23,607] [INFO] [comm.py:652:init_distributed] cdb=None
10/21/2024 18:28:24 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:675] 2024-10-21 18:28:24,461 >> loading configuration file config.json from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/config.json
[INFO|configuration_utils.py:742] 2024-10-21 18:28:24,464 >> Model config Qwen2Config {
  "_name_or_path": "Qwen/Qwen2-1.5B-Instruct",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "intermediate_size": 8960,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 12,
  "num_hidden_layers": 28,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.45.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

10/21/2024 18:28:24 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:24,671 >> loading file vocab.json from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/vocab.json
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:24,671 >> loading file merges.txt from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/merges.txt
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:24,671 >> loading file tokenizer.json from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/tokenizer.json
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:24,672 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:24,672 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:24,672 >> loading file tokenizer_config.json from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/tokenizer_config.json
[INFO|tokenization_utils_base.py:2470] 2024-10-21 18:28:24,951 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:675] 2024-10-21 18:28:25,799 >> loading configuration file config.json from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/config.json
[INFO|configuration_utils.py:742] 2024-10-21 18:28:25,801 >> Model config Qwen2Config {
  "_name_or_path": "Qwen/Qwen2-1.5B-Instruct",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "intermediate_size": 8960,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 12,
  "num_hidden_layers": 28,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.45.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:26,008 >> loading file vocab.json from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/vocab.json
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:26,008 >> loading file merges.txt from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/merges.txt
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:26,008 >> loading file tokenizer.json from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/tokenizer.json
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:26,008 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:26,008 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:2206] 2024-10-21 18:28:26,008 >> loading file tokenizer_config.json from cache at /home/user/.cache/huggingface/hub/models--Qwen--Qwen2-1.5B-Instruct/snapshots/ba1cf1846d7df0a0591d6c00649f57e798519da8/tokenizer_config.json
[INFO|tokenization_utils_base.py:2470] 2024-10-21 18:28:26,281 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
10/21/2024 18:28:26 - WARNING - llamafactory.model.loader - Processor was not found: 'Qwen2Config' object has no attribute 'vision_config'.
10/21/2024 18:28:26 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
10/21/2024 18:28:26 - INFO - llamafactory.data.loader - Loading dataset identity.json...
10/21/2024 18:28:26 - WARNING - llamafactory.model.loader - Processor was not found: 'Qwen2Config' object has no attribute 'vision_config'.
10/21/2024 18:28:26 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
[rank3]:[W1021 18:28:26.334106557 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Converting format of dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 91/91 [00:00<00:00, 394.08 examples/s]
10/21/2024 18:28:27 - INFO - llamafactory.data.loader - Loading dataset alpaca_en_demo.json...
Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 3216.64 examples/s]
[rank2]:[W1021 18:28:28.615557891 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/user/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank2]:     launch()
[rank2]:   File "/home/user/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank2]:     run_exp()
[rank2]:   File "/home/user/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank2]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]:   File "/home/user/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank2]:     dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank2]:   File "/home/user/LLaMA-Factory/src/llamafactory/data/loader.py", line 264, in get_dataset
[rank2]:     with training_args.main_process_first(desc="load dataset"):
[rank2]:   File "/home/user/enter/envs/train/lib/python3.10/contextlib.py", line 142, in __exit__
[rank2]:     next(self.gen)
[rank2]:   File "/home/user/enter/envs/train/lib/python3.10/site-packages/transformers/training_args.py", line 2442, in main_process_first
[rank2]:     dist.barrier()
[rank2]:   File "/home/user/enter/envs/train/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/home/user/enter/envs/train/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank2]:     work = group.barrier(opts=opts)
[rank2]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: failed to recv, got 0 bytes
[rank2]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
[rank2]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fdffaf6c446 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
[rank2]: frame #1: <unknown function> + 0x5fed998 (0x7fdfeab5a998 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x35b (0x7fdfeab5764b in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x2a (0x7fdfeab579ca in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #4: c10d::TCPStore::get(std::string const&) + 0x7a (0x7fdfeab5883a in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdfeab08bc1 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdfeab08bc1 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdfeab08bc1 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdfeab08bc1 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x7fdfb0e29b8f in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank2]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xfbd (0x7fdfb0e35b2d in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank2]: frame #11: <unknown function> + 0x11e02ce (0x7fdfb0e3e2ce in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank2]: frame #12: c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&) + 0x12c (0x7fdfb0e3f89c in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank2]: frame #13: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x476 (0x7fdfb0e4d176 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank2]: frame #14: <unknown function> + 0x5f8e3f2 (0x7fdfeaafb3f2 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #15: <unknown function> + 0x5f98bf5 (0x7fdfeab05bf5 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #16: <unknown function> + 0x55b224b (0x7fdfea11f24b in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #17: <unknown function> + 0x55afad9 (0x7fdfea11cad9 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #18: <unknown function> + 0x1a8c3f8 (0x7fdfe65f93f8 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #19: <unknown function> + 0x5fa2a74 (0x7fdfeab0fa74 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #20: <unknown function> + 0x5fa3805 (0x7fdfeab10805 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #21: <unknown function> + 0xdf7358 (0x7fdffa597358 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank2]: frame #22: <unknown function> + 0x4cb474 (0x7fdff9c6b474 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank2]: frame #23: /home/user/enter/envs/train/bin/python() [0x4fcaf7]
[rank2]: frame #24: _PyObject_MakeTpCall + 0x25b (0x4f657b in /home/user/enter/envs/train/bin/python)
[rank2]: frame #25: /home/user/enter/envs/train/bin/python() [0x50861f]
[rank2]: frame #26: _PyEval_EvalFrameDefault + 0x13b2 (0x4ee722 in /home/user/enter/envs/train/bin/python)
[rank2]: frame #27: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #28: _PyEval_EvalFrameDefault + 0x2de4 (0x4f0154 in /home/user/enter/envs/train/bin/python)
[rank2]: frame #29: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #30: _PyEval_EvalFrameDefault + 0x4b2c (0x4f1e9c in /home/user/enter/envs/train/bin/python)
[rank2]: frame #31: /home/user/enter/envs/train/bin/python() [0x56edd7]
[rank2]: frame #32: /home/user/enter/envs/train/bin/python() [0x4fd124]
[rank2]: frame #33: _PyEval_EvalFrameDefault + 0x31f (0x4ed68f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #34: /home/user/enter/envs/train/bin/python() [0x50832e]
[rank2]: frame #35: _PyEval_EvalFrameDefault + 0x31f (0x4ed68f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #36: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #37: PyObject_Call + 0xb8 (0x508cd8 in /home/user/enter/envs/train/bin/python)
[rank2]: frame #38: _PyEval_EvalFrameDefault + 0x2de4 (0x4f0154 in /home/user/enter/envs/train/bin/python)
[rank2]: frame #39: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #40: _PyEval_EvalFrameDefault + 0x31f (0x4ed68f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #41: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #42: _PyEval_EvalFrameDefault + 0x31f (0x4ed68f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #43: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #44: _PyEval_EvalFrameDefault + 0x31f (0x4ed68f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #45: /home/user/enter/envs/train/bin/python() [0x5924f2]
[rank2]: frame #46: PyEval_EvalCode + 0x87 (0x592437 in /home/user/enter/envs/train/bin/python)
[rank2]: frame #47: /home/user/enter/envs/train/bin/python() [0x5c3237]
[rank2]: frame #48: /home/user/enter/envs/train/bin/python() [0x5be380]
[rank2]: frame #49: /home/user/enter/envs/train/bin/python() [0x4598d6]
[rank2]: frame #50: _PyRun_SimpleFileObject + 0x19f (0x5b890f in /home/user/enter/envs/train/bin/python)
[rank2]: frame #51: _PyRun_AnyFileObject + 0x43 (0x5b8673 in /home/user/enter/envs/train/bin/python)
[rank2]: frame #52: Py_RunMain + 0x38d (0x5b542d in /home/user/enter/envs/train/bin/python)
[rank2]: frame #53: Py_BytesMain + 0x39 (0x585609 in /home/user/enter/envs/train/bin/python)
[rank2]: frame #54: <unknown function> + 0x29d90 (0x7fdffbfb9d90 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]: frame #55: __libc_start_main + 0x80 (0x7fdffbfb9e40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]: frame #56: /home/user/enter/envs/train/bin/python() [0x5854be]
[rank2]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/user/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank3]:     launch()
[rank3]:   File "/home/user/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank3]:     run_exp()
[rank3]:   File "/home/user/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]:   File "/home/user/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank3]:     dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank3]:   File "/home/user/LLaMA-Factory/src/llamafactory/data/loader.py", line 264, in get_dataset
[rank3]:     with training_args.main_process_first(desc="load dataset"):
[rank3]:   File "/home/user/enter/envs/train/lib/python3.10/contextlib.py", line 135, in __enter__
[rank3]:     return next(self.gen)
[rank3]:   File "/home/user/enter/envs/train/lib/python3.10/site-packages/transformers/training_args.py", line 2433, in main_process_first
[rank3]:     dist.barrier()
[rank3]:   File "/home/user/enter/envs/train/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/home/user/enter/envs/train/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank3]:     work = group.barrier(opts=opts)
[rank3]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: failed to recv, got 0 bytes
[rank3]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
[rank3]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd88df6c446 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
[rank3]: frame #1: <unknown function> + 0x5fed998 (0x7fd87db5a998 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x35b (0x7fd87db5764b in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x2a (0x7fd87db579ca in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #4: c10d::TCPStore::get(std::string const&) + 0x7a (0x7fd87db5883a in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd87db08bc1 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd87db08bc1 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd87db08bc1 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd87db08bc1 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x7fd843e29b8f in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xfbd (0x7fd843e35b2d in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #11: <unknown function> + 0x11e02ce (0x7fd843e3e2ce in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #12: c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&) + 0x12c (0x7fd843e3f89c in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #13: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x476 (0x7fd843e4d176 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #14: <unknown function> + 0x5f8e3f2 (0x7fd87dafb3f2 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #15: <unknown function> + 0x5f98bf5 (0x7fd87db05bf5 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #16: <unknown function> + 0x55b224b (0x7fd87d11f24b in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #17: <unknown function> + 0x55afad9 (0x7fd87d11cad9 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #18: <unknown function> + 0x1a8c3f8 (0x7fd8795f93f8 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #19: <unknown function> + 0x5fa2a74 (0x7fd87db0fa74 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #20: <unknown function> + 0x5fa3805 (0x7fd87db10805 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #21: <unknown function> + 0xdf7358 (0x7fd88d597358 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank3]: frame #22: <unknown function> + 0x4cb474 (0x7fd88cc6b474 in /home/user/enter/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank3]: frame #23: /home/user/enter/envs/train/bin/python() [0x4fcaf7]
[rank3]: frame #24: _PyObject_MakeTpCall + 0x25b (0x4f657b in /home/user/enter/envs/train/bin/python)
[rank3]: frame #25: /home/user/enter/envs/train/bin/python() [0x50861f]
[rank3]: frame #26: _PyEval_EvalFrameDefault + 0x13b2 (0x4ee722 in /home/user/enter/envs/train/bin/python)
[rank3]: frame #27: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #28: _PyEval_EvalFrameDefault + 0x2de4 (0x4f0154 in /home/user/enter/envs/train/bin/python)
[rank3]: frame #29: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #30: _PyEval_EvalFrameDefault + 0x4b2c (0x4f1e9c in /home/user/enter/envs/train/bin/python)
[rank3]: frame #31: /home/user/enter/envs/train/bin/python() [0x56edd7]
[rank3]: frame #32: /home/user/enter/envs/train/bin/python() [0x4fd124]
[rank3]: frame #33: _PyEval_EvalFrameDefault + 0x31f (0x4ed68f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #34: /home/user/enter/envs/train/bin/python() [0x5085b7]
[rank3]: frame #35: _PyEval_EvalFrameDefault + 0x2819 (0x4efb89 in /home/user/enter/envs/train/bin/python)
[rank3]: frame #36: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #37: PyObject_Call + 0xb8 (0x508cd8 in /home/user/enter/envs/train/bin/python)
[rank3]: frame #38: _PyEval_EvalFrameDefault + 0x2de4 (0x4f0154 in /home/user/enter/envs/train/bin/python)
[rank3]: frame #39: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #40: _PyEval_EvalFrameDefault + 0x31f (0x4ed68f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #41: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #42: _PyEval_EvalFrameDefault + 0x31f (0x4ed68f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #43: _PyFunction_Vectorcall + 0x6f (0x4fcf3f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #44: _PyEval_EvalFrameDefault + 0x31f (0x4ed68f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #45: /home/user/enter/envs/train/bin/python() [0x5924f2]
[rank3]: frame #46: PyEval_EvalCode + 0x87 (0x592437 in /home/user/enter/envs/train/bin/python)
[rank3]: frame #47: /home/user/enter/envs/train/bin/python() [0x5c3237]
[rank3]: frame #48: /home/user/enter/envs/train/bin/python() [0x5be380]
[rank3]: frame #49: /home/user/enter/envs/train/bin/python() [0x4598d6]
[rank3]: frame #50: _PyRun_SimpleFileObject + 0x19f (0x5b890f in /home/user/enter/envs/train/bin/python)
[rank3]: frame #51: _PyRun_AnyFileObject + 0x43 (0x5b8673 in /home/user/enter/envs/train/bin/python)
[rank3]: frame #52: Py_RunMain + 0x38d (0x5b542d in /home/user/enter/envs/train/bin/python)
[rank3]: frame #53: Py_BytesMain + 0x39 (0x585609 in /home/user/enter/envs/train/bin/python)
[rank3]: frame #54: <unknown function> + 0x29d90 (0x7fd88ef3fd90 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]: frame #55: __libc_start_main + 0x80 (0x7fd88ef3fe40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]: frame #56: /home/user/enter/envs/train/bin/python() [0x5854be]
[rank3]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[rank2]:[W1021 18:28:29.514166475 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W1021 18:28:30.343000 16321 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 16381 closing signal SIGTERM
E1021 18:28:30.509000 16321 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 16382) of binary: /home/user/enter/envs/train/bin/python
Traceback (most recent call last):
  File "/home/user/enter/envs/train/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/enter/envs/train/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/user/enter/envs/train/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/user/enter/envs/train/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/user/enter/envs/train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/enter/envs/train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/user/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-21_18:28:30
  host      : a82f44beb-ab8b-41f5-b995-1adc32a600e0
  rank      : 3 (local_rank: 1)
  exitcode  : 1 (pid: 16382)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
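The warning at the top of the failure ("using GPU 0 to perform barrier as devices used by this process are currently unknown") names its own workarounds: bind each process to its GPU before the first collective, or pass the device explicitly. A minimal standalone sketch, not LLaMA-Factory's own code (the barrier here fires inside transformers' `main_process_first`); it assumes the usual torchrun `LOCAL_RANK` convention and a recent PyTorch that accepts `device_id` in `init_process_group`:

```python
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun per worker

# Option 1: pin the CUDA device before init so NCCL knows the rank-to-GPU
# mapping up front.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# Option 2 (recent PyTorch): declare the device at init time instead.
# dist.init_process_group(backend="nccl",
#                         device_id=torch.device(f"cuda:{local_rank}"))

# With the device known, barrier() no longer has to guess; it can also be
# pinned per call:
dist.barrier(device_ids=[local_rank])

dist.destroy_process_group()  # also avoids the shutdown warning seen above
```

That said, the warning is likely only a symptom. The actual DistBackendError ("failed to recv, got 0 bytes" from the TCPStore while fetching the ncclUniqueId from rank 0) matches the final hint in the trace: rank 0 either crashed or never reached the rendezvous, so the first things to check are that the master node is running and that 192.168.195.236:29500 is reachable from this node.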