
Distributed package doesn't have NCCL / The requested address is not valid in its context. #104

Closed
Tophness opened this issue Mar 4, 2023 · 7 comments
Labels
model-usage issues related to how models are used/loaded

Comments

@Tophness

Tophness commented Mar 4, 2023

(venv) D:\Downloads\LLaMA>torchrun --nproc_per_node 2 example.py --ckpt_dir models/13B --tokenizer_path models/tokenizer.model
NOTE: Redirects are currently not supported in Windows or MacOs.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ChrisPC]:29500 (system error: 10049 - The requested address is not valid in its context.).
(the same warning is printed six times, once per connection attempt)
(both worker processes print the same traceback; shown once)
Traceback (most recent call last):
  File "D:\Downloads\LLaMA\example.py", line 119, in <module>
    fire.Fire(main)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\Downloads\LLaMA\example.py", line 74, in main
    local_rank, world_size = setup_model_parallel()
  File "D:\Downloads\LLaMA\example.py", line 23, in setup_model_parallel
    torch.distributed.init_process_group("nccl")
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9708) of binary: D:\Downloads\LLaMA\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\chris\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Downloads\LLaMA\venv\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\run.py", line 762, in main
    run(args)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\run.py", line 753, in run
    elastic_launch(
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\launcher\api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-04_18:21:06
  host      : ChrisPC
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2288)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-04_18:21:06
  host      : ChrisPC
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9708)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
@neuhaus

neuhaus commented Mar 4, 2023

NCCL is not available on Windows. Switch to Linux, or change "nccl" to "gloo" here in example.py.
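If it helps, here is a minimal sketch of that switch. It assumes the backend is chosen inside `setup_model_parallel()` in example.py, as in the traceback above; the helper name `pick_backend` is mine, not part of the repo:

```python
import sys

def pick_backend() -> str:
    """Pick a torch.distributed backend that exists in this build.

    Windows/macOS wheels of PyTorch ship without NCCL, so fall back
    to gloo there; on Linux, nccl is the fast path for multi-GPU.
    """
    return "nccl" if sys.platform.startswith("linux") else "gloo"

# In setup_model_parallel(), replace the hard-coded backend string:
# torch.distributed.init_process_group(pick_backend())
```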

@Tophness
Author

Tophness commented Mar 4, 2023

Won't that use CPU instead of GPU?

@Inserian

Inserian commented Mar 4, 2023

NCCL is a pain. I'm assuming you're running this on Windows in conda or a similar environment? The easiest way is to just use the HPC SDK, as it includes NCCL. However, you will most likely have to download the tar from NVIDIA and extract it yourself. Ensure you have full privileges or it won't work.
https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html

@TanaroSch

@Inserian I encounter the same error on Ubuntu 20.04 with the nvidia-hpc-sdk module enabled. Do you know if there might be another error preventing LLaMA from using NCCL?
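One way to narrow that down is to ask the installed PyTorch build directly whether NCCL was compiled in. A small diagnostic sketch (`nccl_status` is a name I made up, not part of LLaMA or torch):

```python
def nccl_status() -> str:
    """Report whether this PyTorch build has the NCCL backend compiled in."""
    try:
        import torch.distributed as dist
    except ImportError:
        return "torch not installed"
    if not dist.is_available():
        return "distributed package not available"
    # is_nccl_available() is False on builds compiled without NCCL support,
    # even if a system-wide NCCL (e.g. from the HPC SDK) is present.
    return "nccl built in" if dist.is_nccl_available() else "nccl missing from this build"

print(nccl_status())
```

If this prints "nccl missing from this build", installing NCCL system-wide won't help; the torch wheel itself has to be one built with NCCL support.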

@Tophness
Author

Tophness commented Mar 7, 2023

I assumed we would just be running the smaller models on our own GPUs without distributed training.
Any chance an RTX 4080 can run 13B if we trade off VRAM for generation time?

@MaximilianDueppe

I had the same issues y'all described. I tried everything I could find, and finally found my problem: if you install PyTorch via conda, the standard package is CPU-only. Here is a link with further information on how to download the GPU variant of PyTorch.

https://pytorch.org/get-started/locally/

I hope this helps at least some of you.
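To verify which variant you actually ended up with, you can check whether the installed build sees CUDA at all. An illustrative snippet (the function name is mine; torch may simply not be installed in a fresh environment):

```python
def torch_build_report() -> str:
    """Summarise the installed PyTorch build: version string and CUDA visibility."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    # CPU-only builds typically carry a "+cpu" version suffix and report
    # cuda available: False even on machines with an NVIDIA GPU.
    return f"torch {torch.__version__}, cuda available: {torch.cuda.is_available()}"

print(torch_build_report())
```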

@WuhanMonkey WuhanMonkey added the model-usage issues related to how models are used/loaded label Sep 6, 2023
@WuhanMonkey

Seems like the issue is resolved by the suggestions above. Please re-open as needed with more detail.
