
Distributed package doesn't have NCCL / The requested address is not valid in its context. #104

Closed
Tophness opened this issue Mar 4, 2023 · 7 comments
Labels
model-usage issues related to how models are used/loaded

Comments

@Tophness

Tophness commented Mar 4, 2023

(venv) D:\Downloads\LLaMA>torchrun --nproc_per_node 2 example.py --ckpt_dir models/13B --tokenizer_path models/tokenizer.model
NOTE: Redirects are currently not supported in Windows or MacOs.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ChrisPC]:29500 (system error: 10049 - The requested address is not valid in its context.).
(the same warning is printed six times, once per connection attempt)
(both worker processes print the same traceback; shown once)
Traceback (most recent call last):
  File "D:\Downloads\LLaMA\example.py", line 119, in <module>
    fire.Fire(main)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "D:\Downloads\LLaMA\venv\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\Downloads\LLaMA\example.py", line 74, in main
    local_rank, world_size = setup_model_parallel()
  File "D:\Downloads\LLaMA\example.py", line 23, in setup_model_parallel
    torch.distributed.init_process_group("nccl")
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9708) of binary: D:\Downloads\LLaMA\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\chris\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Downloads\LLaMA\venv\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\run.py", line 762, in main
    run(args)
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\run.py", line 753, in run
    elastic_launch(
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\Downloads\LLaMA\venv\lib\site-packages\torch\distributed\launcher\api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-04_18:21:06
  host      : ChrisPC
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2288)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-04_18:21:06
  host      : ChrisPC
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9708)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
@neuhaus

neuhaus commented Mar 4, 2023

NCCL is not available on Windows. Switch to Linux, or change "nccl" to "gloo" here in example.py.
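If it helps, here is a minimal sketch of that switch. It assumes the backend is chosen inside `setup_model_parallel()` in example.py, as in the traceback above; the helper name `pick_backend` is mine, not part of the repo:

```python
import sys

def pick_backend() -> str:
    """Pick a torch.distributed backend that exists in this build.

    Windows/macOS wheels of PyTorch ship without NCCL, so fall back
    to gloo there; on Linux, nccl is the fast path for multi-GPU.
    """
    return "nccl" if sys.platform.startswith("linux") else "gloo"

# In setup_model_parallel(), replace the hard-coded backend string:
# torch.distributed.init_process_group(pick_backend())
```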

@Tophness
Author

Tophness commented Mar 4, 2023

Won't that use CPU instead of GPU?

@Inserian

Inserian commented Mar 4, 2023

NCCL is a pain. I'm assuming you're running this on Windows in conda or a similar environment? The easiest way is to just use the HPC SDK, as it includes NCCL. However, you will most likely have to download the tar from NVIDIA and extract it yourself. Ensure you have full privileges or it won't work.
https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html

@TanaroSch

@Inserian I encounter the same error on Ubuntu 20.04 with the nvidia-hpc-sdk module enabled. Do you know if there might be another error preventing LLaMA from using NCCL?
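One way to narrow that down is to ask the installed PyTorch build directly whether NCCL was compiled in. A small diagnostic sketch (`nccl_status` is a name I made up, not part of LLaMA or torch):

```python
def nccl_status() -> str:
    """Report whether this PyTorch build has the NCCL backend compiled in."""
    try:
        import torch.distributed as dist
    except ImportError:
        return "torch not installed"
    if not dist.is_available():
        return "distributed package not available"
    # is_nccl_available() is False on builds compiled without NCCL support,
    # even if a system-wide NCCL (e.g. from the HPC SDK) is present.
    return "nccl built in" if dist.is_nccl_available() else "nccl missing from this build"

print(nccl_status())
```

If this prints "nccl missing from this build", installing NCCL system-wide won't help; the torch wheel itself has to be one built with NCCL support.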

@Tophness
Author

Tophness commented Mar 7, 2023

I assumed we would just be running the smaller models on our own GPUs without distributed training.
Any chance an RTX 4080 can run 13B if we trade off VRAM for generation time?

@MaximilianDueppe

I had the same issues y'all described. I tried everything I could find, and finally found my problem: if you install PyTorch via conda, the standard package is CPU-only. Here is a link with further information on how to download the GPU variant of PyTorch.

https://pytorch.org/get-started/locally/

I hope this helps at least some of you.
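To verify which variant you actually ended up with, you can check whether the installed build sees CUDA at all. An illustrative snippet (the function name is mine; torch may simply not be installed in a fresh environment):

```python
def torch_build_report() -> str:
    """Summarise the installed PyTorch build: version string and CUDA visibility."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    # CPU-only builds typically carry a "+cpu" version suffix and report
    # cuda available: False even on machines with an NVIDIA GPU.
    return f"torch {torch.__version__}, cuda available: {torch.cuda.is_available()}"

print(torch_build_report())
```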

@WuhanMonkey WuhanMonkey added the model-usage issues related to how models are used/loaded label Sep 6, 2023
@WuhanMonkey

Seems like the issue is resolved by the suggestions above. Please re-open as needed with more detail.
