
RuntimeError: Distributed package doesn't have NCCL built in #112

Closed
qsimeon opened this issue Mar 4, 2023 · 30 comments
Labels
model-usage issues related to how models are used/loaded

Comments

@qsimeon

qsimeon commented Mar 4, 2023

I was able to download the 7B weights on Mac OS Monterey. I get the following errors when I try to call the example from the README in my Terminal: torchrun --nproc_per_node 1 example.py --ckpt_dir download/model_size --tokenizer_path download/tokenizer.model

RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51512) of binary: /Users/username/opt/anaconda3/envs/pytorch/bin/python
.
.
.
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-04_14:30:38
  host      : COMPUTER.tld
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 51512)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@Inserian

Inserian commented Mar 4, 2023

You will have to manually add NCCL. Make sure you have full privileges before choosing your install from NVIDIA. The HPC-SDK is easiest, but downloading the tar and extracting it to /usr/local works the same.
https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
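
For anyone unsure whether their build ships with NCCL at all, here is a quick check from Python (a minimal sketch; both helpers are part of torch.distributed):

    # Check which communication backends this PyTorch build supports.
    # macOS and Windows wheels are built without NCCL.
    import torch
    import torch.distributed as dist

    print("NCCL available:", dist.is_nccl_available())
    print("Gloo available:", dist.is_gloo_available())
    print("CUDA available:", torch.cuda.is_available())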

@tekspirit

I am on a Mac Pro with an M1 Max (20 GPU cores) - any idea how to resolve the NCCL issue given that no NVIDIA cards are installed?

@tekspirit

I am on a Mac Pro with an M1 Max (20 GPU cores) - any idea how to resolve the NCCL issue given that no NVIDIA cards are installed?

I ran print(torch.backends.mps.is_built()) and it returns True, but when I set torch.distributed.init_process_group("mps") in example.py and run it, it complains that mps cannot be found:
ValueError: Invalid backend: 'mps'
Any ideas for getting the backend to run on M1?

@Inserian

Inserian commented Mar 5, 2023

You can't resolve NCCL issues without NVIDIA hardware. There are other backends that can be used in place of NCCL, and there are also other libraries that allow parallel workarounds, but I haven't bothered with them yet.
As far as torchrun goes, it looks like you didn't input your MP value? If that still doesn't work, try python -m torch.distributed.run --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model (editing the MP value, of course).
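
For example, here is a minimal sketch of initializing with the gloo backend instead of NCCL (the TCP address is just a placeholder for a single local process; note that "mps" is a compute device, not a valid communication backend):

    import torch
    import torch.distributed as dist

    # gloo handles inter-process communication and works without NVIDIA hardware
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)

    # mps is only where tensors live and where compute happens
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    x = torch.ones(3, device=device)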

@andrewssobral

Hello guys,
I am also interested in how to run LLaMA (e.g. the 7B model) on a Mac M1 or M2. Any solution?

@Eurus-Holmes

same issue

@tekspirit

me too!

@bcouetil

I'm on a MacBook Pro M1 2022 and have the same problem.

@bdabykov

bdabykov commented Apr 5, 2023

Did anyone find out how to solve this error?
I am having the same issue here.

@bcouetil

bcouetil commented Apr 5, 2023

For macOS, we have to use the C++ implementation: https://github.com/ggerganov/llama.cpp

Works like a charm on my side, with the 3 models that fit in my RAM ✌️

@signalprime

For macOS, we have to use the C++ implementation: https://github.com/ggerganov/llama.cpp

Works like a charm on my side, with the 3 models that fit in my RAM ✌️

Is it utilizing MPS acceleration from the M1/M2 chip?

@AngelTs

AngelTs commented May 11, 2023

I also have the NCCL error:
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

@Sunjung-Dev

I have the same problem ... I'm using an M1 Pro.

@araby123

araby123 commented Jul 19, 2023

Same issue here on a MacBook Pro M1 with 16 GB:

raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

@bcouetil

For macOS, we have to use the C++ implementation: https://github.com/ggerganov/llama.cpp
Works like a charm on my side, with the 3 models that fit in my RAM ✌️

Is it utilizing MPS acceleration from the M1/M2 chip?

It utilizes my iGPU to its fullest, and not much CPU, if that is your question.

@byronrode

There is a bit of customisation required, at minimum to the newer model.py and generation.py files.

You need to register the mps device (device = torch.device('mps')) and then reference that in a few places, as well as change .cuda() calls to .to(device).

torch.distributed.init_process_group("gloo") is another change to make, from nccl.

There are also a number of other cuda references in torch that have to change, including tensors.
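
A sketch of those edits (a hypothetical snippet illustrating the pattern, not the exact upstream code):

    import torch
    import torch.distributed

    # register the mps device once and reuse it everywhere .cuda() was used
    device = torch.device("mps")

    # before: torch.distributed.init_process_group("nccl")
    # (assumes torchrun has set MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE)
    torch.distributed.init_process_group("gloo")

    # before: tokens = torch.full((bsz, total_len), pad_id).cuda().long()
    tokens = torch.full((4, 128), 0).to(device).long()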

@sixian-C

I have the same error when running torchrun --nproc_per_node 1 example.py --ckpt_dir download/model_size --tokenizer_path download/tokenizer.model in my Windows 11 conda environment. Any solution?

@aggiee

aggiee commented Jul 21, 2023

I was able to run Llama 2 7B on Mac M2 with (https://github.com/aggiee/llama-v2-mps)

@3zerevelt

3zerevelt commented Jul 21, 2023

I was able to run Llama 2 7B on Mac M2 with (https://github.com/aggiee/llama-v2-mps) @aggiee

Your code returns an error message indicating that the function torch.polar() is not implemented for the Metal Performance Shaders (MPS) backend. I'm also running on an M2 Mac.

@g8gg

g8gg commented Jul 22, 2023

I have the same problem ... I'm using an M1 Pro.

@aggiee

aggiee commented Jul 22, 2023

If you are referring to the following message, it is expected. It's due to M1/M2/MPS not supporting the polar.out operator. It falls back to the CPU for that specific operation, and the warning is to inform the user about it:
"UserWarning: The operator 'aten::polar.out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_1aidzjezue/croot/pytorch_1687856425340/work/aten/src/ATen/mps/MPSFallback.mm:11.)
freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
"

The solution is to set the PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable when running this code. That should make it work (you will still see the user warning about polar.out, but the code should run past that).
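
For example, reusing the command from the original report:

    PYTORCH_ENABLE_MPS_FALLBACK=1 torchrun --nproc_per_node 1 example.py --ckpt_dir download/model_size --tokenizer_path download/tokenizer.model

or, equivalently, at the very top of the script before torch is imported:

    import os
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    import torch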

@araby123

If you are referring to the following message, it is expected. It's due to M1/M2/MPS not supporting polar.out operator. It falls back to CPU for that specific operation and the warning is to inform the user about it: "UserWarning: The operator 'aten::polar.out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_1aidzjezue/croot/pytorch_1687856425340/work/aten/src/ATen/mps/MPSFallback.mm:11.) freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64"

The solution is to set PYTORCH_ENABLE_MPS_FALLBACK=1 env variable to run this code. That should make it work (you will still see the user warning about polar.out, but the code should run past that)

This works, but performance is much too slow.

@REASY

REASY commented Jul 27, 2023

In case you run Windows 10 like me: I had the same RuntimeError: Distributed package doesn't have NCCL built in error. To fix it, I checked the code of the Llama class https://github.com/facebookresearch/llama/blob/6c7fe276574e78057f917549435a2554000a876d/llama/generation.py#L61-L62 and saw how torch.distributed is initialized. One can check all possible backends at distributed.html#torch.distributed.init_process_group. I changed the code to initialize it with the gloo backend: dist.init_process_group(backend="gloo")

git diff:

Index: example_text_completion.py
===================================================================
diff --git a/example_text_completion.py b/example_text_completion.py
--- a/example_text_completion.py	(revision 6c7fe276574e78057f917549435a2554000a876d)
+++ b/example_text_completion.py	(date 1690453793087)
@@ -5,6 +5,9 @@
 
 from llama import Llama
 
+import torch
+import torch.distributed as dist
+
 
 def main(
     ckpt_dir: str,
@@ -15,6 +18,8 @@
     max_gen_len: int = 64,
     max_batch_size: int = 4,
 ):
+    dist.init_process_group(backend="gloo")
+
     generator = Llama.build(
         ckpt_dir=ckpt_dir,
         tokenizer_path=tokenizer_path,
@@ -52,4 +57,5 @@
 
 
 if __name__ == "__main__":
+    print("Cuda support:", torch.cuda.is_available(),":", torch.cuda.device_count(), "devices")
     fire.Fire(main)

After that change I was able to run

(base) H:\github\facebook\llama>torchrun --standalone --nnodes=1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4
NOTE: Redirects are currently not supported in Windows or MacOs.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Cuda support: True : 1 devices
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 13.40 seconds
I believe the meaning of life is
> to be happy. I believe we are all born with the potential to be happy. The meaning of life is to be happy, but the way to get there is not always easy.
The meaning of life is to be happy. It is not always easy to be happy, but it is possible. I believe that

==================================

Simply put, the theory of relativity states that
> 1) time, space, and mass are relative, and 2) the speed of light is constant, regardless of the relative motion of the observer.
Let’s look at the first point first.
Relative Time and Space
The theory of relativity is built on the idea that time and space are relative

==================================

A brief message congratulating the team on the launch:

        Hi everyone,

        I just
> wanted to say a big congratulations to the team on the launch of the new website.

        I think it looks fantastic and I'm sure it will be a huge success.

        I look forward to working with you all on the next project.

        Best wishes



==================================

Translate English to French:

        sea otter => loutre de mer
        peppermint => menthe poivrée
        plush girafe => girafe peluche
        cheese =>
> fromage
        fish => poisson
        giraffe => girafe
        elephant => éléphant
        cat => chat
        giraffe => girafe
        elephant => éléphant
        cat => chat
        giraffe => gira

==================================

Make sure you have enough RAM and GPU RAM. My RAM consumption when the model is loaded: (screenshot omitted)

GPU RAM:

Thu Jul 27 18:48:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090      WDDM  | 00000000:0C:00.0  On |                  Off |
| 30%   40C    P2             151W / 450W |  15160MiB / 24564MiB |     53%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

If you get an OOM error like the one below, even though you have enough GPU RAM:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 23.99 GiB total capacity; 7.55 GiB already allocated; 14.84 GiB free; 7.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 21068) of binary: c:\Users\User\miniconda3\python.exe

make sure that you actually have enough RAM. You can modify the PageFile to use disk as memory, see https://gist.github.com/REASY/567c48e021288df505140cad7e4562ab?permalink_comment_id=4650490#gistcomment-4650490
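
The error text itself also suggests tuning the allocator; a hedged example for Windows (the 128 MiB split size is only a starting point to tune for your GPU):

    set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
    torchrun --standalone --nnodes=1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4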

Note: I also had to fix torchrun; one can modify torchrun-script.py to make it work. In my case I use miniconda, the full path is c:\Users\User\miniconda3\Scripts\torchrun-script.py, and I had to fix the first line of that file to point to the full path of the Python shipped with miniconda:

#!c:\Users\User\miniconda3\python.exe

My env, gathered via python -m torch.utils.collect_env:

(base) H:\github\facebook\llama>python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro N
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.4 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 13:47:18) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 536.67
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=3493
DeviceID=CPU0
Family=107
L2CacheSize=8192
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3493
Name=AMD Ryzen 9 3950X 16-Core Processor
ProcessorType=3
Revision=28928

Versions of relevant libraries:
[pip3] numpy==1.25.0
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2+cu117
[pip3] torchvision==0.15.2+cu117
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.8.0               hd77b12b_0
[conda] mkl                       2023.1.0         h8bd8f75_46356
[conda] mkl-service               2.4.0           py311h2bbff1b_1
[conda] mkl_fft                   1.3.6           py311hf62ec03_1
[conda] mkl_random                1.2.2           py311hf62ec03_1
[conda] numpy                     1.25.1                   pypi_0    pypi
[conda] numpy-base                1.25.0          py311hd01c5d8_0
[conda] pytorch                   2.0.1           py3.11_cuda11.8_cudnn8_0    pytorch
[conda] pytorch-cuda              11.8                 h24eeafa_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch                     2.0.1                    pypi_0    pypi
[conda] torchaudio                2.0.2+cu117              pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi

@MDFARHYN

MDFARHYN commented Jul 30, 2023

Just initialize with torch.distributed.init_process_group("gloo"): go to the generation.py file and find the following lines

 if not torch.distributed.is_initialized():
            if device == "cuda":
                torch.distributed.init_process_group("nccl")
            else:
                torch.distributed.init_process_group("gloo")

change them to

 if not torch.distributed.is_initialized():
            # always use gloo here, since this PyTorch build has no NCCL support
            torch.distributed.init_process_group("gloo")

@WuhanMonkey

Seems like the issue was resolved with the suggestions above. Feel free to re-open as needed. Closing.

@WuhanMonkey WuhanMonkey added the model-usage issues related to how models are used/loaded label Sep 6, 2023
@dunanyang

Why do we still not have a solution to this error?

@psmyrdek

I've been able to start execution after applying changes similar to https://github.com/facebookresearch/codellama/pull/18/files

@pianistprogrammer

https://github.com/pianistprogrammer/llama3/tree/main - get this one and clone the repo; I have made changes to some files to make it work. You can find them in the commit tree.

@haruelrovix

Hey @pianistprogrammer 👋🏻

I tried your fork but got an error:

RuntimeError: Placeholder storage has not been allocated on MPS device!

It's an M1 Pro. Any clue what the issue is?


Full logs:

(base) ➜  llama3-pianist git:(main) ✗ PYTORCH_ENABLE_MPS_FALLBACK=1 torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir Meta-Llama-3-8B-Instruct/ --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model --max_seq_len 128 --max_batch_size 4
W0513 11:17:12.135000 8470690496 torch/distributed/elastic/multiprocessing/redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/miniconda3/lib/python3.12/site-packages/torch/__init__.py:747: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:433.)
  _C._set_default_tensor_type(t)
Loaded in 37.49 seconds
[rank0]: Traceback (most recent call last):
[rank0]:   File "/llama3-pianist/example_text_completion.py", line 64, in <module>
[rank0]:     fire.Fire(main)
[rank0]:   File "/opt/miniconda3/lib/python3.12/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/miniconda3/lib/python3.12/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/miniconda3/lib/python3.12/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/llama3-pianist/example_text_completion.py", line 51, in main
[rank0]:     results = generator.text_completion(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/llama3-pianist/llama/generation.py", line 282, in text_completion
[rank0]:     generation_tokens, generation_logprobs = self.generate(
[rank0]:                                              ^^^^^^^^^^^^^^
[rank0]:   File "/opt/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/llama3-pianist/llama/generation.py", line 201, in generate
[rank0]:     logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/llama3-pianist/llama/model.py", line 291, in forward
[rank0]:     h = self.tok_embeddings(tokens)
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/miniconda3/lib/python3.12/site-packages/fairscale/nn/model_parallel/layers.py", line 136, in forward
[rank0]:     output_parallel = F.embedding(
[rank0]:                       ^^^^^^^^^^^^
[rank0]:   File "/opt/miniconda3/lib/python3.12/site-packages/torch/nn/functional.py", line 2264, in embedding
[rank0]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Placeholder storage has not been allocated on MPS device!
E0513 11:17:57.237000 8470690496 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 5741) of binary: /opt/miniconda3/bin/python
Traceback (most recent call last):
  File "/opt/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_text_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-13_11:17:57
  host      : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5741)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@pianistprogrammer

I'm sorry about that; I have written a blog post on how to run it locally: https://questionbump.com/question/how-can-i-run-chatgpt-using-llms-locally/
