
NCCL error #1726

Closed
maxmelichov opened this issue Nov 20, 2023 · 22 comments
Comments

@maxmelichov commented Nov 20, 2023

I'm trying to load a model with LLM(model="meta-llama/Llama-2-7b-chat-hf") and I'm getting the error below:

DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648
@WoosukKwon (Collaborator)

Hi @maxmelichov, in my experience this error happens when using an old version of PyTorch. Please make sure you are on torch 2.1.0+cu121 by running pip install --upgrade torch.
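
For reference, one quick way to confirm which build is actually installed after upgrading (a sketch; the exact version strings will vary by environment):

pip install --upgrade torch
# print the torch build and the CUDA version it was compiled against
python -c "import torch; print(torch.__version__, torch.version.cuda)"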

@insunkang

Same issue with CUDA 11.8, torch 2.1.0+cu118.

@Grey4sh commented Nov 22, 2023

Same issue with CUDA 12.1, torch 2.1.1+cu121. Did you solve it?

@sanfuyee

I'm running into the same problem.

@NintendoLink

Same error with:
torch 2.1.1
torchaudio 2.1.1
torchvision 0.16.1

@sam-iink

Same issue for me after I updated to 0.2.2 with PyTorch 2.1.1+cu121. 0.2.1-post1 with 2.0.1+cu118 was working.

@jxh4945777

Same issue.

@ywglf commented Nov 28, 2023

Run pip list | grep nccl to check whether you have two NCCL versions installed; remove the unnecessary one.
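
A hedged sketch of that check and cleanup (the package names below are illustrative; keep the one matching your torch CUDA build):

pip list | grep nccl
# example: if both a cu11 and a cu12 build are listed, remove the one that
# does not match your torch build
pip uninstall -y nvidia-nccl-cu11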

@jxh4945777

Thanks, that solved it.

@noelEOS commented Jan 10, 2024

The pip list | grep nccl check solved it for me as well. Thanks!

@BeastyZ commented Jan 17, 2024

The pip list | grep nccl check works for me, too. Thanks!

@goswamig

pip install --upgrade torch solved the issue for me.

torch 2.1.2
torchaudio 2.1.2
torchvision 0.16.2

@iseesaw commented Feb 1, 2024

The pip list | grep nccl tip worked for me as well. Thanks!

@songkq commented Feb 22, 2024

Same error in my environment. Could you please give some advice?

CUDA_VISIBLE_DEVICES="5,7" python3 -m vllm.entrypoints.openai.api_server --model Qwen1.5-72B-Chat-GPTQ-Int4 --port 21002 --gpu-memory-utilization 0.98 --tensor-parallel-size 2

nvidia-nccl-cu12==2.18.1
torch==2.1.2+cu121
vllm==0.3.0 or 0.3.1
File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 217, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 364, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers_ray(placement_group)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 271, in _init_workers_ray
    self._run_workers("init_model")
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/worker/worker.py", line 84, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/worker/worker.py", line 253, in init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/workspace/envs/collie/lib/python3.10/site-packages/torch-2.1.2-py3.10-linux-x86_64.egg/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/torch-2.1.2-py3.10-linux-x86_64.egg/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.16.5
ncclInvalidArgument: Invalid value for an argument.

@jinfengfeng commented Feb 26, 2024

@songkq I got the same issue. Have you solved it?

@songkq commented Feb 26, 2024

Not yet.

@WoosukKwon Could you please give some advice for this issue?

@songkq commented Feb 28, 2024

@jinfengfeng Solved by upgrading to vllm==0.3.2
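
For reference, the corresponding upgrade would be something like:

pip install vllm==0.3.2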

@AlexYoung757

@songkq maybe you should try reinstalling NCCL from https://developer.nvidia.com/nccl/nccl-legacy-downloads:

(1) Update the NCCL local repository:
sudo dpkg -i nccl-local-repo-xxx.deb
(2) Install NCCL:
sudo apt install libnccl2=2.18.1-1+cuda12.1 libnccl-dev=2.18.1-1+cuda12.1
(3) Add NCCL to the library path:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu
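
After reinstalling, one way to sanity-check the NCCL version that PyTorch reports (a sketch; requires a CUDA-enabled PyTorch build):

# expected to print a tuple such as (2, 18, 1)
python -c "import torch; print(torch.cuda.nccl.version())"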

@timbmg commented Apr 29, 2024

Thanks for the suggestion @ywglf. pip list | grep nccl shows me both an nvidia and a vllm NCCL package. Which one is unnecessary in this case?

pip list | grep nccl
nvidia-nccl-cu12          2.19.3
vllm-nccl-cu12            2.18.1.0.4.0

@AlexYoung757

Only nvidia-nccl is necessary.

@ruifengma

What is vllm-nccl-cu12 for?

@hmellor (Collaborator) commented May 31, 2024

vllm-nccl-cu12 was a workaround to pin the NCCL version when we upgraded to PyTorch 2.2.

NCCL 2.19 (which was the new default with PyTorch 2.2) was using much more memory than NCCL 2.18 so we pinned NCCL and proceeded with the PyTorch 2.2 upgrade.

A newer workaround has since been found so vllm-nccl-cu12 is no longer necessary.
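
If an older environment still carries the pinned package, a hedged cleanup sketch (only after upgrading to a vLLM release that no longer depends on it):

pip install --upgrade vllm
pip list | grep nccl
# vllm-nccl-cu12 should no longer be needed; remove it if it is still listed
pip uninstall -y vllm-nccl-cu12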

