
NCCL error #1726

Closed
maxmelichov opened this issue Nov 20, 2023 · 22 comments
Comments

@maxmelichov commented Nov 20, 2023

I'm trying to load a model with LLM(model="meta-llama/Llama-2-7b-chat-hf") and I'm getting the error below:

DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648
@WoosukKwon (Collaborator)

Hi @maxmelichov, in my experience this error happens when using an old version of PyTorch. Please make sure you are on torch 2.1.0+cu121 by running pip install --upgrade torch.
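
For reference, one quick way to confirm which build is actually installed after upgrading (a sketch; the exact version strings will vary by environment):

pip install --upgrade torch
# print the torch build and the CUDA version it was compiled against
python -c "import torch; print(torch.__version__, torch.version.cuda)"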

@insunkang

Same issue with CUDA 11.8, torch 2.1.0+cu118.

@Grey4sh commented Nov 22, 2023

Same issue with CUDA 12.1, torch 2.1.1+cu121. Did you solve it?

@sanfuyee

I'm running into the same problem.

@NintendoLink

Same error with:
torch 2.1.1
torchaudio 2.1.1
torchvision 0.16.1

@sam-iink

Same issue for me after I updated to 0.2.2 with PyTorch 2.1.1+cu121. 0.2.1-post1 with 2.0.1+cu118 was working.

@jxh4945777

Same issue.

@ywglf commented Nov 28, 2023

Run pip list | grep nccl to check whether you have two NCCL versions installed; remove the unnecessary one.
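
A hedged sketch of that check and cleanup (the package names below are illustrative; keep the one matching your torch CUDA build):

pip list | grep nccl
# example: if both a cu11 and a cu12 build are listed, remove the one that
# does not match your torch build
pip uninstall -y nvidia-nccl-cu11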

@jxh4945777

Thanks, that solved it.

@noelEOS commented Jan 10, 2024

The pip list | grep nccl check solved it for me as well. Thanks!

@BeastyZ commented Jan 17, 2024

The pip list | grep nccl check works for me, too. Thanks!

@goswamig

pip install --upgrade torch solved the issue for me.

torch 2.1.2
torchaudio 2.1.2
torchvision 0.16.2

@iseesaw commented Feb 1, 2024

The pip list | grep nccl tip worked for me as well. Thanks!

@songkq commented Feb 22, 2024

Same error in my environment. Could you please give some advice?

CUDA_VISIBLE_DEVICES="5,7" python3 -m vllm.entrypoints.openai.api_server --model Qwen1.5-72B-Chat-GPTQ-Int4 --port 21002 --gpu-memory-utilization 0.98 --tensor-parallel-size 2

nvidia-nccl-cu12==2.18.1
torch==2.1.2+cu121
vllm==0.3.0 or 0.3.1
File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 217, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 364, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers_ray(placement_group)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 271, in _init_workers_ray
    self._run_workers("init_model")
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/worker/worker.py", line 84, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/vllm/worker/worker.py", line 253, in init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/workspace/envs/collie/lib/python3.10/site-packages/torch-2.1.2-py3.10-linux-x86_64.egg/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/workspace/miniconda3/envs/collie/lib/python3.10/site-packages/torch-2.1.2-py3.10-linux-x86_64.egg/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.16.5
ncclInvalidArgument: Invalid value for an argument.

@jinfengfeng commented Feb 26, 2024

@songkq I got the same issue. Have you solved it?

@songkq commented Feb 26, 2024

Not yet.

@WoosukKwon Could you please give some advice for this issue?

@songkq commented Feb 28, 2024

@jinfengfeng Solved by upgrading to vllm==0.3.2
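
For reference, the corresponding upgrade would be something like:

pip install vllm==0.3.2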

@AlexYoung757

@songkq maybe you should try reinstalling NCCL from https://developer.nvidia.com/nccl/nccl-legacy-downloads:

(1) Update the NCCL local repository:
sudo dpkg -i nccl-local-repo-xxx.deb
(2) Install NCCL:
sudo apt install libnccl2=2.18.1-1+cuda12.1 libnccl-dev=2.18.1-1+cuda12.1
(3) Add NCCL to the library path:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu
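
After reinstalling, one way to sanity-check the NCCL version that PyTorch reports (a sketch; requires a CUDA-enabled PyTorch build):

# expected to print a tuple such as (2, 18, 1)
python -c "import torch; print(torch.cuda.nccl.version())"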

@timbmg commented Apr 29, 2024

Thanks for the suggestion @ywglf. pip list | grep nccl shows me both an nvidia and a vllm NCCL package. Which one is unnecessary in this case?

pip list | grep nccl
nvidia-nccl-cu12          2.19.3
vllm-nccl-cu12            2.18.1.0.4.0

@AlexYoung757

Only nvidia-nccl is necessary.

@ruifengma

What is vllm-nccl-cu12 for?

@hmellor (Collaborator) commented May 31, 2024

vllm-nccl-cu12 was a workaround to pin the NCCL version when we upgraded to PyTorch 2.2.

NCCL 2.19 (which was the new default with PyTorch 2.2) was using much more memory than NCCL 2.18 so we pinned NCCL and proceeded with the PyTorch 2.2 upgrade.

A newer workaround has since been found so vllm-nccl-cu12 is no longer necessary.
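
If an older environment still carries the pinned package, a hedged cleanup sketch (only after upgrading to a vLLM release that no longer depends on it):

pip install --upgrade vllm
pip list | grep nccl
# vllm-nccl-cu12 should no longer be needed; remove it if it is still listed
pip uninstall -y vllm-nccl-cu12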

