[Bug]: LLM is not getting loaded on multiple GPUs but works fine on a single GPU #3974
Comments
@venki-lfc when stalled, what does nvidia-smi report for GPU %load and memory usage?
Same here. Both for 0.3.3 and 0.4.0
Thanks @venki-lfc, this matches my experience with 0.4.0.post1 on an 8x H100-PCIe system: I only see this behavior for 4 specific GPUs; other configurations (e.g. 1, 2, or 8 GPUs) appear unaffected even when they use the same hardware. I suspect there's some sort of NCCL race/deadlock, triggered by differences in PCIe bus layout between GPUs 0/1/2/3 (0/1 and 2/3 are separated by multiple PCIe hops) and GPUs 4/5/6/7 (3 of which share a common PCIe switch).
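For anyone trying to reproduce, a small sketch for dumping the interconnect layout - it just wraps nvidia-smi, and the topology matrix it prints is what the PCIe-hop observation above is based on:

```python
import subprocess

# Print the GPU interconnect topology matrix; entries such as PIX, PXB, PHB,
# NODE and SYS indicate how many switches/hops separate each pair of GPUs.
subprocess.run(["nvidia-smi", "topo", "-m"], check=True)
```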
So far we're seeing this on AMD and Intel CPUs and Ada/Hopper GPUs (my collect_env output is at #3892). Testing various releases of the stock Docker containers with Llama2-70B, I see: v0.4.0 hangs (nvidia-nccl-cu12 2.18.1). It was straightforward to test by swapping containers, but I won't have time to perform a full bisect/rebuild of 0.3.3->0.4.0 for a few weeks.
@agt did you change the nccl version via VLLM_NCCL_SO_PATH?
@youkaichao That's the version shipped in https://hub.docker.com/r/vllm/vllm-openai - happy to swap in a new version via VLLM_NCCL_SO_PATH, which would you suggest?
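For reference, a minimal sketch of how the swap could look - the library path below is purely hypothetical, and this assumes vLLM reads VLLM_NCCL_SO_PATH from the environment before its workers start:

```python
import os

# Hypothetical path to an alternative libnccl build; adjust to wherever the
# desired NCCL version is installed inside the container.
os.environ["VLLM_NCCL_SO_PATH"] = "/opt/nccl/lib/libnccl.so.2"

from vllm import LLM  # import after setting the env var so workers inherit it

llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)
```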
I just did a
I just found a solution that works!
This works for me now :)
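The exact snippet isn't captured above; judging from the reply below about disabling P2P, it presumably amounted to something like this sketch, which turns off NCCL peer-to-peer transfers before loading the model (the model name is just a placeholder):

```python
import os

# Work around the multi-GPU load hang by disabling NCCL peer-to-peer transfers.
# NCCL_P2P_DISABLE is a standard NCCL environment variable.
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2)
```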
@venki-lfc glad to hear! Disabling P2P will hurt performance, so I'd like to keep pursuing this - want to keep this issue open, or should I create a new one?
I guess we can keep the issue open :) Obviously mine is just a workaround and doesn't address the root cause of the issue.
Ahh - 2.17.1 was the system NCCL installed under /usr/lib; the PyPI version was hiding under 'libnccl.so.2' and is indeed NCCL 2.18.1+cuda12.1. That's consistent with the PyTorch 2.1.2 requirements.
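For anyone else untangling which NCCL build is actually in play, a quick hedged check (the first line reports the version PyTorch was built against, which may differ from the system copy under /usr/lib):

```python
import ctypes.util

import torch

# NCCL version that this PyTorch build uses, e.g. (2, 18, 1).
print(torch.cuda.nccl.version())

# Whether a system-wide libnccl is visible to the dynamic linker
# (prints a soname such as 'libnccl.so.2', or None if nothing is registered).
print(ctypes.util.find_library("nccl"))
```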
Good job! nccl is quite a black box, and we have a hard time with it :(
I've tested on both 8xH100 and 8xA100-40GB and cannot seem to load a model on even
@youkaichao Thank you for #4079 - throwing that into my 0.4.0.post1 container, I found that the 3 non-lead workers were all stuck within CustomAllreduce._gather_ipc_meta(), despite the code's intention to disable CustomAllreduce because "it's not supported on more than two PCIe-only GPUs". Last call logged across the various processes before the stall (full trace @ vllm_trace_frame_for_process.tgz):
Launching with custom all-reduce explicitly disabled - @venki-lfc @nidhishs, would you mind checking whether that fixes things for you? (I believe disabling custom all-reduce should hurt performance less than disabling P2P altogether.)
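A hedged sketch of what that looks like via the Python API - the engine arguments expose a disable_custom_all_reduce option, and the model/TP values below are placeholders:

```python
from vllm import LLM

# Turn off the custom all-reduce kernel while leaving NCCL P2P enabled.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model
    tensor_parallel_size=4,
    disable_custom_all_reduce=True,
)
```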
Hello @agt , |
[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (vllm-project#3974) (vllm-project#4159)
Hi @venki-lfc, sorry to hear that didn't work! 0.4.1 will include an option to log all function calls, perhaps doing so will identify the culprit as it did for me. I'd be happy to review if you post that info in a new bug. |
I am using the command below, with the following output:
My environment:
Your current environment
🐛 Describe the bug
When I try to load the model using the following command:
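(The original snippet isn't reproduced here; a representative sketch of a two-GPU load, with the model name as a placeholder:)

```python
from vllm import LLM

# Placeholder model; the relevant part is tensor_parallel_size=2,
# which shards the model across two GPUs.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2)
```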
The model is never loaded; I get the following information on the CLI and nothing more - the loading never finishes.
I can see that the 2 GPU devices are occupied while the above message is displayed, but nothing else happens. The line of code is never fully executed.
When I try to load the model using only one GPU, the loading process is smooth.
Below is the screenshot of the successful loading message:
The LLM inference is quite fast and everything works as expected.
So the problem clearly lies with multiple GPUs. This issue happens with all models and is not particular to just one organisation. Can someone please help me in this regard? What am I doing wrong? Is it something due to nccl, or is something missing?
Any help is appreciated, thanks :)