[Bug]: Engine iteration timed out. This should never happen! #4430
Comments
How easy is it to reproduce the issue? Also, is it possible to reproduce it with CUDA_LAUNCH_BLOCKING=1 and show us the line? |
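(Not from the original thread, just for context: a minimal sketch of how one might set this variable for a local repro. The model name and prompt are placeholders; the key point is that CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized, i.e. before torch/vLLM are imported.)

```python
# Minimal repro sketch (assumed setup, not the reporter's actual command).
# With CUDA_LAUNCH_BLOCKING=1, kernel launches become synchronous, so a hang
# or crash surfaces at the offending Python line instead of somewhere later.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from vllm import LLM, SamplingParams  # import only after setting the env var

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)  # placeholder model
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```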
It's about 1 in 10, I think. It seemed to be very random; per our observation it was not directly caused by request concurrency or prompt length. |
We just tried. Here's the stacktrace with the env variable
And this time, the thread stack is different:
|
FYI, we actually deployed several instances. They're running on different envs. The following instances have been running for more than 5 days without any problem:
The problematic instance was running on:
vLLM 0.3.3 could run on the same env without any problem. We replaced the host with one with the same env (A800 + docker 20.10.22 + nvidia-container-toolkit 1.8.0) and the problem still existed. The GPU drivers of all the above hosts are |
Can you also share the stacktrace of workers that are not stuck? (Or are all workers stuck at the same line?) Also, is there code I can use to try reproducing it in our env? |
Not sure whether the following is what we need, but here we go
So I guess I need to dump the thread stacks of the processes with the following PIDs:
PID 7065
PID 7155
PID 7310
PID 7457
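(Editorial note, not part of the original comment: the thread doesn't say which tool was used to dump these stacks. `py-spy dump --pid <PID>` is one common option for a running process; the sketch below shows a stdlib alternative that has to be wired into the server code beforehand.)

```python
# Minimal sketch (assumption, not the reporter's actual tooling): register a
# signal handler inside the vLLM server/worker process so that thread stacks
# can be dumped on demand with `kill -USR1 <PID>` while the engine appears hung.
import faulthandler
import signal

# Dump the tracebacks of all threads to stderr whenever SIGUSR1 is received.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```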
|
We were sending requests directly to the vLLM container. The container was started with the following command:
Our testing prompt was in Chinese. Other prompts can also trigger the bug, and it appears that the longer the prompt is, the greater the likelihood of the bug occurring.
In our test, we could see the bug after we repeatedly sent the same request around 10 times. The requests were sent one after another from the same console, without any parallelism. |
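(Illustrative sketch only, approximating the repro loop described above; the endpoint URL, model name, and prompt are placeholders, not values from the report.)

```python
# Send the same long prompt ~10 times, strictly one after another (no
# parallelism), and watch for a request that never comes back.
import requests

URL = "http://localhost:8000/v1/completions"   # assumed OpenAI-compatible endpoint
PAYLOAD = {
    "model": "my-model",                        # placeholder model name
    "prompt": "<long Chinese prompt here>",     # placeholder prompt
    "max_tokens": 256,
}

for i in range(10):
    resp = requests.post(URL, json=PAYLOAD, timeout=600)
    resp.raise_for_status()
    text = resp.json()["choices"][0]["text"]
    print(f"request {i} finished, {len(text)} chars generated")
```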
Hmm, it is actually interesting that PID 7065 is running nothing. It might be the root cause of the hang, since around that logit access code all the workers need to make the same call. I am curious whether there was any exception from Python showing up in any of your logs? |
also one interesting thing is you use |
The |
Just removed |
We also encountered the same issue when deploying Qwen 72B with tensor parallelism. Inspired by these interesting findings, we discovered something further: while deploying with tp=4, CPU RAM usage continuously increases. However, when the model is deployed on a single GPU (without tensor parallelism), CPU RAM usage remains steady, even after processing thousands of requests. |
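(Editorial sketch, not from the commenter: one simple way to track the RSS growth described above. It assumes `psutil` is installed and that you substitute the PID of the vLLM server or one of its TP workers.)

```python
# Log the resident set size of a vLLM process once a minute, to see whether
# CPU RAM keeps growing under tp=4 but stays flat on a single GPU.
import time
import psutil

PID = 7065  # placeholder: a vLLM server/worker PID from `ps` or `nvidia-smi`

proc = psutil.Process(PID)
while True:
    rss_mb = proc.memory_info().rss / (1024 ** 2)
    print(f"{time.strftime('%H:%M:%S')} rss={rss_mb:.1f} MiB")
    time.sleep(60)
```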
Can you create a separate issue for this? Regarding the issue itself, can you try the latest master? There was one issue that caused blocking which we recently fixed in #4557; just want to make sure that was not the root cause. I will have time to try to repro the issue this week. |
FYI - this might have something to do with the custom all-reduce operation. We have observed this same issue, but it went away after specifying --disable-custom-all-reduce. |
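(Editorial sketch: the thread only gives the CLI flag; the offline-engine equivalent below assumes the `LLM` constructor forwards `disable_custom_all_reduce` to the engine arguments, with placeholder model and TP size.)

```python
# Workaround from this thread: disable the custom all-reduce kernel so vLLM
# falls back to the regular NCCL all-reduce. On the CLI this is
# `--disable-custom-all-reduce` on the api_server command line.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen-14B-Chat",       # placeholder model
    tensor_parallel_size=4,           # placeholder TP degree
    disable_custom_all_reduce=True,   # skip the custom all-reduce path
)
```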
Before this, the NCCL watchdog error happened several times per day. |
@changyuanzhangchina do you need to set all of them, or does just one of them fix it? I wonder if 3 by itself is sufficient to fix the issue, especially |
Condition 3 is most likely the root cause. For condition 1, we don't know whether there is any problem across all the serving envs, so we also disable it. As for condition 2, we found a memory leak several months ago, though that may have since been fixed. |
Using --disable-custom-all-reduce=True, the vLLM service cannot be started. |
you mean using that flag gives you the error? |
I encountered the same error when using the command "vllm.entrypoints.openai.api_server" to start my server. The model server would get stuck every time I posted some concurrent and long requests. But I temporarily got it solved by rewriting the OpenAI-format HTTP services and downgrading vLLM to 0.3.0 (formerly 0.4.0), i.e., I initialized my vLLM server using AsyncLLMEngine and wrote my own OpenAI-format routers as my entrypoint. I have tried nearly one hundred requests and the error seems to have disappeared, though I'm not sure where the problem lies. I'll continuously share updates if anything new comes up. |
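(Rough sketch of the approach described above as I understand it, not the commenter's actual code: build the engine directly with AsyncLLMEngine and expose a minimal completions-style route via FastAPI. The model path, TP size, and route shape are placeholders.)

```python
import uuid

from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="/path/to/model", tensor_parallel_size=4)  # placeholders
)

@app.post("/v1/completions")
async def completions(body: dict):
    params = SamplingParams(max_tokens=body.get("max_tokens", 256))
    request_id = str(uuid.uuid4())
    final = None
    # engine.generate is an async generator yielding incremental RequestOutputs;
    # keep only the last one for a non-streaming response.
    async for output in engine.generate(body["prompt"], params, request_id):
        final = output
    return {"choices": [{"text": final.outputs[0].text}]}

# run with e.g.: uvicorn my_server:app --host 0.0.0.0 --port 8000
```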
It'd also be great to try the latest master to see if it fixes the issue (or wait until 0.4.3 is released), because #4557 could be the root cause if you see hangs with long context sizes. |
Disabling the custom all-reduce functionality with the --disable-custom-all-reduce flag resolved the issue for us. It's worth noting that this issue might not be directly related to #4557, for a couple of reasons:
|
@itechbear Glad that this does resolve your issue - I suspect it has something to do with the topology of the GPUs when not all of them in the box are used for serving. |
I found my way here while tracking down the same problem. On version 0.5.0.post1, with a Docker image built from the repo's Dockerfile and Qwen-14B running on 4x 4090 GPUs, --disable-custom-all-reduce did not solve my problem.
Today is July 5th. The above three strategies cannot solve this problem. Are there any other tips for dealing with it?
For us, these three are enough.
|
|
Can this problem be fixed? |
UPDATE on 2024-05-23
Workaround: Use the --disable-custom-all-reduce flag when starting the vLLM instance. Thanks @ywang96!
The following is the original post.
🐛 Describe the bug
Summary
A model execution thread mysteriously hangs at _random_sample (vllm/model_executor/layers/sampler.py:292) during inference; the corresponding code at that line is random_samples = random_samples.cpu()
What happened
We upgraded vLLM from v0.3.3 to 0.4.x, but found that vLLM occasionally got stuck and refused to serve requests. From the vLLM log, we saw that a request never finished. After digging deeper, we found that it was because a thread got stuck during execution.
Your current environment
vLLM was running inside a Docker container. The following was collected from inside the container.
Stacktrace
Request Log
The "Running: 1 reqs" count never changed back to "Running: 0 reqs".
NCCL Error
After some time, it complained that there was an NCCL timeout issue.
Thread stack
We dumped the thread stacks and found that it got stuck during sampling.
Host software
GPUs: A800 x 8 (single node, multi-GPU)
NVIDIA Driver: 525.85.12
NVIDIA GPU plugin related software: