[Bug]: v0.5.5 crash: "AssertionError: expected running sequences" #8016
Comments
@zoltan-fedor Thanks for reporting the bug. Could you please try it without prefix caching?
Thanks @WoosukKwon, unfortunately the crash occurred even without the prefix caching flag.
Back to version v0.5.4 again.
Two minutes later the next error:
This seems to be the same error.
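As background to the prefix-caching suggestion above: prefix caching is an opt-in engine option in this version range, and the same toggle is exposed through vLLM's offline Python API. A minimal sketch, using a small placeholder model rather than the 70B model from this report, and assuming the `enable_prefix_caching` engine argument is accepted as a keyword here:

```python
# Minimal sketch: running vLLM with prefix caching explicitly disabled.
# The model below is a small placeholder chosen only so the snippet runs on one GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    enable_prefix_caching=False,  # the suggestion above: rule prefix caching out
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```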
Hi, I had the exact same issue. There is no obvious behavior that triggers it. Below are the options I started vLLM with.
Below is the log when the bug occurred.
@zoltan-fedor, regarding the issues you are seeing with illegal memory access: any further details would help us a lot to reproduce and resolve the issue.
Are these logs from v0.5.4 or v0.5.5? Update: looks like v0.5.5 based on line numbers.
Update: I am able to reproduce the issue sporadically.
Update: @zoltan-fedor, I reproed the issue once, but have not been able to retrigger it.
@robertgshaw2-neuralmagic, sorry, we do not have a way to reproduce it either.
No worries. I am sure it will occur soon and I can look into it further.
Is there anything more you can share about your environment? E.g. can you run vLLM's `collect_env.py` script and share the output?
There isn't much to share. No modification, we run it as-is.
Sounds good, thanks.
Indeed it reproduced. As expected, it is an illegal memory access in flash attention due to prefix caching. I will dig in further. It took about 4 hours to trigger it.
@robertgshaw2-neuralmagic, I have also seen the same error.
So, with the following command?

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
--tensor-parallel-size 4
--gpu-memory-utilization 0.95
--enforce-eager
--trust-remote-code
--worker-use-ray
--dtype half
--max-model-len 32768
That is correct.
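For anyone trying to reproduce this, below is a rough sketch of the same configuration driven through vLLM's offline Python API rather than the server. It is a best-effort translation of the flags above, not the reporter's actual setup: `--worker-use-ray` is omitted (the offline API manages its own workers), and the request loop is only a stand-in for real traffic.

```python
# Sketch of a repro harness mirroring the reported launch flags (see assumptions above).
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95,
    enforce_eager=True,
    trust_remote_code=True,
    dtype="half",
    max_model_len=32768,
)

# The crash is sporadic, so push many requests through rather than a single query.
prompts = [f"Request {i}: write a short summary of GPU memory hierarchies." for i in range(512)]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))
print(f"Completed {len(outputs)} requests")
```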
Thanks. Will run this in the background today and see if I can reproduce. The illegal memory access I had before seemed to occur in attention, so perhaps it is something related to chunked prefill rather than prefix caching (since chunked prefill is on by default if max-model-len > 32k).
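If chunked prefill is the suspect, one way to rule it out is to disable it explicitly instead of relying on the long-context default. A hedged sketch, assuming the `enable_chunked_prefill` engine argument is honored the same way in this version range:

```python
# Sketch: same setup as above, but with chunked prefill forced off to isolate its effect.
from vllm import LLM

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    tensor_parallel_size=4,
    dtype="half",
    max_model_len=32768,
    enforce_eager=True,
    enable_chunked_prefill=False,  # assumption: overrides the >32k default mentioned above
)
```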
I've encountered this error too.
@liulisi16323 can you share your launch command? I ran @zoltan-fedor's launch command for about 1 day (without prefix caching) and have not been able to trigger this issue. @zoltan-fedor, can you share your driver and CUDA version? I will try to make an env that more closely matches yours.
Driver Version: 535.183.01, CUDA Version: 12.4
NVIDIA A800, Driver Version: 525.105.17, Docker image: vllm/vllm-openai:v0.6.0, launch command:
Problem solved in v0.6.1.post1.
Thanks @ashgold, I have upgraded to this latest version and will monitor whether the issue arises again.
I just encountered the same issue (I think) on 0.6.1.post2. I only see the log output below, no stack trace.
No, I still get the error.
Can you share details of the test environment and the errors when they occur? If possible, I would like to reproduce it.
@ashgold At one point, the LLM on one card crashed and exited, and at the same time the requests I sent to the other LLMs also stopped working. It's very strange.
This seems to be a different issue from the one I'm experiencing, and I'd suggest opening a separate bug to follow up on it.
v0.6.1.post2, I still get the error under high-concurrency conditions:
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
Your current environment
Running the standard v0.5.5 Docker image from your Docker Hub repo, without anything additional added to it.
🐛 Describe the bug
When using the Llama 3.1 70B AWQ model running on 4 A10G 24GB GPUs with the launch args quoted earlier in the thread:
vLLM crashes and requires a full restart. Error:
The issue is random; the same query does NOT reproduce it.
We upgraded 6 hours ago and since then this has happened 3 times.
We now need to downgrade and consider v0.5.5 a buggy release.