CUDA 12.1 vllm==0.2.3 Double Free #1930
Comments
Hi @tjtanaa, thanks for reporting the bug! Which model are you using? Is it Mistral?
Yes, I am using OpenHermes-2.5, which is based on Mistral.
It is happening for me as well: CUDA 12.1, vLLM 0.2.6 with Mixtral 8x7B, for long prompts.
@WoosukKwon any tips on this?
+1
+1, same issue here using CUDA/12.1.1, Python/3.10.4-GCCcore-11.3.0, vllm==0.2.3. It happened after 5-10 inferences with a LoRA fine-tuned Mistral 7B model. EDIT: In our case the fine-tuned model was trained with 1024 input tokens; when this was exceeded, it caused the double free error.
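A minimal sketch of the workaround implied by the comment above, assuming the crash is triggered by prompts longer than the input length the fine-tuned model was trained with. The model name, the 1024-token limit, and the prompt are illustrative placeholders, not values confirmed in this issue:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "teknium/OpenHermes-2.5-Mistral-7B"  # example checkpoint; substitute your own
MAX_INPUT_TOKENS = 1024  # the input length the fine-tuned model was reportedly trained with

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)

def truncate_prompt(prompt: str) -> str:
    """Clip the prompt so it never exceeds MAX_INPUT_TOKENS."""
    ids = tokenizer(prompt, truncation=True, max_length=MAX_INPUT_TOKENS)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

params = SamplingParams(temperature=0.7, max_tokens=256)
long_prompt = "..."  # a prompt longer than 1024 tokens would go here
outputs = llm.generate([truncate_prompt(long_prompt)], params)
print(outputs[0].outputs[0].text)
```

This only sidesteps the crash by keeping requests within the trained input length; it does not address the underlying double free in the engine.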
I tried this with FastChat, which uses vLLM as the backend. Both inputs raise the same double free error.