ValueError: Double free! #1556
Comments
Hi @mshumer, could you provide a reproducible example?
Same issue here, but it doesn't reproduce every time. Restarting vllm fixes it. I'm on an Nvidia 3090, and nvidia-smi shows Driver Version: 530.30.02 and CUDA Version: 12.1. Thanks.
Has a fix to this problem been found?
Having the same problem here; not sure of the concrete reason for this bug.
Stumbled upon the same issue today. Using 0.3.2 with TheBloke/openchat-3.5-0106-AWQ, 1 GPU, 24 GB. Ideas:
Related to #1584.
Related to #2271.
And this one: #1930.
@WoosukKwon this still seems to be the case in 0.3.2. Any advice on what to check, or how to figure out what could cause it? In #1584 it seems something was "fixed".
@WoosukKwon I can reproduce it even after a restart. Model: OpenChat 3.5 0106.
Lowering the max_tokens amount didn't help.
Btw, it seems that when this happens, a request keeps pending. After 10 failed generations vllm is dead. We have also spotted that it seems to only happen for requests with sampling params n=3, best_of=3. We have not seen it with n=1, best_of=1.
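For reference, a request along these lines would exercise that parameter combination against a running vLLM OpenAI-compatible server. This is a hedged sketch, not the exact payload from this thread; the endpoint URL, model id, and prompt are placeholders.

```python
# Hedged sketch: sends a completion request with n=3, best_of=3 to a locally
# running vLLM OpenAI-compatible server. URL, model id, and prompt are
# placeholders, not the exact values used in this thread.
import requests

payload = {
    "model": "openchat/openchat-3.5-1210",  # assumed model id; adjust to your deployment
    "prompt": "Write a short poem about GPUs.",
    "max_tokens": 128,
    "n": 3,          # the combination reported to trigger the double free
    "best_of": 3,
}

response = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
response.raise_for_status()
for choice in response.json()["choices"]:
    print(choice["text"])
```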
At least I could get rid of the hanging requests, which only happen if you use Ray (--worker-use-ray). When not using Ray, with just one GPU, it seems to be more forgiving.
I didn't see this one before I opened mine, but it seems it's the same thing: #3295.
@manzke could you please provide:
I'm on it, trying to reproduce it with pure curl commands.
Attached the curl to reproduce it. vllm is started in a docker container with:
Ok, so when I set n=1, best_of=1 it works. When I have n=3, best_of=3 it fails.
Machine is an A10G (g5.2xlarge and similar); the GPU has 24 GB. Also reproduced on a local 3090.
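The same parameter combination can also be exercised through vLLM's offline API. A hedged sketch, assuming a 0.3.x-era install where SamplingParams still accepts best_of; the model id and prompt are placeholders and this is not code taken from the thread.

```python
# Hedged sketch: exercises the same n/best_of combinations through vLLM's
# offline API. Model id and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="openchat/openchat-3.5-1210")  # assumed model id

params_ok = SamplingParams(n=1, best_of=1, max_tokens=128)   # reported to work
params_bad = SamplingParams(n=3, best_of=3, max_tokens=128)  # reported to trigger the error

for params in (params_ok, params_bad):
    outputs = llm.generate(["Write a short poem about GPUs."], params)
    for output in outputs:
        for completion in output.outputs:
            print(completion.text)
```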
Seems @br3no spotted the cause. We are using OpenChat 3.5-1210, which was trained on Mistral v0.1. At that time config.json had a sliding window of 4096. In Mistral v0.2 they "fixed" it, because Mistral doesn't support it at all. I'm building a vllm image with a patched model, where the config.json is patched.
I can confirm that it is related to the sliding window. We deployed OpenChat 3.5 1210 again with a modified config.json that sets the sliding window to null, and vllm doesn't fail anymore. So does that mean the sliding window implementation has a bug? Should it not be set in config.json? That's also the reason why one had an issue with Mistral v0.1, which has a sliding window set.
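A minimal sketch of the kind of config patch described above. The local model path is a placeholder, and the only assumption is that the downloaded model's config.json carries a `sliding_window` key.

```python
# Hedged sketch of the workaround described above: set "sliding_window" to null
# in a local copy of the model's config.json before serving it with vLLM.
# The model path is a placeholder.
import json
from pathlib import Path

config_path = Path("/models/openchat-3.5-1210/config.json")  # assumed local model directory

config = json.loads(config_path.read_text())
print("before:", config.get("sliding_window"))

config["sliding_window"] = None  # serialized as null in JSON
config_path.write_text(json.dumps(config, indent=2))
print("after:", json.loads(config_path.read_text()).get("sliding_window"))
```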
@hmellor I have found the issue and opened a PR that fixes it. Let me know if there are any open questions.
Getting this error really frequently when querying a Mistral-based model:
ValueError: Double free! PhysicalTokenBlock(device=Device.GPU, block_number=9180, ref_count=0) is already freed.
Interestingly, it primarily happens when I'm hitting the endpoint from pure python/requests, not when I'm using Postman. But it has happened when using Postman and when just querying the model locally, though less frequently.