How to configure vLLM gpu_memory_utilization? #636
Comments
We're working on some quality-of-life improvements to help with that: #630. Otherwise, look at the error message; it should give you the names of the parameters you can tweak to fix the RAM issue.
I'm writing from memory, so check the error message for the correct names.
Hi @Narsil, thanks for the help. I tried tweaking the numbers; even with --max-batch-prefill-tokens=1 and --max-batch-total-tokens=2 it still runs out of memory. What else can I do? For context: the Llama model works when I disable flash attention, and it also works with flash attention but without vllm, so something must be going wrong on the vllm side.
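For reference, a minimal sketch of launching TGI with smaller batch-token budgets, wrapped in a Python subprocess call for illustration. The flag names come from this thread; the values and the model id are placeholder assumptions, not recommendations.

```python
# Illustrative sketch only: flag names are from this thread; the values and
# the model id are placeholders for a 40GB A100, not verified recommendations.
import subprocess

cmd = [
    "text-generation-launcher",
    "--model-id", "Salesforce/codegen25-7b-multi",  # placeholder; use your model's hub id
    "--max-batch-prefill-tokens", "2048",  # cap on tokens processed during prefill
    "--max-batch-total-tokens", "4096",    # cap on prefill + decode tokens per batch
]
subprocess.run(cmd, check=True)
```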
Can you share a reproducible example? And the full stacktrace?
I have the same issue; the KV-cache warmup causes an OOM.
Lowering the value of gpu_memory_utilization a bit (or further if needed) resolved the problem for me.
Hi team, I am trying to run the codegen2.5 7B model on TGI with an A100 40GB, and it gives me an out-of-memory error because of vllm. I wonder if there is a way to configure gpu_memory_utilization in the code so that vllm does not reserve too much memory beforehand.
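For comparison, a minimal sketch of how this reservation is tuned in standalone vLLM, where the LLM constructor accepts a gpu_memory_utilization argument; whether TGI exposes the same knob is exactly what this issue asks. The model id, memory fraction, and context length below are assumptions chosen for illustration.

```python
# Minimal sketch, standalone vLLM (not TGI): gpu_memory_utilization controls
# the fraction of GPU memory pre-reserved for model weights plus the KV cache.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Salesforce/codegen25-7b-multi",  # assumed hub id for codegen2.5 7B
    trust_remote_code=True,                 # codegen2.5 ships a custom tokenizer
    gpu_memory_utilization=0.8,             # reserve 80% instead of the 0.9 default
    max_model_len=2048,                     # smaller KV-cache budget for a 40GB A100
)
outputs = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```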