
How to configure vLLM gpu_memory_utilization? #636

Closed
zch-cc opened this issue Jul 18, 2023 · 5 comments · Fixed by #664

Comments

@zch-cc

zch-cc commented Jul 18, 2023

Hi team, I am trying to use the CodeGen2.5 7B model on TGI with an A100 40GB, and it gives me an out-of-memory error because of vLLM. I wonder if there is any way I can configure gpu_memory_utilization in the code so that vLLM does not reserve too much memory beforehand.

@Narsil
Collaborator

Narsil commented Jul 18, 2023

We're working on some quality-of-life improvements to help with that: #630

Otherwise, take a look at the error message; it should give you the names of the parameters you can tweak to fix the RAM issue:

--max-total-tokens
--max-batch-prefill-tokens
--max-input-length
# and maybe a couple of others; I usually only tweak these

I'm writing from memory, so check the error message or text-generation-launcher --help for the exact names.
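For illustration, the flags above slot into a launcher invocation along the lines of the sketch below. The model id and the token budgets are placeholders chosen for the example (roughly self-consistent values), not settings taken from this thread.

# placeholder model id and limits, shown only to illustrate where the flags go
text-generation-launcher \
    --model-id Salesforce/codegen25-7b-multi \
    --max-input-length 1024 \
    --max-total-tokens 2048 \
    --max-batch-prefill-tokens 2048

Lowering --max-batch-prefill-tokens and --max-total-tokens shrinks the batch and KV-cache budget TGI tries to reserve, which is usually what brings the allocation back under the card's memory.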

@zch-cc
Author

zch-cc commented Jul 18, 2023

Hi @Narsil, thanks for the help. I tried tweaking the numbers; even when I set --max-batch-prefill-tokens=1 and --max-batch-total-tokens=2, it still runs out of memory. What else can I do? For context, when I don't use flash attention with the Llama model, it works. When I use flash attention but not vLLM, it also works. So there must be something going on on the vLLM side.

@Narsil
Collaborator

Narsil commented Jul 20, 2023

Can you share a reproducible example? And the full stacktrace?

@jameswu2014

I have the same issue; the KV cache warmup causes an OOM.

@wuminghui-coder

Just lower the value of gpu_memory_utilization a bit, and reduce it further if needed. That resolved the problem for me.
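For readers who land here from standalone vLLM rather than TGI: gpu_memory_utilization is the fraction of GPU memory vLLM pre-allocates for the model weights plus the KV cache, and it can be passed to the Python LLM constructor or on the command line as an engine argument. A minimal sketch, assuming vLLM's api_server entrypoint and a placeholder model id:

# the default is 0.9; lowering it leaves more headroom and avoids the warmup OOM
python -m vllm.entrypoints.api_server \
    --model Salesforce/codegen25-7b-multi \
    --gpu-memory-utilization 0.7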
