
[Usage]: How to decrease VRAM Usage when specifying FP8 arguments #4387

Closed
AnyISalIn opened this issue Apr 26, 2024 · 2 comments
Labels
usage How to use vllm

Comments

@AnyISalIn
Contributor

Your current environment

Thank you for the team's work on FP8 quantization. However, I am currently loading an FP16 model with the arguments --quantization fp8 --kv-cache-dtype fp8, and the VRAM usage appears to be the same as with plain FP16. How can we decrease VRAM usage when specifying the FP8 arguments?

How would you like to use vllm

I believe FP8 should require no more than about 50% of the VRAM that FP16 does.
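
For context, a minimal sketch of the setup being described, using vLLM's offline Python API. The model name is a placeholder, and the keyword arguments are assumed to mirror the CLI flags above; exact names may differ between vLLM versions.

```python
from vllm import LLM

# Placeholder model name; substitute the FP16 checkpoint actually being loaded.
# quantization="fp8" casts the weights to FP8 after they are loaded in FP16;
# kv_cache_dtype="fp8" stores the KV cache in FP8.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="fp8",
    kv_cache_dtype="fp8",
)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```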

@AnyISalIn AnyISalIn added the usage How to use vllm label Apr 26, 2024
@pcmoritz
Collaborator

Currently with #4118, the weights are loaded in FP16 and then converted to FP8. So you will still need enough VRAM to fit the FP16 weights.

This will change soon once we support FP8 checkpoints :)
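
To make the memory arithmetic concrete, here is a rough back-of-the-envelope sketch (not vLLM code); the 7B parameter count is purely illustrative.

```python
# Rough illustration of peak weight memory when loading FP16 and then casting to FP8.
params = 7e9  # assumed "7B" model, for illustration only

fp16_bytes = params * 2  # 2 bytes per parameter
fp8_bytes = params * 1   # 1 byte per parameter

print(f"FP16 weights: {fp16_bytes / 1e9:.0f} GB")  # ~14 GB
print(f"FP8 weights:  {fp8_bytes / 1e9:.0f} GB")   # ~7 GB

# Because the FP16 weights must be fully materialized before conversion,
# the peak during loading stays at the FP16 footprint (~14 GB here).
# Loading a native FP8 checkpoint directly would avoid that peak.
print(f"Peak during load-then-convert: {fp16_bytes / 1e9:.0f} GB")
```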

@AnyISalIn
Contributor Author

> Currently with #4118, the weights are loaded in FP16 and then converted to FP8. So you will still need enough VRAM to fit the FP16 weights.
>
> This will change soon once we support FP8 checkpoints :)

Thanks for your reply!
