
[Usage]: How to decrease VRAM Usage when specifying FP8 arguments #4387

Closed
AnyISalIn opened this issue Apr 26, 2024 · 2 comments
Labels
usage How to use vllm

Comments

@AnyISalIn
Contributor

Your current environment

Thank you for the team's work on FP8 quantization. However, I am currently loading an FP16 model with the arguments --quantization fp8 --kv-cache-dtype fp8, and the VRAM usage appears to be the same as with plain FP16. How can we decrease VRAM usage when specifying the FP8 arguments?

How would you like to use vllm

I believe FP8 should require no more than about 50% of the VRAM that FP16 does.
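
For context, a minimal sketch of the setup being described, using vLLM's offline Python API. The model name is a placeholder, and the keyword arguments are assumed to mirror the CLI flags above; exact names may differ between vLLM versions.

```python
from vllm import LLM

# Placeholder model name; substitute the FP16 checkpoint actually being loaded.
# quantization="fp8" casts the weights to FP8 after they are loaded in FP16;
# kv_cache_dtype="fp8" stores the KV cache in FP8.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="fp8",
    kv_cache_dtype="fp8",
)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```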

@AnyISalIn AnyISalIn added the usage How to use vllm label Apr 26, 2024
@pcmoritz
Collaborator

Currently with #4118, the weights are loaded in FP16 and then converted to FP8. So you will still need enough VRAM to fit the FP16 weights.

This will change soon once we support FP8 checkpoints :)
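
To make the memory arithmetic concrete, here is a rough back-of-the-envelope sketch (not vLLM code); the 7B parameter count is purely illustrative.

```python
# Rough illustration of peak weight memory when loading FP16 and then casting to FP8.
params = 7e9  # assumed "7B" model, for illustration only

fp16_bytes = params * 2  # 2 bytes per parameter
fp8_bytes = params * 1   # 1 byte per parameter

print(f"FP16 weights: {fp16_bytes / 1e9:.0f} GB")  # ~14 GB
print(f"FP8 weights:  {fp8_bytes / 1e9:.0f} GB")   # ~7 GB

# Because the FP16 weights must be fully materialized before conversion,
# the peak during loading stays at the FP16 footprint (~14 GB here).
# Loading a native FP8 checkpoint directly would avoid that peak.
print(f"Peak during load-then-convert: {fp16_bytes / 1e9:.0f} GB")
```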

@AnyISalIn
Contributor Author

> Currently with #4118, the weights are loaded in FP16 and then converted to FP8. So you will still need enough VRAM to fit the FP16 weights.
>
> This will change soon once we support FP8 checkpoints :)

Thanks for your reply!
