[Usage]: How to run FP8 inference #453
Comments
Hello @warlock135. Thank you for the very detailed description! One tiny detail I'm missing is which branch you used. In any case, please set:

export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 # the timeout you are hitting right now; value in seconds
export VLLM_RPC_TIMEOUT=600000 # a timeout you may hit in the future; value in microseconds

You can also skip the warmup stage via this env variable, which saves a lot of warmup time:

export VLLM_SKIP_WARMUP=true

Summarizing, this command should let you quickly verify that the configuration works:

VLLM_SKIP_WARMUP=true \
VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu

and this should work fine in production:

VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu

Once the recommended solution works for you, please close the issue; otherwise I'm open to further discussion.
|
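Once the server is up, a quick end-to-end check against the OpenAI-compatible endpoint can be done with something like the sketch below. The port and model name match the launch command above; the prompt and sampling parameters are arbitrary placeholders.

```python
# Minimal smoke test for the OpenAI-compatible server launched above.
# Assumes it is listening on localhost:9002 and was started with
# --model Meta-Llama-3-70B-Instruct, as in the command in this thread.
import requests

resp = requests.post(
    "http://localhost:9002/v1/completions",
    json={
        "model": "Meta-Llama-3-70B-Instruct",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```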
This configuration works correctly on the latest release branch (v0.5.3.post1+Gaudi-1.18.0). However, when I tried the habana_main branch, an inference error occurred as shown below:
|
Facing the same issue on habana_main. I think it's related to #289. |
@warlock135, the error in your latest log is caused by a recent change in habana_main that has not yet been picked up by Intel Neural Compressor. See intel/neural-compressor#2065. |
Update the INC version as shown below to fix this issue:
|
Your current environment
How would you like to use vllm
I'm trying to run FP8 inference with Meta-Llama-3-70B-Instruct using this vLLM fork. I successfully launched vLLM with the following command:
However, when starting inference, vLLM reported an error.
In addition, the warm-up phase with this setup took about 10 hours to complete.
What is the correct way to run FP8 inference with this vLLM fork?
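For reference, an offline-inference equivalent of the FP8/INC setup discussed in this thread might look like the sketch below. The INC config path, parallelism, and fork-specific options mirror the server command shown earlier; passing them as LLM constructor arguments (and skipping warmup via VLLM_SKIP_WARMUP) is an assumption for illustration, not a confirmed recipe for this fork.

```python
# Offline-inference sketch mirroring the server flags discussed in this thread.
# The INC config path and engine arguments are taken from the command above;
# treat this as an illustration, not a verified recipe for the Gaudi fork.
import os

os.environ["QUANT_CONFIG"] = (
    "/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json"
)
os.environ["VLLM_SKIP_WARMUP"] = "true"  # optional: only for quick verification

from vllm import LLM, SamplingParams

llm = LLM(
    model="Meta-Llama-3-70B-Instruct",
    quantization="inc",        # Intel Neural Compressor FP8 quantization
    kv_cache_dtype="fp8_inc",  # FP8 KV cache
    tensor_parallel_size=8,
    gpu_memory_utilization=0.94,
    block_size=128,
    device="hpu",              # HPU device selection, as in the CLI flags above
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```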
Before submitting a new issue...