[Usage]: How to run FP8 inference #453
Comments
Hello @warlock135. Thank you for the very detailed description! One tiny detail I'm missing is which branch you used. In any case, please set:

export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 # the timeout you are hitting right now; value in seconds
export VLLM_RPC_TIMEOUT=600000 # a timeout you may hit in the future; value in microseconds

You can also skip the warmup stage via this env variable, which saves a lot of warmup time:

export VLLM_SKIP_WARMUP=true

Summarizing, this command should let you quickly verify that the configuration works:

VLLM_SKIP_WARMUP=true \
VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu

and this should work fine in production:

VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu

Once the recommended solution works for you, please close the issue; otherwise I'm open to further discussion.
|
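Once the server is up, a quick end-to-end check against the OpenAI-compatible endpoint can be done with something like the sketch below. The port and model name match the launch command above; the prompt and sampling parameters are arbitrary placeholders.

```python
# Minimal smoke test for the OpenAI-compatible server launched above.
# Assumes it is listening on localhost:9002 and was started with
# --model Meta-Llama-3-70B-Instruct, as in the command in this thread.
import requests

resp = requests.post(
    "http://localhost:9002/v1/completions",
    json={
        "model": "Meta-Llama-3-70B-Instruct",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```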
This configuration works correctly on the latest release branch (v0.5.3.post1+Gaudi-1.18.0). However, when I tried the habana_main branch, an inference error occurred as shown below:
|
Facing the same issue on habana_main. I think it's related to #289. |
@warlock135, the error in your latest log is caused by a recent change in habana_main that has not yet been picked up by Intel Neural Compressor. See intel/neural-compressor#2065. |
Update the INC version as shown below to fix this issue:
|
Your current environment
How would you like to use vllm
I'm trying to run FP8 inference with Meta-Llama-3-70B-Instruct using this vLLM fork. I successfully launched vLLM with the following command:
However, when starting inference, vLLM reported an error.
In addition, the warm-up phase with this setup took about 10 hours to complete.
What is the correct way to run FP8 inference with this vLLM fork?
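For reference, an offline-inference equivalent of the FP8/INC setup discussed in this thread might look like the sketch below. The INC config path, parallelism, and fork-specific options mirror the server command shown earlier; passing them as LLM constructor arguments (and skipping warmup via VLLM_SKIP_WARMUP) is an assumption for illustration, not a confirmed recipe for this fork.

```python
# Offline-inference sketch mirroring the server flags discussed in this thread.
# The INC config path and engine arguments are taken from the command above;
# treat this as an illustration, not a verified recipe for the Gaudi fork.
import os

os.environ["QUANT_CONFIG"] = (
    "/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json"
)
os.environ["VLLM_SKIP_WARMUP"] = "true"  # optional: only for quick verification

from vllm import LLM, SamplingParams

llm = LLM(
    model="Meta-Llama-3-70B-Instruct",
    quantization="inc",        # Intel Neural Compressor FP8 quantization
    kv_cache_dtype="fp8_inc",  # FP8 KV cache
    tensor_parallel_size=8,
    gpu_memory_utilization=0.94,
    block_size=128,
    device="hpu",              # HPU device selection, as in the CLI flags above
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```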
Before submitting a new issue...