[Usage]: How to run FP8 inference #453

Closed
warlock135 opened this issue Nov 3, 2024 · 5 comments · Fixed by #502

warlock135 commented Nov 3, 2024

Your current environment

Version: v0.5.3.post1+Gaudi-1.18.0
Models: [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
Hardware: 8xHL-225

How would you like to use vllm

I'm trying to run FP8 inference on Meta-Llama-3-70B-Instruct using vLLM with INC (Intel Neural Compressor) FP8 quantization. I successfully launched vLLM with the following command:

QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu
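
(For reference, the QUANT_CONFIG file passed above is an Intel Neural Compressor FP8 quantization config. As a rough sketch only, a maxabs quantization config usually contains fields like the following; the exact fields and the dump_stats_path depend on your INC release and on where the calibration stats were dumped, so this is not necessarily the content of the file used here.)

{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "dump_stats_path": "./inc_output/measure"
}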

However, when starting inference, vLLM reported the following error:

ERROR 11-03 05:27:49 async_llm_engine.py:671] Engine iteration timed out. This should never happen!
ERROR 11-03 05:27:49 async_llm_engine.py:56] Engine background task failed
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56]     done, _ = await asyncio.wait(
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 11-03 05:27:49 async_llm_engine.py:56]     return await _wait(fs, timeout, return_when, loop)
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 11-03 05:27:49 async_llm_engine.py:56]     await waiter
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 11-03 05:27:49 async_llm_engine.py:56]     return_value = task.result()
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 11-03 05:27:49 async_llm_engine.py:56]     self._do_exit(exc_type)
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 11-03 05:27:49 async_llm_engine.py:56]     raise asyncio.TimeoutError
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause

In addition, the warm-up phase with this setup took about 10 hours to complete.
What is the correct way to run FP8 inference with this vLLM fork?

@afierka-intel

Hello @warlock135.

Thank you for the very detailed description! One detail that is missing is which branch you used. Please use the habana_main branch, and then set the following environment variables:

export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600  # the timeout you are hitting right now; value is in seconds
export VLLM_RPC_TIMEOUT=600000  # a timeout you may hit later; value is in milliseconds

You can also test your server while skipping the warmup stage via this environment variable:

export VLLM_SKIP_WARMUP=true

This can save you a lot of warmup time.
NOTE: We do not recommend running the vLLM server without warmup in a production environment, but this option is useful for development and testing.

To summarize, this command should help you quickly verify that the configuration works:

VLLM_SKIP_WARMUP=true \
VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu

and this should work fine in production:

VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu
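
After the server reports it is ready, a quick smoke test against the OpenAI-compatible endpoint (a minimal example, assuming the server is reachable on localhost and the served model name matches the --model value above) looks like this:

curl http://localhost:9002/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Meta-Llama-3-70B-Instruct", "prompt": "Hello, my name is", "max_tokens": 16}'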

Once the recommended solution works for you, please close the issue; otherwise, I'm open to further discussion.

@warlock135 (Author)

This configuration works correctly on the latest release branch (v0.5.3.post1+Gaudi-1.18.0):

VLLM_SKIP_WARMUP=true \
VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu

However, when I tried the habana_main branch, an inference error occurred, as shown below:

ERROR 11-11 04:29:13 engine.py:143] TypeError("PatchedVLLMKVCache.forward() missing 2 required positional arguments: 'block_indices' and 'block_offset'")
ERROR 11-11 04:29:13 engine.py:143] Traceback (most recent call last):
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/engine/multiprocessing/engine.py", line 141, in start
ERROR 11-11 04:29:13 engine.py:143]     self.run_engine_loop()
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/engine/multiprocessing/engine.py", line 204, in run_engine_loop
ERROR 11-11 04:29:13 engine.py:143]     request_outputs = self.engine_step()
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/engine/multiprocessing/engine.py", line 222, in engine_step
ERROR 11-11 04:29:13 engine.py:143]     raise e
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/engine/multiprocessing/engine.py", line 213, in engine_step
ERROR 11-11 04:29:13 engine.py:143]     return self.engine.step()
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/engine/llm_engine.py", line 1466, in step
ERROR 11-11 04:29:13 engine.py:143]     outputs = self.model_executor.execute_model(
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/executor/ray_hpu_executor.py", line 319, in execute_model
ERROR 11-11 04:29:13 engine.py:143]     return super().execute_model(execute_model_req)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
ERROR 11-11 04:29:13 engine.py:143]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/executor/ray_hpu_executor.py", line 312, in _driver_execute_model
ERROR 11-11 04:29:13 engine.py:143]     return self.driver_worker.execute_method("execute_model",
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/worker/worker_base.py", line 481, in execute_method
ERROR 11-11 04:29:13 engine.py:143]     raise e
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/worker/worker_base.py", line 472, in execute_method
ERROR 11-11 04:29:13 engine.py:143]     return executor(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 11-11 04:29:13 engine.py:143]     output = self.model_runner.execute_model(
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 11-11 04:29:13 engine.py:143]     return func(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/worker/hpu_model_runner.py", line 2124, in execute_model
ERROR 11-11 04:29:13 engine.py:143]     hidden_states = self.model.forward(
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 726, in forward
ERROR 11-11 04:29:13 engine.py:143]     return wrapped_hpugraph_forward(
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 576, in wrapped_hpugraph_forward
ERROR 11-11 04:29:13 engine.py:143]     return orig_fwd(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/worker/hpu_model_runner.py", line 385, in forward
ERROR 11-11 04:29:13 engine.py:143]     hidden_states = self.model(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 11-11 04:29:13 engine.py:143]     return self._call_impl(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1565, in _call_impl
ERROR 11-11 04:29:13 engine.py:143]     return forward_call(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/model_executor/models/llama.py", line 567, in forward
ERROR 11-11 04:29:13 engine.py:143]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 11-11 04:29:13 engine.py:143]     return self._call_impl(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 11-11 04:29:13 engine.py:143]     result = forward_call(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/model_executor/models/llama.py", line 347, in forward
ERROR 11-11 04:29:13 engine.py:143]     hidden_states, residual = layer(positions, hidden_states,
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 11-11 04:29:13 engine.py:143]     return self._call_impl(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 11-11 04:29:13 engine.py:143]     result = forward_call(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/model_executor/models/llama.py", line 262, in forward
ERROR 11-11 04:29:13 engine.py:143]     hidden_states = self.self_attn(positions=positions,
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 11-11 04:29:13 engine.py:143]     return self._call_impl(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 11-11 04:29:13 engine.py:143]     result = forward_call(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/model_executor/models/llama.py", line 192, in forward
ERROR 11-11 04:29:13 engine.py:143]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 11-11 04:29:13 engine.py:143]     return self._call_impl(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 11-11 04:29:13 engine.py:143]     result = forward_call(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/attention/layer.py", line 99, in forward
ERROR 11-11 04:29:13 engine.py:143]     return self.impl.forward(query,
ERROR 11-11 04:29:13 engine.py:143]   File "/vllm-fork/vllm/attention/backends/hpu_attn.py", line 182, in forward
ERROR 11-11 04:29:13 engine.py:143]     key_cache = self.k_cache(key, key_cache, block_indices,
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 11-11 04:29:13 engine.py:143]     return self._call_impl(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 11-11 04:29:13 engine.py:143]     result = forward_call(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143] TypeError: PatchedVLLMKVCache.forward() missing 2 required positional arguments: 'block_indices' and 'block_offset'

@tianmu-li

Facing the same issue on habana_main. I think it's related to #289.

@xuechendi

@warlock135, the error in your latest log occurs because the latest changes on habana_main have not yet been reflected in Intel Neural Compressor (INC).

intel/neural-compressor#2065
Please apply the fix from that PR to your local INC code; once the PR is merged, I'll update the INC version pinned in vllm-fork.

@xuechendi

Update the INC version as below to fix this issue:

pip install -U git+https://github.com/intel/neural-compressor.git@b196432
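
To double-check that the updated package was picked up (a minimal check; the reported version string depends on the pinned commit), you can run:

python3 -c "import neural_compressor; print(neural_compressor.__version__)"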
