[Bug]: Client prompt exceeding model MAX_PREFILL_SEQ_LEN causes vLLM server crash #39
Closed
Labels: bug
Your current environment
Using the vLLM Docker container serving Llama 3.1 70B Instruct: https://github.com/tenstorrent/tt-inference-server/tree/main/vllm-tt-metal-llama3-70b
Model Input Dumps
No response
🐛 Describe the bug
A prefill of 32k to 64k tokens triggers the assertion at https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_model_optimized.py#L180.
This brings down the vLLM model process and then the server once the heartbeat times out.
For reliability, client data should not be able to crash the vLLM server.
The client receives an incomplete response, e.g. when using aiohttp.
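A minimal reproduction sketch, assuming the server exposes the standard OpenAI-compatible /v1/completions endpoint on port 8000; the model name and prompt size below are illustrative, not taken from the original report:

```python
# Minimal reproduction sketch: endpoint, model name, and prompt size are
# assumptions for illustration, not taken from the original report.
import asyncio
import aiohttp

async def main() -> None:
    # A prompt long enough to exceed the model's supported prefill length.
    prompt = "word " * 40_000
    payload = {
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "prompt": prompt,
        "max_tokens": 16,
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://localhost:8000/v1/completions", json=payload
        ) as resp:
            # When the bug triggers, the connection drops mid-response
            # instead of returning a clean HTTP error.
            print(resp.status)
            print(await resp.text())

asyncio.run(main())
```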
For prefill greater than the maximum supported context, a 400 error should be returned to the client stating that the model's maximum supported context has been exceeded and that the context length must be reduced to the supported maximum.
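A minimal sketch of the kind of pre-dispatch check that would produce this behavior, assuming a FastAPI-based serving layer; the constant name, tokenizer handle, and the exact place this check would live are assumptions, not vLLM's actual API:

```python
# Sketch only: the constant name, tokenizer handle, and where this check
# would live in the serving layer are assumptions, not vLLM's actual API.
from fastapi import HTTPException

MAX_PREFILL_SEQ_LEN = 32 * 1024  # assumed model limit, in tokens

def validate_prompt_length(prompt: str, tokenizer) -> None:
    """Reject over-long prompts with a 400 before they reach the model."""
    num_tokens = len(tokenizer.encode(prompt))
    if num_tokens > MAX_PREFILL_SEQ_LEN:
        raise HTTPException(
            status_code=400,
            detail=(
                f"Prompt is {num_tokens} tokens but the model supports at most "
                f"{MAX_PREFILL_SEQ_LEN}; reduce the context length."
            ),
        )
```

Rejecting the request at this layer would keep the over-long prompt from ever reaching the model process, so the assertion in llama_model_optimized.py would never be hit and the server would stay up.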