
[Bug]: Client prompt exceeding model MAX_PREFILL_SEQ_LEN causes vLLM server crash #39

Closed · opened by tstescoTT on Dec 6, 2024 · 2 comments
Labels: bug (Something isn't working)
tstescoTT commented Dec 6, 2024

Your current environment

Using the vLLM Docker container serving Llama 3.1 70B Instruct: https://github.com/tenstorrent/tt-inference-server/tree/main/vllm-tt-metal-llama3-70b

Model Input Dumps

No response

🐛 Describe the bug

Prefill with sequence lengths between 32k and 64k tokens triggers the assertion at https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_model_optimized.py#L180.

This crashes the vLLM model process, and the server goes down once the heartbeat times out.

ERROR 12-06 14:22:34 engine.py:159]   File "/tt-metal/models/demos/t3000/llama2_70b/tt/llama_model_optimized.py", line 200, in prepare_inputs
ERROR 12-06 14:22:34 engine.py:159]     self.validate_input_shape(inp_ids, mode)
ERROR 12-06 14:22:34 engine.py:159]   File "/tt-metal/models/demos/t3000/llama2_70b/tt/llama_model_optimized.py", line 180, in validate_input_shape
ERROR 12-06 14:22:34 engine.py:159]     assert (
ERROR 12-06 14:22:34 engine.py:159] AssertionError: Prefill only supports seq_len < 32768
ERROR:    Exception in ASGI application
...
ERROR 12-06 14:25:41 client.py:250] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 12-06 14:25:41 client.py:250] NoneType: None
DEBUG 12-06 14:25:41 client.py:144] Shutting down MQLLMEngineClient check health loop due to timeout

For reliability, client-supplied data should not be able to crash the vLLM server.

The client receives an incomplete response, e.g. when using aiohttp:

aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>

For prefill lengths greater than the maximum supported context, a 400 error should be returned to the client stating that the model's maximum supported context has been exceeded and that the prompt must be reduced to fit within it.
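A minimal sketch of the requested behavior: validate the tokenized prompt length before prefill and surface a typed error that the HTTP layer can map to a 400 response, instead of letting the internal assertion kill the engine. The names `MAX_PREFILL_SEQ_LEN`, `PromptTooLongError`, and `validate_prefill_len` are hypothetical and not part of the vLLM or tt-metal API; the 32768 limit comes from the assertion message above.

```python
# Hypothetical guard for rejecting over-length prompts before prefill.
# None of these names exist in vLLM/tt-metal; this is only a sketch of
# the requested behavior (return 400 instead of crashing the engine).

MAX_PREFILL_SEQ_LEN = 32768  # limit from the assertion in llama_model_optimized.py


class PromptTooLongError(ValueError):
    """Raised pre-prefill so the HTTP layer can return a 400 response."""

    def __init__(self, seq_len: int, max_len: int):
        self.seq_len = seq_len
        self.max_len = max_len
        super().__init__(
            f"Prompt length {seq_len} exceeds the maximum supported context "
            f"of {max_len} tokens; reduce the prompt length and retry."
        )


def validate_prefill_len(token_ids: list, max_len: int = MAX_PREFILL_SEQ_LEN) -> None:
    # The model asserts seq_len < max_len, so reject anything at or above it.
    if len(token_ids) >= max_len:
        raise PromptTooLongError(len(token_ids), max_len)
```

With a check like this at request-validation time, the engine never sees an over-length prompt, and the client gets a descriptive 400 instead of a truncated chunked response.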

@skhorasganiTT commented:

Closing as this is a duplicate of #29

@skhorasganiTT commented:

Addressed in tenstorrent/tt-metal#15880
