[Bug]: Client prompt exceeding model MAX_PREFILL_SEQ_LEN causes vLLM server crash #39
Closed
Labels: bug
Your current environment
Using the vLLM Docker container serving Llama 3.1 70B Instruct: https://github.com/tenstorrent/tt-inference-server/tree/main/vllm-tt-metal-llama3-70b
Model Input Dumps
No response
🐛 Describe the bug
A prefill of 32k to 64k tokens triggers the assertion at https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_model_optimized.py#L180.
This brings down the vLLM model process and then the server once the heartbeat times out.
For reliability, client data should not be able to crash the vLLM server.
The client receives an incomplete response, e.g. when using aiohttp.
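A minimal reproduction sketch, assuming the server exposes the standard OpenAI-compatible /v1/completions endpoint on port 8000; the model name and prompt size below are illustrative, not taken from the original report:

```python
# Minimal reproduction sketch: endpoint, model name, and prompt size are
# assumptions for illustration, not taken from the original report.
import asyncio
import aiohttp

async def main() -> None:
    # A prompt long enough to exceed the model's supported prefill length.
    prompt = "word " * 40_000
    payload = {
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "prompt": prompt,
        "max_tokens": 16,
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://localhost:8000/v1/completions", json=payload
        ) as resp:
            # When the bug triggers, the connection drops mid-response
            # instead of returning a clean HTTP error.
            print(resp.status)
            print(await resp.text())

asyncio.run(main())
```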
For prefill greater than the maximum supported context, a 400 error should be returned to the client stating that the model's maximum supported context has been exceeded and that the context length must be reduced to the supported maximum.
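A minimal sketch of the kind of pre-dispatch check that would produce this behavior, assuming a FastAPI-based serving layer; the constant name, tokenizer handle, and the exact place this check would live are assumptions, not vLLM's actual API:

```python
# Sketch only: the constant name, tokenizer handle, and where this check
# would live in the serving layer are assumptions, not vLLM's actual API.
from fastapi import HTTPException

MAX_PREFILL_SEQ_LEN = 32 * 1024  # assumed model limit, in tokens

def validate_prompt_length(prompt: str, tokenizer) -> None:
    """Reject over-long prompts with a 400 before they reach the model."""
    num_tokens = len(tokenizer.encode(prompt))
    if num_tokens > MAX_PREFILL_SEQ_LEN:
        raise HTTPException(
            status_code=400,
            detail=(
                f"Prompt is {num_tokens} tokens but the model supports at most "
                f"{MAX_PREFILL_SEQ_LEN}; reduce the context length."
            ),
        )
```

Rejecting the request at this layer would keep the over-long prompt from ever reaching the model process, so the assertion in llama_model_optimized.py would never be hit and the server would stay up.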