vllm offline inference #1022
Comments
Thank you for reporting this issue @Shnumshnub. Our team is looking into this and will let you know if there are any updates or if we need any further information from you.
@delongmeng-aws, great, thank you! More information about my environment:
pip list | grep neuronx
aws-neuronx-runtime-discovery 2.9
libneuronxla 2.0.4115.0
neuronx-cc 2.15.141.0+d3cfc8ca
torch-neuronx 2.1.2.2.3.1
transformers-neuronx 0.12.313
pip list | grep torch
torch 2.1.2
torch-neuronx 2.1.2.2.3.1
torch-xla 2.1.4
torchvision 0.16.2
Thank you @Shnumshnub! We were able to reproduce the issue and are looking further into the root cause and a potential fix.
@delongmeng-aws, thank you!
server: inf2.8xlarge
vllm version: 0.6.3.post2.dev77+g2394962d.neuron215
Description
Hello! I am trying to run the code below (the code was taken here). I managed to run it with the model (TinyLlama/TinyLlama-1.1B-Chat-v1.0) from the original example, but when I tried to run it with llama-3.1-8b-Instruct, vLLM crashes with the following error:
It looks like a problem with model compilation:
Full Log
full.log
Source code
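For reference, a minimal sketch of the offline inference flow this report is based on, following the pattern of vLLM's Neuron offline inference example. The model name is swapped to Llama-3.1-8B-Instruct to match the report, and the max_model_len, block_size, and tensor_parallel_size values are assumptions, not the exact settings used by the reporter:

```python
from vllm import LLM, SamplingParams

# Example prompts to run through the model.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Assumed settings: model identifier, context length, and tensor parallel
# degree are placeholders chosen for an inf2 instance, not values confirmed
# by the original report.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=8,
    # The Neuron backend compiles the model for a fixed context length,
    # so max_model_len and block_size are kept equal here.
    max_model_len=2048,
    block_size=2048,
    device="neuron",
    tensor_parallel_size=2,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")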
```