vllm offline inference #1022

Open
Shnumshnub opened this issue Oct 31, 2024 · 4 comments
Labels: bug (Something isn't working), Inf2

Comments

@Shnumshnub

server: inf2.8xlarge
vllm version: 0.6.3.post2.dev77+g2394962d.neuron215

Description
Hello! I am trying to run the code below (the code was taken from here). I managed to run it with the model from the original example (TinyLlama/TinyLlama-1.1B-Chat-v1.0), but when I try to run it with Llama-3.1-8B-Instruct, vllm crashes with the following error:

Traceback (most recent call last):
  File "/home/user/workspace/offline_inference.py", line 21, in <module>
    llm = LLM(
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/utils.py", line 1053, in inner
    return fn(*args, **kwargs)
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 198, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 578, in from_engine_args
    engine = cls(
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 337, in __init__
    self.model_executor = executor_class(
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/executor/neuron_executor.py", line 25, in _init_executor
    self._init_worker()
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/executor/neuron_executor.py", line 41, in _init_worker
    self.driver_worker.load_model()
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/worker/neuron_worker.py", line 57, in load_model
    self.model_runner.load_model()
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/worker/neuron_model_runner.py", line 114, in load_model
    self.model = get_neuron_model(
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/model_executor/model_loader/neuron.py", line 203, in get_neuron_model
    model.load_weights(model_config.model,
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/model_executor/model_loader/neuron.py", line 112, in load_weights
    self.model.to_neuron()
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/base.py", line 85, in to_neuron
    self.compile()
  File "/home/user/workspace/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers_neuronx/base.py", line 64, in compile
    kernel.neff_bytes = neff_bytes_futures[hash_hlo(kernel.hlo_module)].result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/user/neuroncc_compile_workdir/4ccb00bb-a4eb-432f-bda1-2335fa4c6a20/model.MODULE_4af556d64db45251cbb6+39f12043.hlo_module.pb', '--output', '/tmp/user/neuroncc_compile_workdir/4ccb00bb-a4eb-432f-bda1-2335fa4c6a20/model.MODULE_4af556d64db45251cbb6+39f12043.neff', '--model-type=transformer', '--auto-cast=none', '--execute-repetition=1', '--verbose=35']' returned non-zero exit status 70.

It looks like a problem with model compilation:

2024-10-31 09:37:23.000571:  200724  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/user/neuroncc_compile_workdir/58012403-a7b8-44a6-84e1-20897a51edb5/model.MODULE_3c368f070c9bfaf4b314+39f12043.hlo_module.pb', '--output', '/tmp/user/neuroncc_compile_workdir/58012403-a7b8-44a6-84e1-20897a51edb5/model.MODULE_3c368f070c9bfaf4b314+39f12043.neff', '--model-type=transformer', '--auto-cast=none', '--execute-repetition=1', '--verbose=35']: 2024-10-31T09:37:23Z [XCG863] (uint32<8 x 8> $42930:42930)0: ISA check failed - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.

2024-10-31 09:37:23.000572:  200724  ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/user/neuroncc_compile_workdir/58012403-a7b8-44a6-84e1-20897a51edb5/model.MODULE_3c368f070c9bfaf4b314+39f12043.hlo_module.pb after 0 retries.
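
For more compiler detail, the XCG863 message above points at the XLA_IR_DEBUG and XLA_HLO_DEBUG environment variables. A minimal sketch of setting them before constructing the LLM (the "1" values are assumed, and whether the neuronx-cc subprocess inherits them has not been verified here):

import os

# Hypothetical debug re-run: the compiler message suggests these variables
# may surface more information. "1" is an assumed enable value.
os.environ["XLA_IR_DEBUG"] = "1"
os.environ["XLA_HLO_DEBUG"] = "1"

# ...then build the LLM exactly as in the script below.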

Full log: full.log

Source code

import os

from vllm import LLM, SamplingParams

# creates XLA hlo graphs for all the context length buckets.
os.environ['NEURON_CONTEXT_LENGTH_BUCKETS'] = "128,512,1024,2048"
# creates XLA hlo graphs for all the token gen buckets.
os.environ['NEURON_TOKEN_GEN_BUCKETS'] = "128,512,1024,2048"

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=8,
    # The max_model_len and block_size arguments are required to be the same
    # as the max sequence length when targeting the Neuron device.
    # Currently, this is a known limitation in continuous batching support
    # in transformers-neuronx.
    # TODO(liangfu): Support paged-attention in transformers-neuronx.
    max_model_len=2048,
    block_size=2048,
    # The device is detected automatically when the AWS Neuron SDK is
    # installed; the device argument can be left unspecified or set explicitly.
    device="neuron",
    tensor_parallel_size=2)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
@delongmeng-aws

Thank you for reporting this issue @Shnumshnub. Our team is looking into it and will let you know if there are any updates or if we need further information from you.

@Shnumshnub
Author

@delongmeng-aws, great, thank you!

More information about my environment

pip list | grep neuronx
aws-neuronx-runtime-discovery     2.9
libneuronxla                      2.0.4115.0
neuronx-cc                        2.15.141.0+d3cfc8ca
torch-neuronx                     2.1.2.2.3.1
transformers-neuronx              0.12.313

pip list | grep torch
torch                             2.1.2
torch-neuronx                     2.1.2.2.3.1
torch-xla                         2.1.4
torchvision                       0.16.2
Collecting environment information...
WARNING 11-01 11:55:47 _custom_ops.py:18] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-1024-aws-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R13 Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           5299.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          512 KiB (16 instances)
L1i cache:                          512 KiB (16 instances)
L2 cache:                           8 MiB (16 instances)
L3 cache:                           64 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-7,16-23
NUMA node1 CPU(s):                  8-15,24-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected

@delongmeng-aws

Thank you @Shnumshnub! We were able to reproduce the issue and are looking further into the root cause and a potential fix.

@Shnumshnub
Author

@delongmeng-aws, thank you!
As a workaround, I downgraded the vllm version to 0.6.2 and it worked.
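
For reference, a sketch of that downgrade, assuming vllm 0.6.2 is available from the same source the 0.6.3 build came from (a plain pip pin is shown here as an assumption):

pip install "vllm==0.6.2"
pip list | grep vllm  # confirm the pinned version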

@aws-taylor added the bug (Something isn't working) and Inf2 labels on Nov 8, 2024