[Bug]: Issues with Applying LoRA in vllm on a T4 GPU #5199

Closed
rikitomo opened this issue Jun 2, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@rikitomo

rikitomo commented Jun 2, 2024

Your current environment

I am currently using a T4 instance on Google Colaboratory.

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.27.9
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.85+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               2
On-line CPU(s) list:                  0,1
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family:                           6
Model:                                79
Thread(s) per core:                   2
Core(s) per socket:                   1
Socket(s):                            1
Stepping:                             0
BogoMIPS:                             4399.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            32 KiB (1 instance)
L1i cache:                            32 KiB (1 instance)
L2 cache:                             256 KiB (1 instance)
L3 cache:                             55 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0,1
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable; SMT Host state unknown
Vulnerability Meltdown:               Vulnerable
Vulnerability Mmio stale data:        Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:             Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable (Syscall hardening enabled)
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0+cu121
[pip3] torchaudio==2.3.0+cu121
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.18.0
[pip3] torchvision==0.18.0+cu121
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-1		N/A		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I encounter an error when attempting to apply LoRA in vLLM. Here are the details of the problem:

  • The error occurs when enable_lora=True is added.
  • The error only occurs when using a T4 GPU. It does not occur with L4 or A100 GPUs.

Below is a sample code snippet that reproduces the issue:

!pip install torch==2.3.0+cu121 vllm==0.4.3

import torch
import vllm
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

model_path = "microsoft/Phi-3-mini-4k-instruct"
VLLM_TENSOR_PARALLEL_SIZE = 1
VLLM_GPU_MEMORY_UTILIZATION = 0.85

llm = vllm.LLM(
    model=model_path,
    tensor_parallel_size=VLLM_TENSOR_PARALLEL_SIZE,
    gpu_memory_utilization=VLLM_GPU_MEMORY_UTILIZATION,
    trust_remote_code=True,
    dtype=torch.float16,
    enforce_eager=True,
    enable_lora=True
  )

which yields the following output:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-5-d4df0971412c>](https://localhost:8080/#) in <cell line: 1>()
----> 1 llm = vllm.LLM(
      2     model=model_path,
      3     tensor_parallel_size=VLLM_TENSOR_PARALLEL_SIZE,
      4     gpu_memory_utilization=VLLM_GPU_MEMORY_UTILIZATION,
      5     trust_remote_code=True,

22 frames
[/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py](https://localhost:8080/#) in __init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    142             **kwargs,
    143         )
--> 144         self.llm_engine = LLMEngine.from_engine_args(
    145             engine_args, usage_context=UsageContext.LLM_CLASS)
    146         self.request_counter = Counter()

[/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py](https://localhost:8080/#) in from_engine_args(cls, engine_args, usage_context)
    357 
    358         # Create the LLM engine.
--> 359         engine = cls(
    360             **engine_config.to_dict(),
    361             executor_class=executor_class,

[/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py](https://localhost:8080/#) in __init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, vision_language_config, speculative_config, decoding_config, executor_class, log_stats, usage_context)
    233 
    234         if not self.model_config.embedding_mode:
--> 235             self._initialize_kv_caches()
    236 
    237         # If usage stat is enabled, collect relevant info.

[/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py](https://localhost:8080/#) in _initialize_kv_caches(self)
    310         """
    311         num_gpu_blocks, num_cpu_blocks = (
--> 312             self.model_executor.determine_num_available_blocks())
    313 
    314         if self.cache_config.num_gpu_blocks_override is not None:

[/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py](https://localhost:8080/#) in determine_num_available_blocks(self)
     73         underlying worker.
     74         """
---> 75         return self.driver_worker.determine_num_available_blocks()
     76 
     77     def initialize_cache(self, num_gpu_blocks: int, num_cpu_blocks) -> None:

[/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py](https://localhost:8080/#) in decorate_context(*args, **kwargs)
    113     def decorate_context(*args, **kwargs):
    114         with ctx_factory():
--> 115             return func(*args, **kwargs)
    116 
    117     return decorate_context

[/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py](https://localhost:8080/#) in determine_num_available_blocks(self)
    152         # Execute a forward pass with dummy inputs to profile the memory usage
    153         # of the model.
--> 154         self.model_runner.profile_run()
    155 
    156         # Calculate the number of blocks that can be allocated with the

[/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py](https://localhost:8080/#) in decorate_context(*args, **kwargs)
    113     def decorate_context(*args, **kwargs):
    114         with ctx_factory():
--> 115             return func(*args, **kwargs)
    116 
    117     return decorate_context

[/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py](https://localhost:8080/#) in profile_run(self)
    807         num_layers = self.model_config.get_num_layers(self.parallel_config)
    808         kv_caches = [None] * num_layers
--> 809         self.execute_model(seqs, kv_caches)
    810         torch.cuda.synchronize()
    811         return

[/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py](https://localhost:8080/#) in decorate_context(*args, **kwargs)
    113     def decorate_context(*args, **kwargs):
    114         with ctx_factory():
--> 115             return func(*args, **kwargs)
    116 
    117     return decorate_context

[/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py](https://localhost:8080/#) in execute_model(self, seq_group_metadata_list, kv_caches)
    726         if self.vision_language_config:
    727             execute_model_kwargs.update({"image_input": multi_modal_input})
--> 728         hidden_states = model_executable(**execute_model_kwargs)
    729 
    730         # Compute the logits.

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
   1533 
   1534     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1539                 or _global_backward_pre_hooks or _global_backward_hooks
   1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
   1542 
   1543         try:

[/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py](https://localhost:8080/#) in forward(self, input_ids, positions, kv_caches, attn_metadata)
    361         attn_metadata: AttentionMetadata,
    362     ) -> torch.Tensor:
--> 363         hidden_states = self.model(input_ids, positions, kv_caches,
    364                                    attn_metadata)
    365         return hidden_states

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
   1533 
   1534     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1539                 or _global_backward_pre_hooks or _global_backward_hooks
   1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
   1542 
   1543         try:

[/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py](https://localhost:8080/#) in forward(self, input_ids, positions, kv_caches, attn_metadata, inputs_embeds)
    286         for i in range(len(self.layers)):
    287             layer = self.layers[i]
--> 288             hidden_states, residual = layer(
    289                 positions,
    290                 hidden_states,

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
   1533 
   1534     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1539                 or _global_backward_pre_hooks or _global_backward_hooks
   1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
   1542 
   1543         try:

[/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py](https://localhost:8080/#) in forward(self, positions, hidden_states, kv_cache, attn_metadata, residual)
    221         if residual is None:
    222             residual = hidden_states
--> 223             hidden_states = self.input_layernorm(hidden_states)
    224         else:
    225             hidden_states, residual = self.input_layernorm(

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
   1533 
   1534     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1539                 or _global_backward_pre_hooks or _global_backward_hooks
   1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
   1542 
   1543         try:

[/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/layernorm.py](https://localhost:8080/#) in forward(self, x, residual)
     57             )
     58             return x, residual
---> 59         out = torch.empty_like(x)
     60         ops.rms_norm(
     61             out,

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I would appreciate it if you could explain how to apply LoRA on a T4 GPU.
This example uses Phi-3, but the same error also occurred when I tried Llama 3 in a separate T4×2 environment.

@rikitomo rikitomo added the bug Something isn't working label Jun 2, 2024
@emillykkejensen

I have the same issue; however, I am running it on an Azure VM with a T4 GPU using Docker.

@mgoin
Member

mgoin commented Jun 4, 2024

Hi @rikitomo and @emillykkejensen, unfortunately the Punica kernels do not support T4 or V100 GPUs, per #3197.

Please follow up in the issue on their repo, punica-ai/punica#44. Once it is addressed, we can pull the updated kernels into vLLM - thanks!

On another note: this may well be addressed by the recent work on using Triton for LoRA inference! #5036
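
For context, here is a minimal way to check whether a given GPU meets that requirement before enabling LoRA. This is only an illustrative sketch using PyTorch's device-capability query, not part of vLLM; the SM 8.0 threshold comes from the Punica discussion referenced above.

import torch

# Sketch: the Punica LoRA kernels discussed above target compute capability
# (SM) 8.0+, while a T4 is SM 7.5 and a V100 is SM 7.0.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: sm_{major}{minor}")
if (major, minor) < (8, 0):
    print("Punica-based LoRA kernels are not expected to work on this GPU; "
          "see #3197 and the Triton-based alternative in #5036.")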

@mgoin mgoin closed this as completed Jun 4, 2024
@jeejeelee
Collaborator

Thanks to @mgoin for the mention.

#5036 currently provides a preliminary fix for this issue; we have tested it on a TITAN RTX. You can clone that branch and build it.

@emillykkejensen

Hi @jeejeelee

Thanks a lot for the proposed fix. However, when I try to build from your branch I get the same error. I'm building inside a Docker container, so I don't know if that is the issue.

What I did:

docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3

# and then from within the container
git clone https://github.com/jeejeelee/vllm.git
cd vllm
export VLLM_INSTALL_PUNICA_KERNELS=1
pip install -e . 

Once the build was done, I ran:

python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ \
    --quantization awq \
    --dtype half \
    --enable-lora \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --lora-modules sql-lora=jashing/tinyllama-colorist-lora/

That gave me this output:

/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
config.json: 100%|███████████████████████████████████████████████████████████████████████████| 854/854 [00:00<00:00, 11.6MB/s]
WARNING 06-11 08:57:14 config.py:192] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 06-11 08:57:14 llm_engine.py:103] Initializing an LLM engine (v0.4.2) with config: model='TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', speculative_config=None, tokenizer='TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ)
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████| 1.42k/1.42k [00:00<00:00, 25.9MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 35.2MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 18.1MB/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████████████████| 69.0/69.0 [00:00<00:00, 1.30MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████| 96.0/96.0 [00:00<00:00, 1.90MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
generation_config.json: 100%|██████████████████████████████████████████████████████████████| 68.0/68.0 [00:00<00:00, 1.10MB/s]
INFO 06-11 08:57:16 selector.py:113] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-11 08:57:16 selector.py:44] Using XFormers backend.
INFO 06-11 08:57:18 weight_utils.py:206] Using model weights format ['*.safetensors']
model.safetensors: 100%|████████████████████████████████████████████████████████████████████| 766M/766M [00:02<00:00, 262MB/s]
INFO 06-11 08:57:22 model_runner.py:146] Loading model weights took 0.7370 GB
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 186, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 382, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 336, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 458, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/engine/llm_engine.py", line 178, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/workspace/vllm/vllm/engine/llm_engine.py", line 255, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/workspace/vllm/vllm/executor/gpu_executor.py", line 75, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/worker/worker.py", line 154, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/worker/model_runner.py", line 787, in profile_run
[rank0]:     self.execute_model(seqs, kv_caches)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/worker/model_runner.py", line 706, in execute_model
[rank0]:     hidden_states = model_executable(**execute_model_kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/model_executor/models/llama.py", line 367, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/model_executor/models/llama.py", line 292, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/model_executor/models/llama.py", line 231, in forward
[rank0]:     hidden_states = self.self_attn(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/model_executor/models/llama.py", line 160, in forward
[rank0]:     qkv, _ = self.qkv_proj(hidden_states)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/workspace/vllm/vllm/lora/layers.py", line 470, in forward
[rank0]:     output_parallel = self.apply(input_, bias)
[rank0]:   File "/workspace/vllm/vllm/lora/layers.py", line 853, in apply
[rank0]:     output = self.base_layer.quant_method.apply(self.base_layer, x, bias)
[rank0]:   File "/workspace/vllm/vllm/model_executor/layers/quantization/awq.py", line 168, in apply
[rank0]:     out = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
[rank0]:   File "/workspace/vllm/vllm/_custom_ops.py", line 119, in awq_dequantize
[rank0]:     return vllm_ops.awq_dequantize(qweight, scales, zeros, split_k_iters, thx,
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank0]: Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
[rank0]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x70ddb257a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[rank0]: frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x70ddb252ab25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[rank0]: frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x70ddb29e1718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #3: <unknown function> + 0x2ea76 (0x70ddb29bda76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #4: <unknown function> + 0x343e4 (0x70ddb29c33e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #5: <unknown function> + 0x35ca7 (0x70ddb29c4ca7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #6: <unknown function> + 0x360e7 (0x70ddb29c50e7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #7: <unknown function> + 0x1866589 (0x70dd9a7bb589 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #8: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optional<c10::MemoryFormat>) + 0x14 (0x70dd9a7b51e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #9: at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, std::optional<c10::Device>, std::optional<c10::MemoryFormat>) + 0x111 (0x70dd660f6641 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #10: at::detail::empty_cuda(c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) + 0x36 (0x70dd660f6916 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #11: at::native::empty_cuda(c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) + 0x20 (0x70dd66334a30 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #12: <unknown function> + 0x329a789 (0x70dd6833f789 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #13: <unknown function> + 0x329a86b (0x70dd6833f86b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #14: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) + 0xe7 (0x70dd9b7b9be7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #15: <unknown function> + 0x2c10def (0x70dd9bb65def in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #16: at::_ops::empty_memory_format::call(c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) + 0x1a0 (0x70dd9b801a00 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #17: at::empty(c10::ArrayRef<long>, c10::TensorOptions, std::optional<c10::MemoryFormat>) + 0x150 (0x70dcec735c60 in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: frame #18: torch::empty(c10::ArrayRef<long>, c10::TensorOptions, std::optional<c10::MemoryFormat>) + 0x8a (0x70dcec735dea in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: frame #19: awq_dequantize(at::Tensor, at::Tensor, at::Tensor, int, int, int) + 0x249 (0x70dcec759609 in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: frame #20: <unknown function> + 0xf5449 (0x70dcec74f449 in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: frame #21: <unknown function> + 0xf123d (0x70dcec74b23d in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: <omitting python frames>

@jeejeelee
Collaborator

@emillykkejensen It seems the error is triggered by AWQ; it's possible that the AWQ kernels only support SM80+. Have you tested LoRA with an FP16 model?
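
As an untested sketch of that suggestion, the server command posted earlier can be rerun with an unquantized FP16 checkpoint so the AWQ dequantize kernel is never reached; the model name and adapter path below are placeholders, not a verified configuration:

python -m vllm.entrypoints.openai.api_server \
    --model <an-unquantized-fp16-model> \
    --dtype half \
    --enable-lora \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --lora-modules sql-lora=/absolute/path/to/adapter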

@jeejeelee
Collaborator

You should clone my repo using:

git clone -b refactor-punica-kernel https://github.com/jeejeelee/vllm.git

@jeejeelee
Collaborator

@emillykkejensen I can run AWQ + LoRA properly on a TITAN RTX. FYI: https://github.com/vllm-project/vllm/blob/main/csrc/quantization/awq/dequantize.cuh#L18

@emillykkejensen

Hi again @jeejeelee

Sorry about that, you are 100% right! If I do the above but clone the correct branch (!!), it works.

Thanks for the fix; I hope it gets merged into master soon :)

@emillykkejensen

So I tried to build a local Docker image from your branch:

docker build -t my-vllm-image https://github.com/jeejeelee/vllm.git#refactor-punica-kernel

It seems vLLM starts and loads the model okay, but when I call it I get the following error:

/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING 06-13 11:09:54 config.py:192] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 06-13 11:09:54 llm_engine.py:103] Initializing an LLM engine (v0.4.2) with config: model='TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', speculative_config=None, tokenizer='TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-13 11:09:55 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-13 11:09:55 selector.py:51] Using XFormers backend.
INFO 06-13 11:09:56 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-13 11:09:56 selector.py:51] Using XFormers backend.
INFO 06-13 11:09:56 weight_utils.py:207] Using model weights format ['*.safetensors']
INFO 06-13 11:09:57 weight_utils.py:250] No model.safetensors.index.json found in remote.
INFO 06-13 11:10:08 model_runner.py:146] Loading model weights took 0.7370 GB
INFO 06-13 11:10:11 gpu_executor.py:83] # GPU blocks: 32795, # CPU blocks: 11915
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-13 11:10:14 serving_chat.py:83] No chat template provided. Chat API will not work.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-13 11:10:15 serving_embedding.py:131] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 06-13 11:10:25 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:10:35 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:10:45 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:10:55 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:11:05 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:11:15 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:11:23 async_llm_engine.py:545] Received request cmpl-ca79698496dd4702a6e821afaef7b588-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 3087, 8970, 338, 263], lora_request: LoRARequest(lora_name='sql-lora', lora_int_id=1, lora_local_path='jashing/tinyllama-colorist-lora/', long_lora_max_len=None).
WARNING 06-13 11:11:23 tokenizer.py:142] No tokenizer found in jashing/tinyllama-colorist-lora/, using base model tokenizer instead. (Exception: Incorrect path_or_model_id: 'jashing/tinyllama-colorist-lora/'. Please provide either the path to a local folder or the repo_id of a model on the Hub.)
ERROR 06-13 11:11:23 async_llm_engine.py:44] Engine background task failed
ERROR 06-13 11:11:23 async_llm_engine.py:44] Traceback (most recent call last):
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 174, in _load_lora
ERROR 06-13 11:11:23 async_llm_engine.py:44]     lora = self._lora_model_cls.from_local_checkpoint(
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 314, in from_local_checkpoint
ERROR 06-13 11:11:23 async_llm_engine.py:44]     with open(lora_config_path) as f:
ERROR 06-13 11:11:23 async_llm_engine.py:44] FileNotFoundError: [Errno 2] No such file or directory: 'jashing/tinyllama-colorist-lora/adapter_config.json'
ERROR 06-13 11:11:23 async_llm_engine.py:44] 
ERROR 06-13 11:11:23 async_llm_engine.py:44] The above exception was the direct cause of the following exception:
ERROR 06-13 11:11:23 async_llm_engine.py:44] 
ERROR 06-13 11:11:23 async_llm_engine.py:44] Traceback (most recent call last):
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 39, in _raise_exception_on_finish
ERROR 06-13 11:11:23 async_llm_engine.py:44]     task.result()
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 517, in run_engine_loop
ERROR 06-13 11:11:23 async_llm_engine.py:44]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 06-13 11:11:23 async_llm_engine.py:44]     return fut.result()
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 491, in engine_step
ERROR 06-13 11:11:23 async_llm_engine.py:44]     request_outputs = await self.engine.step_async()
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 225, in step_async
ERROR 06-13 11:11:23 async_llm_engine.py:44]     output = await self.model_executor.execute_model_async(
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
ERROR 06-13 11:11:23 async_llm_engine.py:44]     output = await make_async(self.driver_worker.execute_model
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 06-13 11:11:23 async_llm_engine.py:44]     result = self.fn(*self.args, **self.kwargs)
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-13 11:11:23 async_llm_engine.py:44]     return func(*args, **kwargs)
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
ERROR 06-13 11:11:23 async_llm_engine.py:44]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-13 11:11:23 async_llm_engine.py:44]     return func(*args, **kwargs)
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 689, in execute_model
ERROR 06-13 11:11:23 async_llm_engine.py:44]     self.set_active_loras(lora_requests, lora_mapping)
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 827, in set_active_loras
ERROR 06-13 11:11:23 async_llm_engine.py:44]     self.lora_manager.set_active_loras(lora_requests, lora_mapping)
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 137, in set_active_loras
ERROR 06-13 11:11:23 async_llm_engine.py:44]     self._apply_loras(lora_requests)
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 266, in _apply_loras
ERROR 06-13 11:11:23 async_llm_engine.py:44]     self.add_lora(lora)
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 274, in add_lora
ERROR 06-13 11:11:23 async_llm_engine.py:44]     lora = self._load_lora(lora_request)
ERROR 06-13 11:11:23 async_llm_engine.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 187, in _load_lora
ERROR 06-13 11:11:23 async_llm_engine.py:44]     raise RuntimeError(
ERROR 06-13 11:11:23 async_llm_engine.py:44] RuntimeError: Loading lora jashing/tinyllama-colorist-lora/ failed
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7328afd917e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7328a5023160>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7328afd917e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7328a5023160>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 174, in _load_lora
    lora = self._lora_model_cls.from_local_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 314, in from_local_checkpoint
    with open(lora_config_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'jashing/tinyllama-colorist-lora/adapter_config.json'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 39, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 517, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 491, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 225, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 689, in execute_model
    self.set_active_loras(lora_requests, lora_mapping)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 827, in set_active_loras
    self.lora_manager.set_active_loras(lora_requests, lora_mapping)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 137, in set_active_loras
    self._apply_loras(lora_requests)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 266, in _apply_loras
    self.add_lora(lora)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 274, in add_lora
    lora = self._load_lora(lora_request)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 187, in _load_lora
    raise RuntimeError(
RuntimeError: Loading lora jashing/tinyllama-colorist-lora/ failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 06-13 11:11:23 async_llm_engine.py:157] Aborted request cmpl-ca79698496dd4702a6e821afaef7b588-0.
INFO:     172.17.0.1:55552 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 174, in _load_lora
    lora = self._lora_model_cls.from_local_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 314, in from_local_checkpoint
    with open(lora_config_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'jashing/tinyllama-colorist-lora/adapter_config.json'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 118, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 155, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 241, in consumer
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 234, in consumer
    raise item
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 218, in producer
    async for item in iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 662, in generate
    async for output in self.process_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 780, in process_request
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 776, in process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 79, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 39, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 517, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 491, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 225, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 689, in execute_model
    self.set_active_loras(lora_requests, lora_mapping)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 827, in set_active_loras
    self.lora_manager.set_active_loras(lora_requests, lora_mapping)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 137, in set_active_loras
    self._apply_loras(lora_requests)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 266, in _apply_loras
    self.add_lora(lora)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 274, in add_lora
    lora = self._load_lora(lora_request)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 187, in _load_lora
    raise RuntimeError(
RuntimeError: Loading lora jashing/tinyllama-colorist-lora/ failed

@jeejeelee
Collaborator

@emillykkejensen

FileNotFoundError: [Errno 2] No such file or directory: 'jashing/tinyllama-colorist-lora/adapter_config.json'
ERROR 06-13 11:11:23 async_llm_engine.py:44] 

Maybe you can try passing the LoRA path as a local absolute path.
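
For example, one way to turn that Hub repo id into a local absolute path is to download the adapter first. This is only a sketch; it assumes the huggingface_hub package is installed and reuses the repo id from the command above:

from huggingface_hub import snapshot_download

# Download the adapter repo and print its local absolute path.
local_lora_path = snapshot_download(repo_id="jashing/tinyllama-colorist-lora")
print(local_lora_path)

# Then start the server with, e.g.:
#   --lora-modules sql-lora=<the printed absolute path>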

@xz259

xz259 commented Jun 21, 2024

@jeejeelee Hi, thank you so much for your work! If I just want to run LoRA on a T4, which of your previous commits should I build from?

@jeejeelee
Collaborator

@jeejeelee Hi, thank you so much for your work! If I just want to run LoRA on a T4, which of your previous commits should I build from?

You can build from the latest commit. If you have any questions, please feel free to contact me.
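
Putting together the steps already posted in this thread, the build would look roughly like this (the branch name and environment variable are taken from the earlier comments and are not re-verified here):

git clone -b refactor-punica-kernel https://github.com/jeejeelee/vllm.git
cd vllm
export VLLM_INSTALL_PUNICA_KERNELS=1  # build the Punica LoRA kernels from source
pip install -e .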

@naturomics

The same problem occurs when applying a LoRA to chatglm3-6b on a T4 GPU:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 353, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 76, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 173, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 874, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1243, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 371, in forward
[rank0]:     hidden_states = self.transformer(input_ids, positions, kv_caches,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 319, in forward
[rank0]:     hidden_states = self.encoder(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 274, in forward
[rank0]:     hidden_states = layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 209, in forward
[rank0]:     attention_output = self.self_attention(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 108, in forward
[rank0]:     context_layer = self.attn(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 94, in forward
[rank0]:     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 279, in forward
[rank0]:     output = torch.empty_like(query)
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@jeejeelee
Collaborator

Hi @naturomics, you can try #5036. It should be able to address your issue.
