Add multi-LoRA support #1804
Conversation
---------

Co-authored-by: Chen Shen <[email protected]>
Co-authored-by: Shreyas Krishnaswamy <[email protected]>
Co-authored-by: Avnish Narayan <[email protected]>
The non-CUTLASS SGMV kernels would be very beneficial for any future ROCm support of the kernel. I am looking forward to it. |
I'm getting an error with the following code:
from vllm import LLM

model = LLM(
    "mistralai/Mistral-7B-v0.1",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    enable_lora=True,
)
Am I missing any steps here? |
@sidnb13 Thanks for the report! The commit I just pushed ( |
@Yard1 Thanks for fixing! Running into a CUDA OOM error this time with the same code: |
@sidnb13 can you try reducing |
@Yard1 Thanks, was able to get inference working by reducing the default |
The increased memory usage is expected due to the current design requiring a preallocated LoRA tensor for every possible sequence. I will be looking into removing that requirement soon (so you can have e.g. 32 LoRAs but 256 max sequences). |
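For reference, a minimal sketch of the knobs being discussed in this thread (the LoRA-specific arguments are the ones introduced by this PR; exact names, defaults, and values here are assumptions if you are on a different revision). Lowering max_num_seqs shrinks the per-sequence LoRA buffers described above:

```python
from vllm import LLM

# Sketch only: argument names follow this PR's engine arguments and may differ in later releases.
llm = LLM(
    "mistralai/Mistral-7B-v0.1",
    enable_lora=True,
    gpu_memory_utilization=0.8,  # cap on the fraction of GPU memory vLLM will claim
    max_num_seqs=64,             # fewer concurrent sequences -> smaller preallocated LoRA buffers
    max_lora_rank=16,            # size the buffers for the largest adapter rank you plan to serve
    max_loras=4,                 # LoRA slots kept on the GPU at once (see the scheduler note below)
)
```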
I'm also running into errors installing from source with the latest commit. This happens with both |
@sidnb13 should be good now |
Added a very simple scheduler modification to allow for the number of LoRA slots to be smaller than the number of maximum sequences in a batch. Note that the resulting policy is not fair and can lead to starvation of certain LoRAs - it should be improved in the future. |
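A rough sketch of that scheduling idea, purely illustrative and not the actual vLLM scheduler code: greedily admit sequences while capping the number of distinct LoRAs in the batch, and push back anything that would need an extra slot. The greedy push-back is exactly why the policy is unfair and can starve adapters.

```python
from collections import deque

def schedule_batch(waiting, max_num_seqs, max_loras):
    """Illustrative sketch only (not the actual vLLM scheduler).

    `waiting` is a deque of (seq_id, lora_id) pairs; lora_id is None for base-model
    requests. At most `max_loras` distinct adapters are admitted per batch; sequences
    that would need an extra LoRA slot are pushed back, which is why this greedy
    policy is not fair and can starve rarely requested adapters.
    """
    batch = []
    active_loras = set()
    deferred = deque()
    while waiting and len(batch) < max_num_seqs:
        seq_id, lora_id = waiting.popleft()
        needs_new_slot = lora_id is not None and lora_id not in active_loras
        if needs_new_slot and len(active_loras) >= max_loras:
            deferred.append((seq_id, lora_id))  # no free LoRA slot in this step
            continue
        if lora_id is not None:
            active_loras.add(lora_id)
        batch.append(seq_id)
    waiting.extendleft(reversed(deferred))  # retry the deferred sequences next step
    return batch
```

For example, with max_num_seqs=4 and max_loras=1, a queue mixing two adapters will only ever batch one adapter at a time, so a busy adapter can keep crowding out the other.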
@WoosukKwon @zhuohan123 @Yard1 Looking forward to merging it into main. I would like to use this feature now. Thank you |
Nice job! |
@cgq0816 Buddy, have you gotten this service deployed? How did you handle it? |
Thank you, I have solved this problem.
@oushu1zhangxiangxuan1 commented on this pull request.
In vllm/lora/request.py:
@@ -0,0 +1,31 @@
+from dataclasses import dataclass
+
+
+@dataclass
+class LoRARequest:
Is integration with the OpenAI server in progress? If not, I'm willing to help.
|
Nice work! A question on this note:
Does that mean hot swaps / dynamic runtime LoRA adapter load/unload, as seen in lorax, are not supported yet? If not, is there any plan to support this? It can be a pretty useful feature for improving a model seamlessly and continuously in a production always-on setting. Thanks! |
@Peilun-Li it can hotswap provided all the files are present on disk (we just don't implement the download part, everything else is there) |
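To make the hot-swap point concrete, here is a sketch using the LoRARequest class added in this PR; the adapter names and local paths are placeholders, and it assumes the adapter files already exist on disk (the download step vLLM does not handle for you):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Sketch only: adapter names and local paths below are placeholders.
llm = LLM("mistralai/Mistral-7B-v0.1", enable_lora=True)
params = SamplingParams(max_tokens=64)

# Serve today's adapter; its files just need to be present on local disk.
out_v1 = llm.generate(
    ["Summarize the ticket:"],
    params,
    lora_request=LoRARequest("support-v1", 1, "/adapters/support-v1"),
)

# Later, point new requests at the re-fine-tuned adapter directory; no server restart needed.
out_v2 = llm.generate(
    ["Summarize the ticket:"],
    params,
    lora_request=LoRARequest("support-v2", 2, "/adapters/support-v2"),
)
```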
Cool, yeah, I'm envisioning a cyclic model-improvement lifecycle where, e.g., every once in a while (day/week/etc.) we collect production output of LoRA adapter v_x, combine it with potential human feedback to re-fine-tune a v_{x+1}, and hot swap that in to replace v_x. Essentially a time dimension: a certain adapter may not exist at server deployment time but gets incorporated at a future runtime. Looks like it's mostly possible with just some peripheral wiring. Thanks for the context! |
We're a platform for this type of continuous improvement lifecycle (optionally personalized per user) at xler.ai! We'd love to get you access and hear your feedback! |
Where can I find docs for using the hot swaps? |
Should the documentation be updated to specify which architectures are supported for multi-LoRA? |
@cgq0816 Brother, have you resolved that error? I'm running into the same problem. |
…ect#2762) and multi-LoRA support (vllm-project#1804) (vllm-project#3263)
#3316: is this normal? |
This PR adds support for running multiple LoRA adapters in a single batch in a similar fashion to the S-LoRA/punica projects.
WIP:

Features:
- torch.compile/CUDA graphs
- max_num_seqs >= max_loras

Limitations and possible improvements:
No changes have been made to the scheduler, meaning that we need to have GPU space for as many LoRAs as there are possible sequences in a batch. This should not be an issue in practice for small batch sizes, but may become problematic for larger ones. I will look into fixing this.
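As an illustration of what running multiple adapters in one batch looks like at the API level, here is a hedged sketch against the engine interface touched by this PR (model choice, adapter names, and adapter paths are placeholders; call signatures may have shifted in later releases):

```python
from vllm import EngineArgs, LLMEngine, SamplingParams
from vllm.lora.request import LoRARequest

# Sketch only: model, adapter names, and adapter paths are placeholders.
engine = LLMEngine.from_engine_args(
    EngineArgs(model="mistralai/Mistral-7B-v0.1", enable_lora=True, max_loras=2)
)
params = SamplingParams(max_tokens=32)

# Two requests pointing at different adapters; both can be scheduled in the same step.
engine.add_request("req-0", "Translate to French: hello",
                   params, lora_request=LoRARequest("lora-fr", 1, "/adapters/fr"))
engine.add_request("req-1", "Translate to German: hello",
                   params, lora_request=LoRARequest("lora-de", 2, "/adapters/de"))

while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.request_id, output.outputs[0].text)
```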