Add multi-LoRA support #1804
Conversation
---------

Co-authored-by: Chen Shen <[email protected]>
Co-authored-by: Shreyas Krishnaswamy <[email protected]>
Co-authored-by: Avnish Narayan <[email protected]>
The non-CUTLASS SGMV kernels would be very beneficial for any future ROCm support of the kernel. I am looking forward to it. |
I'm getting an error with the following code:
from vllm import LLM

model = LLM(
    "mistralai/Mistral-7B-v0.1",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    enable_lora=True,
)
Am I missing any steps here? |
@sidnb13 Thanks for the report! The commit I just pushed ( |
@Yard1 Thanks for fixing! Running into a CUDA OOM error this time with the same code: |
@sidnb13 can you try reducing |
@Yard1 Thanks, was able to get inference working by reducing the default |
The increased memory usage is expected due to the current design requiring a preallocated LoRA tensor for every possible sequence. I will be looking into removing that requirement soon (so you can have e.g. 32 LoRAs but 256 max sequences). |
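For reference, a minimal sketch of the knobs being discussed in this thread (the LoRA-specific arguments are the ones introduced by this PR; exact names, defaults, and values here are assumptions if you are on a different revision). Lowering max_num_seqs shrinks the per-sequence LoRA buffers described above:

```python
from vllm import LLM

# Sketch only: argument names follow this PR's engine arguments and may differ in later releases.
llm = LLM(
    "mistralai/Mistral-7B-v0.1",
    enable_lora=True,
    gpu_memory_utilization=0.8,  # cap on the fraction of GPU memory vLLM will claim
    max_num_seqs=64,             # fewer concurrent sequences -> smaller preallocated LoRA buffers
    max_lora_rank=16,            # size the buffers for the largest adapter rank you plan to serve
    max_loras=4,                 # LoRA slots kept on the GPU at once (see the scheduler note below)
)
```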
I'm also running into errors installing from source with the latest commit. This happens with both |
@sidnb13 should be good now |
Added a very simple scheduler modification to allow for the number of LoRA slots to be smaller than the number of maximum sequences in a batch. Note that the resulting policy is not fair and can lead to starvation of certain LoRAs - it should be improved in the future. |
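A rough sketch of that scheduling idea, purely illustrative and not the actual vLLM scheduler code: greedily admit sequences while capping the number of distinct LoRAs in the batch, and push back anything that would need an extra slot. The greedy push-back is exactly why the policy is unfair and can starve adapters.

```python
from collections import deque

def schedule_batch(waiting, max_num_seqs, max_loras):
    """Illustrative sketch only (not the actual vLLM scheduler).

    `waiting` is a deque of (seq_id, lora_id) pairs; lora_id is None for base-model
    requests. At most `max_loras` distinct adapters are admitted per batch; sequences
    that would need an extra LoRA slot are pushed back, which is why this greedy
    policy is not fair and can starve rarely requested adapters.
    """
    batch = []
    active_loras = set()
    deferred = deque()
    while waiting and len(batch) < max_num_seqs:
        seq_id, lora_id = waiting.popleft()
        needs_new_slot = lora_id is not None and lora_id not in active_loras
        if needs_new_slot and len(active_loras) >= max_loras:
            deferred.append((seq_id, lora_id))  # no free LoRA slot in this step
            continue
        if lora_id is not None:
            active_loras.add(lora_id)
        batch.append(seq_id)
    waiting.extendleft(reversed(deferred))  # retry the deferred sequences next step
    return batch
```

For example, with max_num_seqs=4 and max_loras=1, a queue mixing two adapters will only ever batch one adapter at a time, so a busy adapter can keep crowding out the other.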
@WoosukKwon @zhuohan123 @Yard1 Looking forward to merging it into main. I would like to use this feature now. Thank you |
Nice job! |
@cgq0816 Buddy, have you gotten this service deployed? How did you handle it? |
Thank you, I have solved this problem.
@oushu1zhangxiangxuan1 commented on this pull request.
In vllm/lora/request.py:
@@ -0,0 +1,31 @@
+from dataclasses import dataclass
+
+
+@dataclass
+class LoRARequest:
Is integration with the OpenAI server in progress? If not, I'm willing to help.
|
Nice work! A question on this note:
Does that mean hot swaps / dynamic runtime LoRA adapter load/unload, as seen in lorax, are not supported yet? If not, is there any plan to support this? It can be a pretty useful feature for improving a model seamlessly and continuously in a production always-on setting. Thanks! |
@Peilun-Li it can hotswap provided all the files are present on disk (we just don't implement the download part, everything else is there) |
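To make the hot-swap point concrete, here is a sketch using the LoRARequest class added in this PR; the adapter names and local paths are placeholders, and it assumes the adapter files already exist on disk (the download step vLLM does not handle for you):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Sketch only: adapter names and local paths below are placeholders.
llm = LLM("mistralai/Mistral-7B-v0.1", enable_lora=True)
params = SamplingParams(max_tokens=64)

# Serve today's adapter; its files just need to be present on local disk.
out_v1 = llm.generate(
    ["Summarize the ticket:"],
    params,
    lora_request=LoRARequest("support-v1", 1, "/adapters/support-v1"),
)

# Later, point new requests at the re-fine-tuned adapter directory; no server restart needed.
out_v2 = llm.generate(
    ["Summarize the ticket:"],
    params,
    lora_request=LoRARequest("support-v2", 2, "/adapters/support-v2"),
)
```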
Cool, yeah, I'm envisioning a cyclic model-improvement lifecycle where, e.g., every once in a while (day/week/etc.) we collect production output of LoRA adapter v_x, combine it with potential human feedback to re-fine-tune a v_{x+1}, and hot swap that in to replace v_x. Essentially a time dimension: a certain adapter may not exist at server deployment time but gets incorporated at a future runtime. Looks like it's mostly possible with just some peripheral wiring. Thanks for the context! |
We're a platform for this type of continuous improvement lifecycle (optionally personalized per user) at xler.ai! We'd love to get you access and hear your feedback! |
Where can I find docs for using the hot swaps? |
Should the documentation be updated to specify which architectures are supported for multi-LoRA? |
@cgq0816 Brother, have you resolved that error? I'm running into the same problem. |
…ect#2762) and multi-LoRA support (vllm-project#1804) (vllm-project#3263)
#3316: is this normal? |
This PR adds support for running multiple LoRA adapters in a single batch in a similar fashion to the S-LoRA/punica projects.
WIP:

Features:
- torch.compile/CUDA graphs
- max_num_seqs >= max_loras

Limitations and possible improvements:
No changes have been made to the scheduler, meaning that we need to have GPU space for as many LoRAs as there are possible sequences in a batch. This should not be an issue in practice for small batch sizes, but may become problematic for larger ones. I will look into fixing this.
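As an illustration of what running multiple adapters in one batch looks like at the API level, here is a hedged sketch against the engine interface touched by this PR (model choice, adapter names, and adapter paths are placeholders; call signatures may have shifted in later releases):

```python
from vllm import EngineArgs, LLMEngine, SamplingParams
from vllm.lora.request import LoRARequest

# Sketch only: model, adapter names, and adapter paths are placeholders.
engine = LLMEngine.from_engine_args(
    EngineArgs(model="mistralai/Mistral-7B-v0.1", enable_lora=True, max_loras=2)
)
params = SamplingParams(max_tokens=32)

# Two requests pointing at different adapters; both can be scheduled in the same step.
engine.add_request("req-0", "Translate to French: hello",
                   params, lora_request=LoRARequest("lora-fr", 1, "/adapters/fr"))
engine.add_request("req-1", "Translate to German: hello",
                   params, lora_request=LoRARequest("lora-de", 2, "/adapters/de"))

while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.request_id, output.outputs[0].text)
```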