Has anyone tested LoRA throughput? #3316
Comments
It's normal for the speed to increase after merging, because there is no need to perform the matrix multiplication between the input and the LoRA weights in each LoRA layer.
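For concreteness, here is a minimal PyTorch sketch of that difference; the dimensions, scales, and variable names are illustrative, not taken from this thread:

```python
import torch

d, r = 1024, 8                  # hypothetical hidden size and LoRA rank
x = torch.randn(1, d)           # one input activation
W = torch.randn(d, d)           # frozen base weight
A = torch.randn(r, d) * 0.01    # LoRA down-projection
B = torch.randn(d, r) * 0.01    # LoRA up-projection

# Unmerged path: every forward pass pays for two extra matmuls per LoRA layer.
y_unmerged = x @ W.T + (x @ A.T) @ B.T

# Merged path: fold the adapter into the base weight once, offline;
# per-token cost is then identical to the plain base model.
W_merged = W + B @ A
y_merged = x @ W_merged.T

assert torch.allclose(y_unmerged, y_merged, atol=1e-3)
```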
Thanks @Nipi64310 for the explanation. We are seeing a similar slowdown as well. It is a trade-off between convenience and speed for now.
After merging there is no overhead in memory or in computation. However, if you want to serve multiple subtasks, you will need to replicate the full model for each one.
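As a sketch of that merge step using Hugging Face PEFT (one common way to do it; the model id and adapter path below are placeholders, since the thread does not specify them):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# merge_and_unload() folds B @ A into the base weights and removes the adapter
# modules, so inference runs at base-model speed. The resulting checkpoint
# serves only this one subtask, which is the replication cost mentioned above.
merged = model.merge_and_unload()
merged.save_pretrained("qwen-1_8b-merged")
```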
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
I installed vLLM from the latest code and found it supports the Qwen2 series of models.
I tested Qwen-1.8B with 16 concurrent requests and got the following results:
With the LoRA weights merged into Qwen-1.8B, latency (ms):
min: 222, average: 400, max: 418
Without merging the LoRA weights, applying the LoRA adapter dynamically per query:
min: 307, average: 780, max: 874
The vLLM LoRA path is much slower than the merged version. Is this expected?
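For reference, a minimal sketch of the dynamic (unmerged) LoRA path being benchmarked here, using vLLM's offline API; the adapter name, adapter path, and prompt are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen-1_8B", enable_lora=True, trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=128)

# Each request carries a LoRARequest, so every LoRA layer still performs the
# extra x @ A @ B matmuls at inference time, which is where the latency gap
# relative to the merged model comes from.
outputs = llm.generate(
    ["What is LoRA?"],
    params,
    lora_request=LoRARequest("my-adapter", 1, "path/to/lora-adapter"),
)
print(outputs[0].outputs[0].text)
```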