[Performance] 40% performance drop using lora vs no lora #2829
With the current kernels, it's expected that they will not handle the long-prompt, short-reply case very well. In general, a performance drop of around 20% is expected.
@Yard1 When do you expect Punica to ship the SGMV kernels? Looking at their GitHub repo, it seems no progress has been made in the past month.
I think we can look into whether we can use the existing SGMV kernels in the currently released Punica!
Adding that thread here, as some info about SGMV & BGMV is shared in it.
@Yard1 How is the 20% drop calculated?
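For reference, a relative drop like the 40% in the title is usually computed from measured throughput with and without the adapter. A minimal sketch with hypothetical numbers (not taken from this issue):

```python
# Hypothetical throughputs in tokens/s; substitute your own measurements.
tput_base = 1000.0  # base model, no LoRA
tput_lora = 600.0   # same workload with LoRA enabled

drop = 1.0 - tput_lora / tput_base
print(f"relative drop: {drop:.0%}")  # -> 40%
```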
This should be lessened by about 2x by the newly landed Triton kernels (#5036).
@colefranks Could you please provide more detailed information, such as the model and LoRA rank used?
Testing offline batched inference with a large set of long-context, short-reply prompts, feeding the vLLM engine all 1321 prompts as a list in a single batch. Tested on an 80GB A100.
This isn't a bug per se, or a major problem; it's more of a note. I guess a large performance drop with LoRAs is expected?
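For context, a minimal sketch of the kind of offline batched run described above. The base model, adapter path, and prompt contents are hypothetical, since the issue does not name them:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Hypothetical base model; the issue does not specify which was used.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)  # short replies

# Stand-in for the 1321 long-context, short-reply prompts described above.
prompts = ["<long context here> Question: ..."] * 1321

# With LoRA: pass a LoRARequest alongside the whole batch.
lora_outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora"),  # hypothetical path
)

# Without LoRA: same call, no lora_request. Comparing wall-clock time of
# the two runs gives the relative drop discussed in this thread.
base_outputs = llm.generate(prompts, sampling_params)
```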