Will Vllm Lora support SGMV in handling multi-Lora request? #2893

Closed
chenqianfzh opened this issue Feb 16, 2024 · 2 comments

Comments

@chenqianfzh
Contributor

In the multi-LoRA feature, vLLM uses the BGMV kernel from punica.

Yet in the punica project (https://github.com/punica-ai/punica), the authors say that SGMV (Segmented Gather Matrix-Vector multiplication) is more flexible. Is there a plan to support SGMV in the community?

@Darinochka

There is an answer to your question in #1804:

We are using BGMV kernels instead of new SGMV kernels from punica. The BGMV kernel is not efficient for prefill, but the current SGMV CUTLASS-based kernel is not configurable enough and suffers from accuracy drops due to the intermediate output being stored in half-precision. Once punica updates with custom, non-CUTLASS SGMV kernels, I will update the code to make use of them.
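For readers less familiar with the terminology, here is a rough PyTorch sketch of the semantics the two kernel families implement. This is not the actual punica CUDA code, the function names are hypothetical, and it shows only a single LoRA matmul step: BGMV gathers a per-token adapter matrix and does one matrix-vector product per token, while SGMV groups tokens that share an adapter into contiguous segments so each segment becomes a single GEMM, which is why it suits prefill (many tokens per request, same adapter) better.

```python
import torch

def bgmv_reference(x, weights, indices):
    """Reference semantics of a BGMV (Batched Gather Matrix-Vector) kernel.

    x:       (batch, in_features)                      one vector per token
    weights: (num_loras, in_features, out_features)    stacked LoRA matrices
    indices: (batch,)                                  adapter id per token
    Returns: (batch, out_features)
    """
    # Each token gathers its own adapter matrix and does a matrix-vector product.
    gathered = weights[indices]                    # (batch, in, out)
    return torch.einsum("bi,bio->bo", x, gathered)

def sgmv_reference(x, weights, seg_starts, seg_ends, lora_ids):
    """Reference semantics of an SGMV (Segmented Gather Matrix-Vector) kernel.

    Tokens are pre-sorted into contiguous segments that share one adapter, so
    each segment reduces to a single matrix-matrix product.
    """
    out = x.new_zeros(x.shape[0], weights.shape[-1])
    for start, end, lid in zip(seg_starts, seg_ends, lora_ids):
        out[start:end] = x[start:end] @ weights[lid]   # one GEMM per segment
    return out
```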

@chenqianfzh
Contributor Author

Thanks for the reply!
