
[Performance] 40% performance drop using lora vs no lora #2829

Closed

brucethemoose opened this issue Feb 10, 2024 · 8 comments

Comments

@brucethemoose

brucethemoose commented Feb 10, 2024

...
Processed prompts: 100% 1321/1321 [09:09<00:00,  2.40it/s]
training  | ic| "Testing '" + lora + "'": "Testing 'models/v3/checkpoint-1000'"
Processed prompts: 100% 1321/1321 [16:01<00:00,  1.37it/s]
training  | ic| "Lora '" + lora + "' testing complete.": "Lora 'models/v3/checkpoint-1000' testing complete."
training  | ic| "Testing '" + lora + "'": "Testing 'models/v3/final_lora'"
Processed prompts: 100% 1321/1321 [15:59<00:00,  1.38it/s]
training  | ic| "Lora '" + lora + "' testing complete.": "Lora 'models/v3/final_lora' testing complete."
training  | ic| "Testing '" + lora + "'": "Testing 'models/v3/checkpoint-2000'"
...

Testing offline batched inference with a large set of long-context, short-reply prompts, feeding the vLLM engine all 1321 prompts as a list in a single batch. Tested on an 80GB A100.
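
A harness along these lines might look like the sketch below (a minimal sketch, not the author's actual script; the base model name and the load_prompts() helper are placeholders, and the LoRA paths are taken from the log above):

```python
import time

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder base model; enable_lora=True turns on vLLM's multi-LoRA support.
llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)
params = SamplingParams(max_tokens=64)  # short replies

prompts = load_prompts()  # hypothetical helper returning ~1321 long-context prompts


def throughput(lora_request=None):
    # Feed the whole prompt list to the engine as one batch and
    # report prompts processed per second.
    start = time.perf_counter()
    llm.generate(prompts, params, lora_request=lora_request)
    return len(prompts) / (time.perf_counter() - start)


base = throughput()  # baseline, no LoRA
for i, path in enumerate(["models/v3/checkpoint-1000", "models/v3/final_lora"], 1):
    rate = throughput(LoRARequest(path, i, path))
    print(f"{path}: {rate:.2f} it/s vs {base:.2f} it/s without LoRA")
```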

This isn't a bug per se, or a major problem, more like a... note. I guess a big performance drop for LoRAs is expected?
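
(The headline figure follows directly from the log above: 2.40 it/s with no LoRA vs ~1.37 it/s with a LoRA, i.e. (2.40 - 1.37) / 2.40 ≈ 43% slower.)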

@Yard1
Collaborator

Yard1 commented Feb 12, 2024

With the current kernels, it's expected that they won't handle the long-prompt/short-reply case very well. In general, a performance drop of around 20% is expected.

@sfc-gh-ybsat

@Yard1 When do you expect Punica to ship the SGMV kernels? Looking at their GitHub repo, it seems no progress has been made in the past month.

@Yard1
Collaborator

Yard1 commented Feb 17, 2024

I think we can look into whether we can use the existing SGMV kernels in currently released punica!

@chenqianfzh
Contributor

Adding #2893 to this thread, since some info about the SGMV & BGMV kernels is shared there.

@findalexli

@Yard1 How is the 20% drop calculated?

@mgoin
Member

mgoin commented Aug 2, 2024

This should be lessened by about 2x with the newly landed Triton kernels in #5036.

@mgoin mgoin closed this as completed Aug 2, 2024
@colefranks

colefranks commented Aug 2, 2024

@mgoin I see you have closed the issue, can you share your numbers? I am wondering how much the long-prompt/short output case has improved. The numbers in #5036 don't really cover this case as far as I can tell.

@jeejeelee
Collaborator

> @mgoin I see you have closed the issue, can you share your numbers? I am wondering how much the long-prompt/short output case has improved. The numbers in #5036 don't really cover this case as far as I can tell.

@colefranks Could you please provide more detailed information, such as the model and LoRA rank used?
