[Performance] 40% performance drop using lora vs no lora #2829
With the current kernels, it's expected that they will not handle the long-prompt, short-reply case very well. In general, a performance drop of around 20% is expected.
@Yard1 When do you expect Punica to ship the SGMV kernels? Looking at their GitHub repo, it seems no progress has been made in the past month.
I think we can look into whether we can use the existing SGMV kernels in the currently released Punica!
Adding that thread here, as some info about SGMV & BGMV is shared in it.
@Yard1 How is the 20% drop calculated?
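For reference, a relative drop like the 40% in the title is usually computed from measured throughput with and without the adapter. A minimal sketch with hypothetical numbers (not taken from this issue):

```python
# Hypothetical throughputs in tokens/s; substitute your own measurements.
tput_base = 1000.0  # base model, no LoRA
tput_lora = 600.0   # same workload with LoRA enabled

drop = 1.0 - tput_lora / tput_base
print(f"relative drop: {drop:.0%}")  # -> 40%
```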
This should be lessened by about 2x by the newly landed Triton kernels (#5036).
@colefranks Could you please provide more detailed information, such as the model and LoRA rank used?
Testing offline batched inference with a large set of long-context, short-reply prompts, feeding the vLLM engine all 1321 prompts as a list in a single batch. Tested on an 80GB A100.
This isn't a bug per se, or a major problem; it's more of a note. I guess a large performance drop with LoRAs is expected?
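For context, a minimal sketch of the kind of offline batched run described above. The base model, adapter path, and prompt contents are hypothetical, since the issue does not name them:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Hypothetical base model; the issue does not specify which was used.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)  # short replies

# Stand-in for the 1321 long-context, short-reply prompts described above.
prompts = ["<long context here> Question: ..."] * 1321

# With LoRA: pass a LoRARequest alongside the whole batch.
lora_outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora"),  # hypothetical path
)

# Without LoRA: same call, no lora_request. Comparing wall-clock time of
# the two runs gives the relative drop discussed in this thread.
base_outputs = llm.generate(prompts, sampling_params)
```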