When deploying LoRA with vLLM, suppose I have 1000 different LoRA models, and each LoRA receives a separate request with a different input. In this scenario, how many times does the base model actually perform inference? Is it only once, or does it perform 1000 inferences?
I understand that the LoRA part will run 1000 times, but its computational cost is relatively small. I'm mainly concerned about how many times the base model runs inference in this case. If the base model only runs once, that would be incredibly efficient, meaning that as the number of LoRA models increases, the overall efficiency would improve significantly. Is this possible?
Only once. If you want to delve deeper, see: #1804
That's fascinating! Does it batch requests for different LoRA models into a single batch, so the base model performs inference only once? I'm curious how this is achieved.
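Conceptually, yes: requests for different adapters share one batched base-model forward pass, and only the small low-rank corrections are computed per adapter. Here is a minimal NumPy sketch of that idea; the shapes and names are illustrative, and vLLM's actual implementation uses fused GPU kernels rather than this toy code:

```python
# Toy sketch of multi-LoRA batched inference for a single linear layer.
# All shapes/names here are illustrative, not vLLM's internals.
import numpy as np

rng = np.random.default_rng(0)
d, r, num_loras, batch = 16, 4, 3, 5

W = rng.normal(size=(d, d))             # shared base weight
A = rng.normal(size=(num_loras, d, r))  # per-adapter LoRA "A" matrices
B = rng.normal(size=(num_loras, r, d))  # per-adapter LoRA "B" matrices

x = rng.normal(size=(batch, d))          # one token per request
adapter_ids = np.array([0, 2, 1, 2, 0])  # which LoRA each request uses

# Base model: ONE matmul over the whole batch, regardless of how
# many distinct adapters the batch contains.
base_out = x @ W

# LoRA: a small per-request correction, gathered by adapter id.
lora_out = np.einsum('bd,bdr,bre->be', x, A[adapter_ids], B[adapter_ids])

y = base_out + lora_out

# Sanity check against the naive per-request computation.
for i, aid in enumerate(adapter_ids):
    expected = x[i] @ W + x[i] @ A[aid] @ B[aid]
    assert np.allclose(y[i], expected)
```

The key point is that the expensive term (`x @ W`) is independent of the adapter, so it is computed once for the whole mixed batch, while the adapter-specific term only involves the low-rank factors.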