Multi-LoRA - Support for providing /load and /unload API #3308
Comments
I'm open to this because I anticipate it will be helpful for production use cases. PRs are welcome with changes to the OpenAI API server, with routes starting with …
Let me draft a PR.
I have seen #3446 pop up.
@simon-mo there is an OpenAI API to delete a fine-tuned model (https://platform.openai.com/docs/api-reference/models/delete); should we adopt this API?
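For reference, the OpenAI endpoint linked above is a plain DELETE on /v1/models/{model}. A minimal sketch of calling it (the model name and API key below are placeholders, not real values):

```python
import requests

# Delete a fine-tuned model via the OpenAI API (shape per the linked docs).
# The API key and model name are placeholders for illustration only.
API_KEY = "sk-..."
MODEL = "ft:gpt-3.5-turbo:my-org:custom-suffix:id"

resp = requests.delete(
    f"https://api.openai.com/v1/models/{MODEL}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(resp.json())  # e.g. {"id": "...", "object": "model", "deleted": true}
```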
So if I want to run inference with a LoRA model, I need to invoke the …
I created a feature request, but for multiple models; I think the endpoint can be reused.
Closing as completed by #6566.
Problem statement:
In a production system, there should be an API to add/remove fine-tuned weights dynamically, so that the inference caller does not have to specify the LoRA location with each call.
The current Multi-LoRA support loads adapters during inference calls and does not check whether the fine-tuned weights are already loaded and ready for inference.
Proposal:
Introduce two endpoints, /load and /unload, to manage fine-tuned weights in vLLM.
POST /load
-> adds a fine-tuned weight to the models list.
POST /unload
-> removes a fine-tuned weight from the models list.
This keeps track of the set of fine-tuned weights present on the vLLM server, so there is no need to specify fine-tuned weight names and locations as part of each inference request.
Sample code:
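A minimal sketch of the intended flow, assuming a local vLLM OpenAI-compatible server. The /load and /unload routes follow the proposal above; the payload field names (lora_name, lora_path), the adapter name, and the server address are assumptions for illustration only:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local vLLM server address

# Register a fine-tuned (LoRA) adapter once, ahead of any inference requests.
# Field names are hypothetical; the proposal only specifies the /load route.
requests.post(
    f"{BASE_URL}/load",
    json={"lora_name": "sql-adapter", "lora_path": "/path/to/sql-lora"},
)

# Inference requests can now reference the adapter by name only,
# without passing its location each time.
resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={"model": "sql-adapter", "prompt": "SELECT"},
)
print(resp.json())

# Remove the adapter once it is no longer needed.
requests.post(f"{BASE_URL}/unload", json={"lora_name": "sql-adapter"})
```

The implementation that eventually landed via #6566 may use different route names and payloads than this sketch.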