
Multi-LoRA - Support for providing /load and /unload API #3308

Closed
gauravkr2108 opened this issue Mar 11, 2024 · 8 comments · May be fixed by #3496

Comments

@gauravkr2108

gauravkr2108 commented Mar 11, 2024

Problem statement:

In a production system, there should be an API to add/remove fine-tuned weights dynamically. The inference caller should not have to specify the LoRA location with each call.

Current Multi-LoRA support loads adapters during inference calls, and does not check whether the fine-tuned weights are already loaded and ready for inference.

Proposal:

Introduce /load and /unload APIs to allow fine-tuned weights to be registered with vLLM.

POST /load -> add fine-tuned weights to the set of served models.
POST /unload -> remove fine-tuned weights from the models list.

This keeps the set of fine-tuned weights resident in the vLLM server, so there is no need to specify fine-tuned weight names and locations as part of each inference request.

Sample code:

from fastapi import FastAPI, Request, Response
from vllm.lora.request import LoRARequest

app = FastAPI()

# Currently registered LoRA adapter (None means no adapter is loaded).
lora_request = None
index = 1


@app.post("/load")
async def load(request: Request) -> Response:
    """Register a LoRA adapter so later inference calls can reuse it."""
    global lora_request, index

    request_dict = await request.json()
    lora_local_path = request_dict.pop("lora_path", "/models/lora/")

    # Build the LoRARequest that the engine will use for inference.
    lora_request = LoRARequest(
        lora_name=lora_local_path,
        lora_int_id=index,
        lora_local_path=lora_local_path)

    index += 1
    return Response(status_code=201)


@app.post("/unload")
async def unload(request: Request) -> Response:
    """Drop the currently registered LoRA adapter."""
    global lora_request, index

    lora_request = None
    if index > 1:
        index -= 1

    return Response(status_code=201)
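
For illustration, here is a minimal client-side sketch of the intended flow. The server URL, adapter path, and completion payload are assumptions for the example, not part of the proposal itself.

import requests

BASE = "http://localhost:8000"  # assumed address of the vLLM server above

# Register a fine-tuned adapter once, up front.
requests.post(f"{BASE}/load", json={"lora_path": "/models/lora/sql-adapter"})

# Later inference calls no longer need to carry the LoRA name or path.
resp = requests.post(f"{BASE}/v1/completions",
                     json={"model": "base-model", "prompt": "SELECT"})
print(resp.json())

# Remove the adapter when it is no longer needed.
requests.post(f"{BASE}/unload", json={})
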
@simon-mo
Collaborator

I'm open to this because I anticipate it will be helpful for production use cases. PRs are welcome with changes to the OpenAI API server, with routes starting with /-/ to indicate a private API, such as PUT /-/lora_cache and DELETE /-/lora_cache.
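
A rough sketch of what those private routes could look like, assuming a FastAPI app and an in-memory registry; the payload fields (lora_name, lora_path) and the registry itself are assumptions for illustration, not an agreed design.

from fastapi import FastAPI, Request, Response

app = FastAPI()

# Illustrative in-memory registry: lora_name -> local path.
lora_cache: dict[str, str] = {}


@app.put("/-/lora_cache")
async def put_lora_cache(request: Request) -> Response:
    body = await request.json()
    # Register (or overwrite) an adapter under its name.
    lora_cache[body["lora_name"]] = body["lora_path"]
    return Response(status_code=201)


@app.delete("/-/lora_cache")
async def delete_lora_cache(request: Request) -> Response:
    body = await request.json()
    # Remove the adapter if present; ignore unknown names.
    lora_cache.pop(body["lora_name"], None)
    return Response(status_code=204)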

@gauravkr2108
Author

Let me draft a PR.

@simon-mo
Collaborator

I have seen #3446 pop up

@gauravkr2108
Author

@simon-mo we need both the add and delete operations; I can work with #3446 to add the delete operation.

@gauravkr2108
Author

@simon-mo there is an OpenAI API to delete a fine-tuned model (https://platform.openai.com/docs/api-reference/models/delete); should we adopt this API?
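
For reference, that endpoint is DELETE /v1/models/{model} and returns the model id with a deleted flag. A sketch of a vLLM-side route mirroring that shape, with an illustrative in-memory registry standing in for the engine's adapter state:

from fastapi import FastAPI

app = FastAPI()

# Illustrative registry standing in for the engine's loaded adapters.
loaded_adapters: dict[str, str] = {}


@app.delete("/v1/models/{model}")
async def delete_model(model: str) -> dict:
    # Drop the adapter if it is registered.
    loaded_adapters.pop(model, None)
    # Response shape follows the OpenAI models/delete reference.
    return {"id": model, "object": "model", "deleted": True}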

@thincal

thincal commented Mar 23, 2024

So if I want to run inference with a LoRA model, I need to invoke /load first and then send the inference request with that LoRA model? If many vLLM engine instances are deployed and requests are load-balanced across them, how does this two-step design make sure the inference lands on an instance where that LoRA has already been loaded?

@lizzzcai

I created a feature request, but for multiple models; I think the endpoint can be reused.

@DarkLight1337
Member

Closing as completed by #6566
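
For anyone landing here later, a sketch of how the dynamic-loading endpoints can be called, assuming they are exposed as /v1/load_lora_adapter and /v1/unload_lora_adapter and that runtime LoRA updating is enabled on the server; check the vLLM docs and #6566 for the authoritative route names, flags, and payload fields.

import requests

BASE = "http://localhost:8000"  # assumed vLLM OpenAI-compatible server

# Load an adapter at runtime (endpoint name and payload assumed from #6566).
requests.post(f"{BASE}/v1/load_lora_adapter",
              json={"lora_name": "sql-adapter",
                    "lora_path": "/models/lora/sql-adapter"})

# Unload it again.
requests.post(f"{BASE}/v1/unload_lora_adapter",
              json={"lora_name": "sql-adapter"})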
