[RFC]: Enhancing LoRA Management for Production Environments in vLLM #6275

Jeffwan · 2024-07-10T00:54:14Z

This RFC proposes improvements to the management of Low-Rank Adaptation (LoRA) in vLLM to make it more suitable for production environments. This proposal aims to address several pain points observed in the current implementation. Feedback and discussions are welcome, and we hope to gather input and refine the proposal based on community insights.

Motivation.

This RFC proposes improvements to the management of Low-Rank Adaptation (LoRA) in vLLM to make it more suitable for production environments. This proposal aims to address several pain points observed in the current implementation. Feedback and discussions are welcome, and we hope to gather input and refine the proposal based on community insights.

Motivation

LoRA integration in production environments faces several challenges that need to be addressed to ensure smooth and efficient deployment and management. The main issues observed include:

Visibility of LoRA Information: Currently, the relationship between LoRA and base models is not exposed clearly by the engine. The /v1/models endpoint does not display this information. Related issues: [Feature]: Expose Lora lineage information from /v1/models #6274
Dynamic Loading and Unloading: LoRA adapters cannot be dynamically loaded or unloaded after the server has started. Related issues: Multi-LoRA - Support for providing /load and /unload API #3308 [Feature]: Allow LoRA adapters to be specified as in-memory dict of tensors #4068 [Feature]: load/unload API to run multiple LLMs in a single GPU instance #5491
Remote Registry Support: LoRA adapters cannot be pulled from remote model repositories during runtime, making it cumbersome to manage artifacts locally. Related issues: [Feature]: Support loading lora adapters from HuggingFace in runtime #6233 [Bug]: relative path doesn't work for Lora adapter model #6231
Observability: There is a lack of metrics and observability enhancements related to LoRA, making it difficult to monitor and manage.
Cluster level Support: Information about LoRA is not easily accessible to resource managers, hindering support for service discovery, load balancing, and scheduling in cluster environments. Related issues: [RFC]: Add control panel support for vLLM #4873

Proposed Change.

1. Support Dynamically Loading or Unloading LoRA Adapters

To enhance flexibility and manageability, we propose introducing the ability to dynamically load and unload LoRA adapters at runtime.

Expose /v1/add_adapter and /v1/remove_adapter in api_server.py.
Introducing lazy and eager loading modes for LoRA adapters will provide more flexibility in deployment strategies. If lazy mode is selected, we can simply add lora to LoraRequest, otherwise, we should let the engine to load the lora via lora_manager explicitly.

2. Load LoRA Adapters from Remote Storage

Enabling LoRA adapters to be loaded from remote storage during runtime will simplify artifact management and deployment processes. The technical detail could be adding get_adapter_absolute_path ,

it can expand relative path
It can download hugging face models and return the snapshot path
Refactor the lora path reference from loral_local_path to local_path

3. Build Better LoRA Model Lineage

To improve the visibility and management of LoRA models, we propose building a more robust model lineage metadata. This system will:

Update LoRAParserAction to support json , we need to ask user to explicitly specify the base modelhttps://github.com/Jeffwan/vllm/blob/dd793d1de59b5efad25f4794b68cb935824c7a11/vllm/entrypoints/openai/cli_args.py#L16-L23
Introduce BaseModelPath to replace served_model_names https://github.com/Jeffwan/vllm/blob/dd793d1de59b5efad25f4794b68cb935824c7a11/vllm/entrypoints/openai/serving_engine.py#L33. It would be great to pass the model path and model names separately
Update show_available_models to update root and parent https://github.com/Jeffwan/vllm/blob/dd793d1de59b5efad25f4794b68cb935824c7a11/vllm/entrypoints/openai/serving_engine.py#L61-L62

4. Lora Observability enhancement

Improving observability by adding metrics specific to LoRA will help in better monitoring and management. Proposed metrics include:

Loading and unloading times for LoRA adapters.
Memory and compute resource usage by LoRA adapters.
Performance impact on base models when using LoRA adapters.

5. Control Plane support(service discovery, load balancing, scheduling) for Loras

Since vLLM community focus more on the inference engine, the cluster level features would be a separate design I am working on in Kubernetes WG-Serving. I will link back to this issue shortly.

PR List

Feedback Period.

No response

CC List.

@simon-mo @Yard1

Note: Please help tag the right person who worked in this area.

Any Other Things.

No response

The text was updated successfully, but these errors were encountered:

simon-mo · 2024-07-10T23:46:09Z

I'm in favor of all these! Please also make sure it is well documented.

Yard1 · 2024-07-11T00:35:59Z

Yes, this all makes sense. Let's make sure to ensure that performance doesn't degrade too much with loading from remote storage.

codybum · 2024-07-14T01:38:06Z

Yes, the issue you have noted prevent us from running vLLM. I would also include the ability to apply (merge) more than one adapter simultaneously to a single request.

I am looking forward to these features making itself into vLLM.

lizzzcai · 2024-07-26T06:45:04Z

Hi Jeff,

Thank you for sharing the RFC on Lora. I noticed my feature request was included, which is appreciated. Want to check whether there are plans to implement the load/unload API for the base model? Thanks in advance for your attention to this matter.

llama-shepard · 2024-08-05T16:06:39Z

I would love to add the following feature in this RFC.

LOAD ADAPTERS FROM S3 SUPPORTED STORAGES

LoRAX has this feature https://loraexchange.ai/models/adapters/#s3

This brings a new challenge in vLLM. Including type of the source (huggingface or s3). This is handled in LoRAX by providing the default 'adapter-source'.

It needs to support storages which supports S3 schema (like Cloudflare R2) https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/sources/s3.py

--env "R2_ACCOUNT_ID={r2_account_id}" --env "AWS_ACCESS_KEY_ID={aws_access_key_id}" --env "AWS_SECRET_ACCESS_KEY={aws_secret_access_key}"

S3_ENDPOINT_URL = os.environ.get("S3_ENDPOINT_URL", None)
R2_ACCOUNT_ID = os.environ.get("R2_ACCOUNT_ID", None)

if R2_ACCOUNT_ID:
    s3 = boto3.resource("s3", endpoint_url=f"https://{R2_ACCOUNT_ID}.r2.cloudflarestorage.com", config=config)
    return s3.Bucket(bucket_name)
elif S3_ENDPOINT_URL:
    s3 = boto3.resource("s3", endpoint_url=f"{S3_ENDPOINT_URL}", config=config)
    return s3.Bucket(bucket_name)
else:
    s3 = boto3.resource("s3", config=config)
    return s3.Bucket(bucket_name)

Jeffwan added the RFC label Jul 10, 2024

Jeffwan mentioned this issue Jul 24, 2024

[Core] Support load and unload LoRA in api server #6566

Merged

nstogner mentioned this issue Aug 27, 2024

Support dynamic LoRA serving substratusai/kubeai#132

Open

Jeffwan mentioned this issue Sep 19, 2024

[Core] Support Lora lineage and base model metadata management #6315

Merged

liu-cong mentioned this issue Nov 6, 2024

[Feature]: Enhance integration with advanced LB/gateways with better load/cost reporting and LoRA management #10086

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC]: Enhancing LoRA Management for Production Environments in vLLM #6275

[RFC]: Enhancing LoRA Management for Production Environments in vLLM #6275

Jeffwan commented Jul 10, 2024 •

edited

Loading

simon-mo commented Jul 10, 2024

Yard1 commented Jul 11, 2024

codybum commented Jul 14, 2024

lizzzcai commented Jul 26, 2024

llama-shepard commented Aug 5, 2024 •

edited by linear bot

Loading

[RFC]: Enhancing LoRA Management for Production Environments in vLLM #6275

[RFC]: Enhancing LoRA Management for Production Environments in vLLM #6275

Comments

Jeffwan commented Jul 10, 2024 • edited Loading

Motivation.

Motivation

Proposed Change.

1. Support Dynamically Loading or Unloading LoRA Adapters

2. Load LoRA Adapters from Remote Storage

3. Build Better LoRA Model Lineage

4. Lora Observability enhancement

5. Control Plane support(service discovery, load balancing, scheduling) for Loras

PR List

Feedback Period.

CC List.

Any Other Things.

simon-mo commented Jul 10, 2024

Yard1 commented Jul 11, 2024

codybum commented Jul 14, 2024

lizzzcai commented Jul 26, 2024

llama-shepard commented Aug 5, 2024 • edited by linear bot Loading

LOAD ADAPTERS FROM S3 SUPPORTED STORAGES

Jeffwan commented Jul 10, 2024 •

edited

Loading

llama-shepard commented Aug 5, 2024 •

edited by linear bot

Loading