Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) #3263
Conversation
Example unit test output with the revised test case and without the fix (see commit 3441735).
$ git reset --hard 3441735
> HEAD is now at 3441735 Added test case of lora block_hash conflict.
$ pytest tests/test_cache_block_hashing.py
============================================================= test session starts ==============================================================
platform linux -- Python 3.10.12, pytest-8.0.2, pluggy-1.4.0
plugins: forked-1.6.0, anyio-4.3.0, rerunfailures-13.0, asyncio-0.23.5
asyncio: mode=strict
collected 5 items
tests/test_cache_block_hashing.py ..FFF [100%]
=================================================================== FAILURES ===================================================================
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] __________________________________
model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1]
...
for hash0, hash1 in zip(flatten_2d(hashes[0]), flatten_2d(hashes[1])):
> assert (hash0 != hash1)
E assert 6230683134333785342 != 6230683134333785342
tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] __________________________________
model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1, 2]
...
tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] __________________________________
model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [1, 2]
...
tests/test_cache_block_hashing.py:84: AssertionError
=========================================================== short test summary info ============================================================
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
==================================================== 3 failed, 2 passed, 1 warning in 1.47s ====================================================
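For context, the failing assertions above boil down to one property: the same prompt, hashed once per LoRA configuration, must never produce colliding block hashes. The sketch below is a simplified, self-contained illustration of that check; hash_block is a hypothetical stand-in for the sequence's real block-hashing method, not the vLLM API.

from typing import Optional, Tuple


def hash_block(token_ids: Tuple[int, ...], lora_int_id: Optional[int]) -> int:
    # Stand-in for the per-block hash. Before the fix the LoRA ID was
    # effectively ignored, which is why the assertions above fail.
    return hash((token_ids, lora_int_id))


prompt = tuple(range(16))                      # one block worth of token IDs
hashes = [hash_block(prompt, lora_id)          # same prompt, different adapters
          for lora_id in (None, 1, 2)]

# Every pair of LoRA configurations must produce distinct block hashes,
# mirroring the pairwise comparison in the test output above.
for i, h0 in enumerate(hashes):
    for h1 in hashes[i + 1:]:
        assert h0 != h1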
This PR closes #3264.
Thanks, that's exactly how it should be implemented!
Possible fix for conflict between Automated Prefix Caching (vllm-project#2762) and multi-LoRA support (vllm-project#1804) (vllm-project#3263)
Hi, I'm looking for a solution to this issue for the OpenAI server, where I request the LoRA adapter in my POST request. This is the command I use to start my server: vllm serve unsloth/Llama-3.2-3B. I was wondering if it's possible to either adjust this server command or change something in the inference request on the user side so that caching stops affecting the responses when switching directly from one adapter to another between inference calls. I'm hoping there is something I can add to the server command that solves this. If anyone could point me in the right direction it would be much appreciated!
Ensures the LoRA ID is part of the hash used for prefix blocks.
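The core of the change is to fold the LoRA ID into each block hash so prefix-cache blocks are never shared between sequences running different adapters. Below is a minimal sketch of that idea; the Sequence class, its fields, and the method name are illustrative stand-ins, not the exact vLLM code.

from typing import List, Optional


class Sequence:
    """Simplified stand-in for a sequence with prefix-cache block hashing."""

    def __init__(self, token_ids: List[int], block_size: int,
                 lora_int_id: Optional[int] = None):
        self.token_ids = token_ids
        self.block_size = block_size
        # Assumption: 0 means "no LoRA adapter attached".
        self.lora_int_id = lora_int_id or 0

    def hash_of_block(self, logical_idx: int) -> int:
        # Hash the token prefix up to and including this block, plus the
        # LoRA ID. Including lora_int_id is what keeps two sequences with
        # identical prompts but different adapters from mapping to the
        # same cache block.
        num_tokens = self.block_size * (logical_idx + 1)
        prefix = tuple(self.token_ids[:num_tokens])
        return hash((prefix, self.lora_int_id))

With this, two sequences that share a prompt but carry different lora_int_id values hash to different blocks, which is exactly the property the revised test case checks.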