Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) #3263
Conversation
Example unit test output with the revised test case and without the fix (see commit 3441735).
$ git reset --hard 3441735
> HEAD is now at 3441735 Added test case of lora block_hash conflict.
$ pytest tests/test_cache_block_hashing.py
============================================================= test session starts ==============================================================
platform linux -- Python 3.10.12, pytest-8.0.2, pluggy-1.4.0
plugins: forked-1.6.0, anyio-4.3.0, rerunfailures-13.0, asyncio-0.23.5
asyncio: mode=strict
collected 5 items
tests/test_cache_block_hashing.py ..FFF [100%]
=================================================================== FAILURES ===================================================================
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] __________________________________
model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1]
...
for hash0, hash1 in zip(flatten_2d(hashes[0]), flatten_2d(hashes[1])):
> assert (hash0 != hash1)
E assert 6230683134333785342 != 6230683134333785342
tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] __________________________________
model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1, 2]
...
tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] __________________________________
model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [1, 2]
...
tests/test_cache_block_hashing.py:84: AssertionError
=========================================================== short test summary info ============================================================
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
==================================================== 3 failed, 2 passed, 1 warning in 1.47s ====================================================
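For context, the failing assertions above boil down to one property: the same prompt, hashed once per LoRA configuration, must never produce colliding block hashes. The sketch below is a simplified, self-contained illustration of that check; hash_block is a hypothetical stand-in for the sequence's real block-hashing method, not the vLLM API.

from typing import Optional, Tuple


def hash_block(token_ids: Tuple[int, ...], lora_int_id: Optional[int]) -> int:
    # Stand-in for the per-block hash. Before the fix the LoRA ID was
    # effectively ignored, which is why the assertions above fail.
    return hash((token_ids, lora_int_id))


prompt = tuple(range(16))                      # one block worth of token IDs
hashes = [hash_block(prompt, lora_id)          # same prompt, different adapters
          for lora_id in (None, 1, 2)]

# Every pair of LoRA configurations must produce distinct block hashes,
# mirroring the pairwise comparison in the test output above.
for i, h0 in enumerate(hashes):
    for h1 in hashes[i + 1:]:
        assert h0 != h1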
This PR closes #3264.
Thanks, that's exactly how it should be implemented!
Possible fix for conflict between Automated Prefix Caching (vllm-project#2762) and multi-LoRA support (vllm-project#1804) (vllm-project#3263)
Hi, I'm looking for a solution to this issue for the OpenAI server, where I request the LoRA adapter in my POST request. This is the command I use to start my server: vllm serve unsloth/Llama-3.2-3B. I was wondering if it's possible to either adjust this server command or change something in the inference request on the user side so that caching stops affecting the responses when switching directly from one adapter to another between inference calls. I'm hoping there is something I can add to the server command that solves this. If anyone could point me in the right direction it would be much appreciated!
Ensures the LoRA ID is part of the hash used for prefix blocks.
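The core of the change is to fold the LoRA ID into each block hash so prefix-cache blocks are never shared between sequences running different adapters. Below is a minimal sketch of that idea; the Sequence class, its fields, and the method name are illustrative stand-ins, not the exact vLLM code.

from typing import List, Optional


class Sequence:
    """Simplified stand-in for a sequence with prefix-cache block hashing."""

    def __init__(self, token_ids: List[int], block_size: int,
                 lora_int_id: Optional[int] = None):
        self.token_ids = token_ids
        self.block_size = block_size
        # Assumption: 0 means "no LoRA adapter attached".
        self.lora_int_id = lora_int_id or 0

    def hash_of_block(self, logical_idx: int) -> int:
        # Hash the token prefix up to and including this block, plus the
        # LoRA ID. Including lora_int_id is what keeps two sequences with
        # identical prompts but different adapters from mapping to the
        # same cache block.
        num_tokens = self.block_size * (logical_idx + 1)
        prefix = tuple(self.token_ids[:num_tokens])
        return hash((prefix, self.lora_int_id))

With this, two sequences that share a prompt but carry different lora_int_id values hash to different blocks, which is exactly the property the revised test case checks.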