Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) #3263

Conversation

@jacobthebanana (Contributor) commented Mar 7, 2024

Ensures the LoRA ID is a part of the hash used for prefix blocks.
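
In essence, the fix folds the LoRA integer ID into the hash that identifies each prefix block, so two requests with the same prompt but different adapters can no longer resolve to the same cached KV block. A minimal sketch of the idea (an assumption, not the exact vLLM code; hash_of_block and its arguments are hypothetical stand-ins):

from typing import Sequence

def hash_of_block(token_ids: Sequence[int],
                  num_hashed_tokens: int,
                  lora_int_id: int = 0) -> int:
    # Hypothetical helper: hash the token prefix covered by this block
    # together with the LoRA ID. lora_int_id is 0 when no adapter is
    # attached, so requests without LoRA keep their previous hashing
    # behaviour.
    return hash((tuple(token_ids[:num_hashed_tokens]), lora_int_id))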

@jacobthebanana (Contributor, Author)

Example unit test output with the revised test case and without the fix (see commit 3441735).

  • test_auto_prefix_caching passes when the request specifies a single LoRA adapter, or when no adapter is requested.
  • test_auto_prefix_caching fails when subsequent requests specify different adapters (or when one request has no adapter and another has a LoRA adapter enabled); a sketch of the check and the pytest output follow.
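
For reference, a self-contained sketch of what the revised check boils down to (an assumption, simplified from tests/test_cache_block_hashing.py; hash_of_block is the hypothetical helper sketched under the PR description, not vLLM's actual method):

def hash_of_block(token_ids, num_hashed_tokens, lora_int_id=0):
    # Hypothetical helper: hash the token prefix together with the LoRA ID.
    return hash((tuple(token_ids[:num_hashed_tokens]), lora_int_id))

def test_auto_prefix_caching_sketch():
    token_ids = list(range(64))      # stand-in prompt tokens
    block_size = 16
    # Per-block hashes for the same prompt, once without an adapter and
    # once with LoRA ID 1.
    hashes = [
        [hash_of_block(token_ids, (j + 1) * block_size, lora_int_id or 0)
         for j in range(len(token_ids) // block_size)]
        for lora_int_id in (None, 1)
    ]
    for hash0, hash1 in zip(hashes[0], hashes[1]):
        # Passes with the LoRA ID folded into the hash; drop lora_int_id
        # from hash_of_block to reproduce the collisions in the log below.
        assert hash0 != hash1

The actual pytest output without the fix: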
$ git reset --hard 3441735
> HEAD is now at 3441735 Added test case of lora block_hash conflict.
$ pytest tests/test_cache_block_hashing.py
============================================================= test session starts ==============================================================
platform linux -- Python 3.10.12, pytest-8.0.2, pluggy-1.4.0
plugins: forked-1.6.0, anyio-4.3.0, rerunfailures-13.0, asyncio-0.23.5
asyncio: mode=strict
collected 5 items                                                                                                                              

tests/test_cache_block_hashing.py ..FFF                                                                                                  [100%]

=================================================================== FAILURES ===================================================================
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] __________________________________

model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1]

...

        for hash0, hash1 in zip(flatten_2d(hashes[0]), flatten_2d(hashes[1])):
>           assert (hash0 != hash1)
E           assert 6230683134333785342 != 6230683134333785342

tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] __________________________________

model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1, 2]
...

tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] __________________________________

model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [1, 2]
...

tests/test_cache_block_hashing.py:84: AssertionError
=========================================================== short test summary info ============================================================
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
==================================================== 3 failed, 2 passed, 1 warning in 1.47s ====================================================

@jacobthebanana marked this pull request as ready for review March 7, 2024 22:02
@jacobthebanana (Contributor, Author)

This PR closes #3264

@Yard1 (Collaborator) left a comment

Thanks, that's exactly how it should be implemented!

@Yard1 enabled auto-merge (squash) March 7, 2024 22:06
@Yard1 merged commit 8cbba46 into vllm-project:main Mar 7, 2024
23 checks passed
AdrianAbeyta pushed a commit to AdrianAbeyta/vllm that referenced this pull request Mar 8, 2024
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
@JJEccles

Hi, I'm looking for a solution to this issue, but for the OpenAI server calls where I request the LoRA adapter in my POST request. This is the command I use to start my server:

vllm serve unsloth/Llama-3.2-3B \
    --tokenizer unsloth/Llama-3.2-3B \
    --port 8000 \
    --max-model-len 2048 \
    --enable-lora \
    --lora-modules profile_adapter=adapters_tokenizer_profile ingredientslist_adapter=adapters_tokenizer_list_ing \
    --max-lora-rank 64

And I was wondering if it's possible to either adjust this server command or change something in the inference request on the user side, so that caching stops affecting the responses when switching directly from one adapter to another between inference calls. I'm hoping there is something I can add to the server launch command to solve this. If anyone could point me in the right direction it would be much appreciated!
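
For reference, the request pattern described above typically selects the adapter through the model field of the OpenAI-compatible API, using the name it was registered under in --lora-modules. A minimal sketch with the OpenAI Python client (the base URL, adapter name, and prompt are assumptions taken from the command above, not a verified answer to the caching question):

from openai import OpenAI

# Assumes the server from the command above is listening on port 8000 and
# that "profile_adapter" was registered via --lora-modules.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="profile_adapter",            # select the LoRA module by name
    prompt="Write a short user profile for a home cook.",
    max_tokens=64,
)
print(completion.choices[0].text)

With the change in this PR, the LoRA ID is part of each prefix-block hash, so cached blocks should not be reused across different adapters even when the prompts match.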
