[Core] Enable prefix caching with block manager v2 enabled #4142
Conversation

leiwen83 commented Apr 17, 2024 (edited)

@cadedaniel
Hi, @leiwen83, looks like we collided a bit. I just posted #4146 which also adds prefix caching support to V2 of the block manager. Would you be willing to work together to get that PR over the finish line? The main thing I like from this PR is that it stores references to the blocks in the PrefixCachingBlockAllocator. I did something similar with a new BlockMetaData structure, but I think we can make some minor adjustments to the existing LRU evictor such that it will work with generic block objects. That way neither one of us has to add a new, but basically the same, evictor :). Occasional duplication of work is an unfortunate reality with popular open source projects, but I think we can work together and end up merging something pretty good.
Hi @SageMoore, what a coincidence ;) But I'd like to hear from @cadedaniel first before taking further action; he may give us some good advice. Thx,
Thanks @leiwen83 and @SageMoore for the work ~~ Priority-wise, whoever has a more complete reviewable PR should get review precedence. I'll look at this one first since it's no longer WIP and tests are passing. @SageMoore, feel free to add review comments on missing requirements or improvements. I haven't thought through the Evictor design tradeoffs, so clarifying which approach we should go with there would be especially helpful! In the future we should be more communicative in #3667 so we don't have mix-ups.
I think moving incrementally here is good. Let's try to get this PR into a mergable state and go from there.
There are four high level changes that I would suggest for this PR.
- Move the core/block/evictor.py file to core/evictor_v2.py, similar to what was done for block_manager. Rename core/evictor.py to core/evictor_v1.py.
- Let's remove unused_cached_blocks from the PrefixCachingBlockAllocator and just rely on the evictor.
- Assuming that we don't want to make the policy an argument to the BlockSpaceManagerV2, let's move the code that records last_accessed and computed into the PrefixCachingBlockAllocator. The code to manage and record these two variables will likely change if a non-LRU policy is used. For example, we wouldn't want to spend time recording last_accessed time if we switched to a FIFO eviction policy.
- Evictor unit tests would be good. The eviction logic is somewhat cumbersome to exercise in integration tests because it only runs when we run out of blocks.
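The "generic evictor" idea from the discussion above can be sketched roughly as follows. This is a minimal LRU sketch for illustration only, not the vLLM API; the class and method names are assumed:

```python
from collections import OrderedDict
from typing import Tuple


class LRUEvictor:
    """Minimal LRU evictor that holds unreferenced cached blocks.

    Blocks are stored as opaque objects keyed by block id, so the evictor
    does not depend on any particular Block class.
    """

    def __init__(self) -> None:
        # OrderedDict keeps insertion/access order: the oldest entry is first.
        self._free_blocks: OrderedDict = OrderedDict()

    def add(self, block_id: int, block: object) -> None:
        """Register a block that has no more references."""
        self._free_blocks[block_id] = block
        self._free_blocks.move_to_end(block_id)

    def remove(self, block_id: int) -> object:
        """Reclaim a block on a cache hit, removing it from eviction candidates."""
        return self._free_blocks.pop(block_id)

    def evict(self) -> Tuple[int, object]:
        """Pop the least recently used block under memory pressure."""
        return self._free_blocks.popitem(last=False)
```

Because the evictor only keys on a block id, the same code works whether the stored value is a v1 PhysicalTokenBlock or a v2 Block.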
vllm/core/block/evictor.py (outdated)

    class Block():
There's a bit of a name collision here. Can you rename this since "Block" is already used extensively? I went with BlockMetaData. Feel free to use that or a new name :). EvictorBlockInfo?
BlockMetaData is nice, I'll take it~
    @@ -351,6 +376,9 @@ def __init__(
        self._prev_block = prev_block
        self._cached_content_hash: Optional[int] = None
        self._prefix_caching_allocator = prefix_caching_allocator
        self.computed = False
        self.num_hashed_tokens = 0
Nit: Instead of manually setting this when marking blocks as computed, you should be able to do something like:
@property
def num_tokens_total(self) -> int:
return self._prev_block.num_tokens_total + len(self.token_ids)
You'll have to account for the _prev_block == None case.
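A self-contained sketch of that suggestion, with the `_prev_block is None` case handled; the `Block` shape here is simplified from the snippet above:

```python
from typing import List, Optional


class Block:
    def __init__(self, token_ids: List[int],
                 prev_block: Optional["Block"] = None) -> None:
        self.token_ids = token_ids
        self._prev_block = prev_block

    @property
    def num_tokens_total(self) -> int:
        # Tokens in this block plus all blocks before it in the chain.
        if self._prev_block is None:
            return len(self.token_ids)
        return self._prev_block.num_tokens_total + len(self.token_ids)
```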
I didn't get the num_tokens_total usage here...
To be clear, I'm talking about using the "num_tokens_total" property that I posted above as a replacement for "num_hashed_tokens". The primary benefit is that you won't have to update it as part of "mark_blocks_as_computed"
I see, I added this property in the latest commit.
Great, thanks for those comments! I borrowed the evictor test case from your PR and added some changes to make it pass.
Force-pushed from 807808e to 3daae3c (compare)
This looks reasonable to me. Thanks for working on it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Some missing things:
- Can you add a PR description? it helps with fast reviews -- you can detail your approach, the testing, what's missing, etc. Shouldn't take long!
- To call automatic prefix caching in BlockManagerV2 "done", we need end-to-end tests.
- See this example:

  vllm/tests/core/block/e2e/test_correctness.py, lines 187 to 199 (at ceaf4ed):

      def test_lookahead_greedy_equality_with_preemption(baseline_llm_generator,
                                                         test_llm_generator,
                                                         batch_size):
          """Verify vLLM produces the same output with greedy sampling, when
          lookahead scheduling is used vs. not.

          Lookahead scheduling is not expected to modify the output, as it simply
          allocates empty slots ahead of the known token ids in a sliding fashion.

          This test constrains the total number of blocks to force preemption. It
          also varies the block size so that the lookahead size is less than and
          greater than the block size.
          """

- Off the top of my head, we'll want to test the following in APC. Maybe @SageMoore can also help here, since it's a bit out of scope of the BlockManagerV2 itself.
  - Greedy output equality with APC and without APC
  - Greedy output equality with APC + preemption
This PR enables the APC feature on block manager V2. The core idea of APC is to maintain a table mapping content hashes to physical blocks. When a block becomes full, it is promoted to immutable. If several immutable blocks with the same content arrive in sequence, we keep the first physical block and free the others. Those freed physical blocks, i.e. blocks with no references in the scheduler, are housekept inside the evictor. When we later try to allocate a block under memory pressure and find no free one, as in the initial state, we can ask the evictor to pop an unreferenced computed block, which is then reused to store new data coming from the upper layer. The eviction policy is pluggable, but currently only LRU is implemented. Co-authored-by: Sage Moore <[email protected]>
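The content-hash table described in the PR can be sketched as a toy allocator. This is a hedged illustration with hypothetical names; the real PrefixCachingBlockAllocator is considerably more involved:

```python
from typing import Dict, List, Optional


class ToyPrefixCache:
    """Toy content-hash -> physical-block mapping for full (immutable) blocks."""

    def __init__(self, num_blocks: int) -> None:
        self._free: List[int] = list(range(num_blocks))
        self._hash_to_block: Dict[int, int] = {}
        self._refcount: Dict[int, int] = {}

    def allocate(self, content_hash: Optional[int]) -> int:
        # A full block is immutable and deduplicated by its content hash;
        # mutable (not-yet-full) blocks pass content_hash=None.
        if content_hash is not None and content_hash in self._hash_to_block:
            block_id = self._hash_to_block[content_hash]
            self._refcount[block_id] += 1
            return block_id
        block_id = self._free.pop()
        self._refcount[block_id] = 1
        if content_hash is not None:
            self._hash_to_block[content_hash] = block_id
        return block_id

    def free(self, block_id: int) -> None:
        # When the refcount hits zero, a real allocator would move the block
        # to the evictor rather than erase it, so a later hash match can
        # still reuse the computed contents.
        self._refcount[block_id] -= 1
```

Two sequences sharing a prompt prefix then resolve to the same physical block on the second `allocate` call.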
Force-pushed from 7163552 to ebc941e (compare)
Got it, I refactored the PR and made the changes as suggested. For e2e tests I added two: one compares v1 vs v2 with prefix caching enabled, and the other runs v2 + preemption, comparing w/ APC vs w/o APC.
Please add a TODO on the O(N^2) num_total_tokens. Otherwise LGTM.
    _block = self
    self._cached_num_tokens_total = 0
    while _block is not None:
        self._cached_num_tokens_total += len(_block.token_ids)
        _block = _block.prev_block
This is still O(N^2) for the entire sequence, since each block performs an O(N) scan. We should use the design I linked for content_hash, which is O(N).
I will not block the PR (we can optimize later), but you should add a TODO!
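The O(N) design amounts to computing each block's total from the parent's cached total instead of rescanning the whole chain per block. A hedged sketch, simplified from the snippet above:

```python
from typing import List, Optional


class Block:
    def __init__(self, token_ids: List[int],
                 prev_block: Optional["Block"] = None) -> None:
        self.token_ids = token_ids
        self.prev_block = prev_block
        self._cached_num_tokens_total: Optional[int] = None

    @property
    def num_tokens_total(self) -> int:
        # Amortized O(1) per block: reuse the parent's cached total rather
        # than walking the whole chain for every block (the O(N^2) version).
        if self._cached_num_tokens_total is None:
            prev = (0 if self.prev_block is None
                    else self.prev_block.num_tokens_total)
            self._cached_num_tokens_total = prev + len(self.token_ids)
        return self._cached_num_tokens_total
```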
I see, I will add this TODO note.
Thanks @leiwen83
…ect#4142) Co-authored-by: Lei Wen <[email protected]> Co-authored-by: Sage Moore <[email protected]>