Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Enable prefix caching with block manager v2 enabled #4142

Merged
merged 7 commits into from
May 1, 2024

Conversation

leiwen83
Copy link
Contributor

@leiwen83 leiwen83 commented Apr 17, 2024

[Core] Enable prefix caching with block manager v2 enabled

This PR is for enabel APC feature over block manager V2.

APC core idea is like maintain a table mapping between content hash and
physical block. If one block is full, then it would be prompted to be
immutable. While if several immutable comes in sequence, we kept the
first physical block, while free the other.

Those freed physical block, aka no reference at scheduler, would be
housekept inside evictor, when we try to alloc block again under pressue,
and find no free one as in the initial state, we could resort to
evictor to pop one unrefered computed block, which is used to store
new data comes from upper layer.

The evictor policy could be changed, but current only implement LRU one.

Co-authored-by: Sage Moore <[email protected]>

@leiwen83
Copy link
Contributor Author

@cadedaniel
Would you mind take a look for this patch to enable the prefix caching with block manager V2?

@SageMoore
Copy link
Contributor

SageMoore commented Apr 17, 2024

Hi, @leiwen83 looks like we collided a bit. I just posted #4146 which also adds prefix caching support to V2 of the block manager. Would you be willing to work together to get that PR over the finish line? The main thing I like from this PR is that it stores references to the blocks in the PrefixCachingBlockAllocator. I did something similar with a new BlockMetaData structure, but I think we can make some minor adjustments to the existing LRU evictor such that it will work with generic block objects. That way neither one of us has to add a new, but basically the same, evictor :).

Occasional duplication of work is an unfortunate reality with popular open source projects, but I think we can work together and end up merging something pretty good.

@leiwen83
Copy link
Contributor Author

Hi @SageMoore

what a coincidence ;)

But I'd like to hear from @cadedaniel first before take further action, he may give us some good advice.

Thx,
Lei

@cadedaniel
Copy link
Collaborator

Thanks @leiwen83 and @SageMoore for the work ~~

Priority-wise, whoever has a more complete reviewable PR should get review precedence. I'll look at this one first as it's no longer WIP / tests are passing. @SageMoore feel free to add review comments on missing requirements or improvements. I haven't thought through the Evictor design tradeoffs so especially clarifying which approach we should go with for that would be helpful!

In the future we should be more communicative in #3667 so we don't have mixups

Copy link
Contributor

@SageMoore SageMoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think moving incrementally here is good. Let's try to get this PR into a mergable state and go from there.

There are four high level changes that I would suggest for this PR.

  1. Move the core/block/evictor.py file to core/evictor_v2.py, similar to what was done for block_manager. Rename core/evictor.py to core/evictor_v1.py.
  2. Let's remove unused_cached_blocks from the PrefixCachingBlockAllocator and just rely on the evictor.
  3. Assuming that we don't want to make the policy an argument to the BlockSpaceManagerV2, lets move the code that records last_accessed and computed into the PrefixCachingBlockAllocator. The code to manage and record these two variables will likely change if a non-LRU policy is used. For example, we wouldn't want to spend time recording last_accessed time if we switched to a FIFO eviction policy.
  4. Evictor unit tests would be good. The eviction logic is somewhat cumbersome to exercise in integration tests because it only runs when we run out of blocks.

vllm/core/block/prefix_caching_block.py Show resolved Hide resolved
pass


class Block():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bit of a name collision here. Can you rename this since "Block" is already used extensively? I went with BlockMetaData. Feel free to use that or a new name :). EvictorBlockInfo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BlockMetaData is nice, I'll take it~

@@ -351,6 +376,9 @@ def __init__(
self._prev_block = prev_block
self._cached_content_hash: Optional[int] = None
self._prefix_caching_allocator = prefix_caching_allocator
self.computed = False
self.num_hashed_tokens = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Instead of manually setting this when marking blocks as computed, you should be able to do something like:

    @property
    def num_tokens_total(self) -> int:
        return self._prev_block.num_tokens_total + len(self.token_ids)

You'll have to account for the _prev_block == None case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get the num_tokens_total usage here...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I'm talking about using the "num_tokens_total" property that I posted above as a replacement for "num_hashed_tokens". The primary benefit is that you won't have to update it as part of "mark_blocks_as_computed"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, add this property in latest commit

@leiwen83
Copy link
Contributor Author

I think moving incrementally here is good. Let's try to get this PR into a mergable state and go from there.

There are four high level changes that I would suggest for this PR.

  1. Move the core/block/evictor.py file to core/evictor_v2.py, similar to what was done for block_manager. Rename core/evictor.py to core/evictor_v1.py.
  2. Let's remove unused_cached_blocks from the PrefixCachingBlockAllocator and just rely on the evictor.
  3. Assuming that we don't want to make the policy an argument to the BlockSpaceManagerV2, lets move the code that records last_accessed and computed into the PrefixCachingBlockAllocator. The code to manage and record these two variables will likely change if a non-LRU policy is used. For example, we wouldn't want to spend time recording last_accessed time if we switched to a FIFO eviction policy.
  4. Evictor unit tests would be good. The eviction logic is somewhat cumbersome to exercise in integration tests because it only runs when we run out of blocks.

Great thanks for those comments! And I borrow the evictor test case from your PR, and add some change to make it pass.
Let's see what else need to be done for this PR.

@SageMoore
Copy link
Contributor

This looks reasonable to me. Thanks for working on it!

Copy link
Collaborator

@cadedaniel cadedaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Some missing things:

  • Can you add a PR description? it helps with fast reviews -- you can detail your approach, the testing, what's missing, etc. Shouldn't take long!
  • To call automatic prefix caching in BlockManagerV2 "done", we need end-to-end tests.
    • See this example
      def test_lookahead_greedy_equality_with_preemption(baseline_llm_generator,
      test_llm_generator,
      batch_size):
      """Verify vLLM produces the same output with greedy sampling, when lookahead
      scheduling is used vs. not.
      Lookahead scheduling is not expected to modify the output, as it simply
      allocates empty slots ahead of the known token ids in a sliding fashion.
      This test constrains the total number of blocks to force preemption. It also
      varies the block size so that the lookahead size is less than and greater
      than the block size.
      """
      , which compares decoding output of a sequence with lookahead scheduling against output without lookahead scheduling and verifies they are the same.
    • Off the top of my head, we'll want to test the following in APC. Maybe @SageMoore can also help here, since it's a bit out of scope of the BlockManagerV2 itself.
      • Greedy output equality with APC and without APC
      • Greedy output equality with APC + preemption

tests/core/block/test_prefix_caching_block.py Outdated Show resolved Hide resolved
tests/core/block/test_prefix_caching_block.py Outdated Show resolved Hide resolved
tests/core/block/test_prefix_caching_block.py Outdated Show resolved Hide resolved
tests/core/block/test_prefix_caching_block.py Outdated Show resolved Hide resolved
vllm/core/block/cpu_gpu_block_allocator.py Outdated Show resolved Hide resolved
vllm/core/block_manager_v2.py Outdated Show resolved Hide resolved
vllm/core/block_manager_v2.py Outdated Show resolved Hide resolved
vllm/core/block_manager_v2.py Show resolved Hide resolved
vllm/core/block/cpu_gpu_block_allocator.py Outdated Show resolved Hide resolved
vllm/core/block/cpu_gpu_block_allocator.py Show resolved Hide resolved
This PR is for enabel APC feature over block manager V2.

APC core idea is like maintain a table mapping between content hash and
physical block. If one block is full, then it would be prompted to be
immutable. While if several immutable comes in sequence, we kept the
first physical block, while free the other.

Those freed physical block, aka no reference at scheduler, would be
housekept inside evictor, when we try to alloc block again under pressue,
and find no free one as in the initial state, we could resort to
evictor to pop one unrefered computed block, which is used to store
new data comes from upper layer.

The evictor policy could be changed, but current only implement LRU one.

Co-authored-by: Sage Moore <[email protected]>
@leiwen83
Copy link
Contributor Author

Thanks for the PR. Some missing things:

  • Can you add a PR description? it helps with fast reviews -- you can detail your approach, the testing, what's missing, etc. Shouldn't take long!

  • To call automatic prefix caching in BlockManagerV2 "done", we need end-to-end tests.

    • See this example

      def test_lookahead_greedy_equality_with_preemption(baseline_llm_generator,
      test_llm_generator,
      batch_size):
      """Verify vLLM produces the same output with greedy sampling, when lookahead
      scheduling is used vs. not.
      Lookahead scheduling is not expected to modify the output, as it simply
      allocates empty slots ahead of the known token ids in a sliding fashion.
      This test constrains the total number of blocks to force preemption. It also
      varies the block size so that the lookahead size is less than and greater
      than the block size.
      """

      , which compares decoding output of a sequence with lookahead scheduling against output without lookahead scheduling and verifies they are the same.

    • Off the top of my head, we'll want to test the following in APC. Maybe @SageMoore can also help here, since it's a bit out of scope of the BlockManagerV2 itself.

      • Greedy output equality with APC and without APC
      • Greedy output equality with APC + preemption

Got it, I refactor the PR and make the change as suggested. For e2e test, I add two, one is prefill caching enabled while comparing v1 vs v2, and in v2 + preemption, w/ APC vs w/o APC.

Copy link
Collaborator

@cadedaniel cadedaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a TODO on the O(N^2) num_total_tokens. Otherwise LGTM.

Comment on lines 486 to 490
_block = self
self._cached_num_tokens_total = 0
while _block is not None:
self._cached_num_tokens_total += len(_block.token_ids)
_block = _block.prev_block
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still O(N^2) for the entire sequence, e.g. each block will perform O(N) scan. We should use the design I linked to you content_hash which is O(N).

I will not block the PR (we can optimize later) but you should add a TODO!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I would add this TODO note

@leiwen83 leiwen83 requested a review from cadedaniel May 1, 2024 13:59
Copy link
Collaborator

@cadedaniel cadedaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @leiwen83

@cadedaniel cadedaniel merged commit 24750f4 into vllm-project:main May 1, 2024
48 checks passed
@leiwen83 leiwen83 deleted the block_v2_prefix branch May 2, 2024 08:58
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 6, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 7, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants