Prefix caching. #675
Conversation
406393f
if (!blocks.size()) {
    return nullptr;
}
auto hash_block = std::min_element(std::begin(blocks), std::end(blocks), block_is_less);
do you think we need to store blocks in an already sorted manner? In that case get_lru_block() would take O(1).
If we store blocks in a sorted structure like a priority_queue, get_lru_block() will be O(1), but get_block() by hash will be O(n), as we would need to loop over all blocks and check their hashes:

KVCacheBlock::Ptr get_block(size_t hash) {

Currently, with the hash table, get_block() is O(1). We could probably keep two structures in the evictor, both a hash table and a priority_queue?
OK, we can try this optimization. But maybe let's first compare against the vLLM implementation? We can measure how much time this piece of code takes on average.
Output schedule(std::vector<SequenceGroup::Ptr>& sequence_groups) {
    Output scheduler_output;

    if (m_config.enable_prefix_caching)
        _restore_cached_blocks(sequence_groups);
do we need to perform it on each iteration? As far as I remember, vLLM does this prefix caching during creation of sequence groups.
Agree, moved restoring of blocks to add_request().
if (sequence_group->get_num_processed_tokens() == 0)
    m_block_manager.allocate(sequence, num_required_blocks, sequence_group->get_prompt_ids());
else
    m_block_manager.append_slots(sequence_group);
why do we need this condition based on get_num_processed_tokens()?
append_slots() is needed because, after restoring blocks, we might not need to allocate a new block; this situation happens only if sequence_group->get_num_processed_tokens() > 0. But append_slots() can actually be used in both cases, so I removed the condition and now use append_slots() for both.
}
const char* data = reinterpret_cast<const char*>(content.data());
std::size_t size = content.size() * sizeof(content[0]);
return std::hash<std::string_view>{}(std::string_view(data, size));
here, hash computation for the next block recomputes all hashes from the previous blocks. Do you think we can optimize it? E.g., compute all hashes only once, block by block, where the current block's hash relies on the hashes of the previous blocks.
Agree, changed hash computation in PR https://github.com/openvinotoolkit/openvino.genai/pull/758/files
Now the hash is computed from the previous hashes plus the tokens of only the current block.
Applied comments from openvinotoolkit/openvino.genai#675
Port of #639