Prefix caching. #675

Merged

Conversation

popovaan
Contributor

Port of #639

Wovchena added this pull request to the merge queue Jul 26, 2024
Merged via the queue into openvinotoolkit:releases/2024/3 with commit 406393f Jul 26, 2024
27 checks passed
ilya-lavrenov self-assigned this Jul 31, 2024
openvinotoolkit deleted 13 comments from sandye51 Aug 2, 2024
src/cpp/src/block_manager.hpp (4 resolved review threads)
if (!blocks.size()) {
    return nullptr;
}
auto hash_block = std::min_element(std::begin(blocks), std::end(blocks), block_is_less);
Contributor

Do you think we need to store the blocks in an already-sorted manner? In that case get_lru_block() would take O(1).

Contributor Author


If we store blocks in a sorted structure like a priority_queue, get_lru_block() will be O(1), but get_block() by hash will be O(n), as we will need to loop over all blocks and check hashes:

KVCacheBlock::Ptr get_block(size_t hash) {

Currently, with a hash table, get_block() is O(1). We could probably keep two structures in the evictor, both a hash table and a priority_queue?

Contributor


OK, we can try this optimization.

But maybe let's first perform a comparison with the vLLM implementation? We can measure how much time this piece of code takes on average.
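
For illustration, a minimal sketch of the two-structure idea (hypothetical names, not the actual openvino.genai evictor), using a hash table plus an access-ordered list, a classic LRU layout, instead of a priority_queue: lookup by hash stays O(1) average, and evicting the LRU block becomes O(1) instead of an O(n) min_element scan. It assumes blocks enter the evictor in release order.

#include <cstddef>
#include <iterator>
#include <list>
#include <memory>
#include <unordered_map>

struct KVCacheBlock {
    using Ptr = std::shared_ptr<KVCacheBlock>;
    size_t hash = 0;
};

class Evictor {
    // Front of the list holds the least recently used block. This assumes
    // blocks are added to the evictor in the order they are released.
    std::list<KVCacheBlock::Ptr> m_lru_order;
    // hash -> position in the LRU list, for O(1) lookup by hash.
    std::unordered_map<size_t, std::list<KVCacheBlock::Ptr>::iterator> m_blocks;

public:
    void add(const KVCacheBlock::Ptr& block) {
        m_lru_order.push_back(block);  // newest goes to the back
        m_blocks[block->hash] = std::prev(m_lru_order.end());
    }

    KVCacheBlock::Ptr get_block(size_t hash) {  // O(1) average
        auto it = m_blocks.find(hash);
        if (it == m_blocks.end())
            return nullptr;
        KVCacheBlock::Ptr block = *it->second;
        m_lru_order.erase(it->second);  // the block leaves the evictor
        m_blocks.erase(it);
        return block;
    }

    KVCacheBlock::Ptr get_lru_block() {  // O(1) instead of an O(n) scan
        if (m_lru_order.empty())
            return nullptr;
        KVCacheBlock::Ptr block = m_lru_order.front();
        m_blocks.erase(block->hash);
        m_lru_order.pop_front();
        return block;
    }
};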

src/cpp/src/block_manager.hpp (resolved review thread)

Output schedule(std::vector<SequenceGroup::Ptr>& sequence_groups) {
    Output scheduler_output;

    if (m_config.enable_prefix_caching)
        _restore_cached_blocks(sequence_groups);
Contributor


Do we need to perform it on each iteration? As far as I remember, vLLM does this prefix caching during creation of the sequence groups.

Contributor Author (popovaan, Aug 9, 2024)


Agreed, moved the restoring of blocks to add_request().
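
A minimal sketch of that change (hypothetical stand-in types; the real logic lives in the scheduler and the continuous-batching pipeline): cached blocks are restored once, when the request is added, rather than on every schedule() iteration.

#include <memory>
#include <utility>
#include <vector>

struct SequenceGroup { using Ptr = std::shared_ptr<SequenceGroup>; };
struct SchedulerConfig { bool enable_prefix_caching = false; };

class Scheduler {
    SchedulerConfig m_config;
    std::vector<SequenceGroup::Ptr> m_requests;

    void _restore_cached_blocks(const SequenceGroup::Ptr&) {
        // Look up previously computed KV-cache blocks by prompt hash.
    }

public:
    explicit Scheduler(SchedulerConfig config) : m_config(config) {}

    // Prefix-cache restore happens once per request, here, instead of
    // inside schedule(), which runs on every generation step.
    void add_request(SequenceGroup::Ptr sequence_group) {
        if (m_config.enable_prefix_caching)
            _restore_cached_blocks(sequence_group);
        m_requests.push_back(std::move(sequence_group));
    }
};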

src/cpp/src/scheduler.hpp (resolved review thread)
if (sequence_group->get_num_processed_tokens() == 0)
    m_block_manager.allocate(sequence, num_required_blocks, sequence_group->get_prompt_ids());
else
    m_block_manager.append_slots(sequence_group);
Contributor


Why do we need such a condition based on get_num_processed_tokens()?

Contributor Author


append_slots() is needed because after restoring blocks we might not need to allocate a new block; this situation happens only if sequence_group->get_num_processed_tokens() > 0.

But append_slots() can actually be used in both cases, so I removed the condition and now always use append_slots().
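
Why that works in both cases, sketched with hypothetical names (the real code is in block_manager.hpp): append_slots() only allocates the blocks a sequence is still missing, so for a fresh prompt it allocates everything, and after a prefix-cache restore it allocates just the tail.

#include <cstddef>
#include <vector>

struct BlockManager {
    std::vector<int> m_physical_blocks;  // blocks already held by the sequence
    int m_next_free_block = 0;

    // Allocate only the missing blocks: a no-op when restored blocks
    // already cover all logical blocks, a full allocation otherwise.
    void append_slots(size_t num_logical_blocks) {
        while (m_physical_blocks.size() < num_logical_blocks)
            m_physical_blocks.push_back(m_next_free_block++);
    }
};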

    }
    const char* data = reinterpret_cast<const char*>(content.data());
    std::size_t size = content.size() * sizeof(content[0]);
    return std::hash<std::string_view>{}(std::string_view(data, size));
Contributor


Here the hash computation for the next block will recompute the hashes of all previous blocks. Do you think we can optimize it? E.g., compute all hashes only once, block by block, where the current block relies on the hashes of the previous blocks.

Contributor Author


Agreed, changed the hash computation in PR https://github.com/openvinotoolkit/openvino.genai/pull/758/files
Now the hash is computed from the previous hashes plus the tokens of only the current block.
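
A sketch of that chained scheme (hypothetical function name; see PR #758 for the actual implementation): each block's hash covers the previous block's hash plus only the current block's tokens, so hashing a block costs O(block_size) instead of O(prefix_length), while the result still depends on the whole prefix.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <vector>

size_t chained_block_hash(size_t prev_block_hash, const std::vector<int64_t>& block_tokens) {
    // Prepend the previous block's hash so the result depends on the
    // whole prefix without rehashing the prefix tokens themselves.
    std::vector<int64_t> content;
    content.reserve(block_tokens.size() + 1);
    content.push_back(static_cast<int64_t>(prev_block_hash));
    content.insert(content.end(), block_tokens.begin(), block_tokens.end());
    const char* data = reinterpret_cast<const char*>(content.data());
    std::size_t size = content.size() * sizeof(content[0]);
    return std::hash<std::string_view>{}(std::string_view(data, size));
}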

github-merge-queue bot pushed a commit that referenced this pull request Aug 14, 2024
ilya-lavrenov added the category: continuous batching label Oct 15, 2024
ScottZhang812 pushed a commit to ScottZhang812/_openvino.genai that referenced this pull request Dec 23, 2024