[RFC]: Pinned Caching with Automatic Prefix Caching (Related to Anthropic Prompt Caching API) #8333
Comments
Interesting. How do you plan to handle a case where we run out of KV cache space in GPU memory before the expiration date?
This is an issue that needs to be addressed. In my current approach, I've set a maximum number of blocks that can be pinned; if pinned caching requests exceed this maximum block limit, an error is raised.
In that case, it would probably make more sense to fail the request and not the engine.
I think this is an interesting idea, but fleshing out the corner cases around resource contention is key to making this robust.
Would you support pinned caching through the OpenAI chat completion API in the same manner as Anthropic? I'm not sure how you would support this for text completion APIs, but I am curious. Also relevant to this issue is ensuring that the cache is precise to the token, so I imagine we'd want to set the KV cache block size to 1 token as suggested in #8306. My question there applies here: how substantial is the overhead of setting the KV cache block size to one token, and does that introduce inefficiencies?
@Yard1
@robertgshaw2-neuralmagic Pinned caching ensures specific blocks are retained until certain conditions are met, even if they would otherwise have been evicted under an LRU policy. For this reason, contention in such situations seems unavoidable, but utilizing CPU memory could help alleviate it. I also want to expand the functionality to make use of CPU memory, but that would be part of future plans.
It was a bit challenging to revise OpenAI's completion API to function like Anthropic's API, because Anthropic uses cache control on certain parts of prompts, which might require additional parsing and adherence to prompt-specific rules. I’ve added a pinned caching option to the completion API. This means the full input prompt undergoes a prefill operation and is then pinned in the cache. For example:
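A minimal sketch of what such a request could look like, assuming hypothetical `pinned_cache` and `pinned_cache_ttl` fields on the completions endpoint (the actual parameter names in the draft PR may differ):

```python
# Hypothetical sketch: the "pinned_cache" and "pinned_cache_ttl" fields are
# illustrative assumptions, not the actual parameters from the draft PR.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "prompt": "<long shared system prompt and few-shot examples>",
        "max_tokens": 1,          # generation is not the point; we only want the prefill
        "pinned_cache": True,     # hypothetical flag: pin the prefilled KV blocks
        "pinned_cache_ttl": 300,  # hypothetical expiration time in seconds
    },
)
print(response.json())
```

Subsequent requests sharing the same prompt prefix would then reuse the pinned blocks through automatic prefix caching.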
Additionally, I’m considering adding APIs for managing pinned cached sequences, such as adding, expiring, and retrieving information, beyond just the completion API revision. I haven’t added the APIs yet and am currently looking for better ideas on how to proceed.
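To make that discussion concrete, here is a purely illustrative sketch of what explicit management endpoints might look like; none of these routes exist in vLLM, and all names here are hypothetical:

```python
# Illustrative only: hypothetical routes for adding, inspecting, and expiring
# pinned cache entries, separate from the completion API.
import requests

BASE = "http://localhost:8000/v1/pinned_cache"   # hypothetical route prefix

# Pin a prompt prefix and receive an identifier for later management.
resp = requests.post(BASE, json={"prompt": "<shared system prompt>", "ttl": 300})
cache_id = resp.json()["id"]

# Retrieve information about the pinned entry (e.g. token count, expiration time).
info = requests.get(f"{BASE}/{cache_id}").json()

# Expire the entry explicitly instead of waiting for the TTL.
requests.delete(f"{BASE}/{cache_id}")
```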
Yes, I encountered the same issue. I haven't measured it directly, but setting the block size to 1 token could introduce significant overhead. For example, I had issues with the eviction logic in prefix caching mode (#7209), and I suspect that if the block size is reduced to 1 and the number of blocks increases, the logic for controlling and scheduling these blocks would lead to even higher overhead. As a potential solution, I considered reusing only immutable blocks (those fully filled with token IDs), which seemed more efficient. In the current prefix caching setup, mutable blocks (those not fully populated with token IDs) cannot be cached or reused.
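A quick illustration of the trade-off, assuming only fully populated blocks are hashable and reusable (the numbers are examples, not measurements):

```python
# With block_size tokens per block, only completely filled ("immutable") blocks
# can be hashed and shared; the trailing partial block stays uncached.
def cacheable_blocks(num_prompt_tokens: int, block_size: int) -> int:
    return num_prompt_tokens // block_size

print(cacheable_blocks(1000, 16))  # 62 full blocks; the last 8 tokens are not reusable
print(cacheable_blocks(1000, 1))   # 1000 blocks to hash, track, and schedule -> more overhead
```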
I would suggest doing something similar to what's already being done for requests that exceed the input length limit. Raising exceptions will just kill the engine.
Yes, I would include that validation logic instead of raising runtime errors.
What do you think about adding explicit APIs for managing pinned cached sequences?
@llsj14 The first thing that comes to mind is that explicit APIs add a new dimension to metering and cost analysis, but also to security. If it's efficient to check whether a prefix is cached, an attacker could brute-force the cached prompts of other users by probing prefixes. And ironically, applying an LLM to guess extensions to the prefix makes this process much more efficient than guessing one token at a time out of a 100k-200k token vocabulary. So I think for this RFC, it must be possible to partition the automatic prefix cache by a key, e.g. a customer ID.
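One way to picture that partitioning (a sketch, not vLLM's actual block hashing code) is to fold a caller-supplied key into the prefix-cache block hash, so identical prefixes from different customers map to different cache entries:

```python
# Illustrative sketch of partitioning the automatic prefix cache by a key;
# the function and parameter names are assumptions for discussion only.
import hashlib
from typing import Optional, Sequence

def block_hash(token_ids: Sequence[int],
               prev_block_hash: Optional[int],
               partition_key: str = "") -> int:
    """Hash a block's tokens together with the previous block's hash and a
    per-customer partition key, preventing cross-user prefix probing."""
    h = hashlib.sha256()
    h.update(partition_key.encode())
    h.update(str(prev_block_hash).encode())
    h.update(",".join(map(str, token_ids)).encode())
    return int.from_bytes(h.digest()[:8], "little")
```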
@AaronFriel I agree with your opinion that a user (customer) ID is needed to distinguish which user a pinned cached prompt belongs to and to ensure that users only receive information that belongs to them. Explicit APIs would make it much easier to handle this compared to reusing the existing completion API.
Partitioning KV memory usage by user sounds a bit complex to develop at this stage, but I need to consider it further from a security perspective. It's an interesting idea.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
@akai-shuuichi Would you have any suggestions for handling resource contention?
Motivation.
Proposed Change.
A `PinnedCachingManager` will handle pinned caching sequences, placed directly under the `Scheduler`.
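A minimal sketch of what such a manager could look like; the class, method, and field names are assumptions for illustration and do not reproduce the draft PR (#8334):

```python
# Sketch of a PinnedCachingManager that the Scheduler could consult each step.
import time
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PinnedEntry:
    block_ids: List[int]   # physical KV cache blocks that must not be evicted
    expires_at: float      # wall-clock expiration timestamp


class PinnedCachingManager:
    """Tracks pinned sequences so the block manager skips them during LRU eviction."""

    def __init__(self, max_pinned_blocks: int) -> None:
        self.max_pinned_blocks = max_pinned_blocks
        self._entries: Dict[str, PinnedEntry] = {}

    def can_pin(self, num_blocks: int) -> bool:
        used = sum(len(e.block_ids) for e in self._entries.values())
        return used + num_blocks <= self.max_pinned_blocks

    def pin(self, seq_id: str, block_ids: List[int], ttl_s: float) -> bool:
        # Validate instead of raising, so an over-limit request fails gracefully
        # rather than killing the engine (as discussed in the comments above).
        if not self.can_pin(len(block_ids)):
            return False
        self._entries[seq_id] = PinnedEntry(block_ids, time.time() + ttl_s)
        return True

    def expire(self) -> List[int]:
        """Release blocks whose TTL has passed; called by the scheduler each step."""
        now = time.time()
        freed: List[int] = []
        for seq_id in [s for s, e in self._entries.items() if e.expires_at <= now]:
            freed.extend(self._entries.pop(seq_id).block_ids)
        return freed
```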
Feedback Period.
2 Weeks, 9/11-9/25
CC List.
cc. @alexm-neuralmagic @robertgshaw2-neuralmagic @Yard1 @cadedaniel @youkaichao
Any Other Things.
I have drafted code to implement these features and hope to refine it through discussions here.
#8334