[RFC]: Pinned Caching with Automatic Prefix Caching (Related to Anthropic Prompt Caching API) #8333

Open
1 task done
llsj14 opened this issue Sep 10, 2024 · 14 comments


@llsj14
Contributor

llsj14 commented Sep 10, 2024

Motivation.

  • When using automatic prefix caching that manages blocks in an LRU (Least Recently Used) manner, it would be useful to add a pinned caching feature, where blocks are retained until a Time to Live (TTL) expires or a specific duration is reached.
  • The Anthropic API supports prompt caching with TTL, which refreshes as prompts and their corresponding blocks are reused. This functionality is currently not possible in vLLM, as prefix caching operates solely in LRU mode.
  • Adding pinned caching would enhance the control logic for caching by allowing additional flexibility. I am considering features such as TTL, fixed expiration times, and manual expiration for pinned caching.

Proposed Change.

  • Managing pinned caching at the block level can be complex. I believe managing it at the sequence level would suffice. Therefore, a PinnedCachingManager will handle pinned caching sequences, placed directly under the Scheduler.
  • To reduce implementation complexity, pinned caching will only be supported for GPU memory and will not allow swapping into CPU memory. Pinned caching will be restricted to the prefill stage to prevent swapping into CPU memory.
  • Expiration logic will include TTL (Anthropic-style), fixed time, and manual expiration options. These will be implemented as functions with arguments, allowing for the addition of other expiration strategies.
  • Manual expiration will also be useful, as users may want to manually expire pinned cached sequences and their associated blocks.
  • I added a pinned caching option to the sampling parameters and used an existing API. However, there is an issue regarding whether to add APIs for adding, expiring, and retrieving information about pinned cached sequences.
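A minimal sketch of the sequence-level manager described above, with expiration strategies expressed as functions that take arguments. Every name here (`PinnedSequence`, `ttl`, `fixed_time`, `manual_only`, and the methods on `PinnedCachingManager`) is hypothetical and chosen for illustration; it is not the code in the draft PR and not an existing vLLM API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical names throughout; this only illustrates the RFC text.
ExpirationFn = Callable[["PinnedSequence", float], bool]


@dataclass
class PinnedSequence:
    seq_id: str
    last_used: float       # refreshed on every reuse
    expires: ExpirationFn  # returns True once the pin should be released


def ttl(seconds: float) -> ExpirationFn:
    """TTL that restarts on reuse (the Anthropic-style behavior in the RFC)."""
    return lambda seq, now: now - seq.last_used >= seconds


def fixed_time(deadline: float) -> ExpirationFn:
    """Expire at an absolute timestamp, regardless of reuse."""
    return lambda seq, now: now >= deadline


def manual_only() -> ExpirationFn:
    """Never expire on its own; the pin is removed only through unpin()."""
    return lambda seq, now: False


class PinnedCachingManager:
    def __init__(self) -> None:
        self._pinned: Dict[str, PinnedSequence] = {}

    def pin(self, seq_id: str, expires: ExpirationFn,
            now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._pinned[seq_id] = PinnedSequence(seq_id, now, expires)

    def touch(self, seq_id: str, now: Optional[float] = None) -> None:
        """Record a reuse so that TTL-style policies are refreshed."""
        seq = self._pinned.get(seq_id)
        if seq is not None:
            seq.last_used = time.monotonic() if now is None else now

    def unpin(self, seq_id: str) -> bool:
        """Manual expiration; returns False if the sequence was not pinned."""
        return self._pinned.pop(seq_id, None) is not None

    def expire(self, now: Optional[float] = None) -> List[str]:
        """Drop expired pins and return their ids, so that a scheduler
        could hand the corresponding blocks back to normal LRU eviction."""
        now = time.monotonic() if now is None else now
        expired = [s.seq_id for s in self._pinned.values() if s.expires(s, now)]
        for seq_id in expired:
            del self._pinned[seq_id]
        return expired
```

Because each strategy is just a function of the pinned sequence and the current time, another policy (for example a reuse-count limit) could be added without touching the manager.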

Feedback Period.

2 Weeks, 9/11-9/25

CC List.

cc. @alexm-neuralmagic @robertgshaw2-neuralmagic @Yard1 @cadedaniel @youkaichao

Any Other Things.

I have drafted code to implement these features and hope to refine it through discussions here.
#8334

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@llsj14 llsj14 added the RFC label Sep 10, 2024
@llsj14 llsj14 changed the title [RFC]: Pinned Caching with Automatic Prefix Caching [RFC]: Pinned Caching with Automatic Prefix Caching (Related to Anthropic Prompt Caching API) Sep 10, 2024
@robertgshaw2-redhat
Collaborator

Interesting. How do you plan to handle a case where we run out of KV cache space in GPU memory before the expiration date?

@llsj14
Contributor Author

llsj14 commented Sep 10, 2024

Interesting. How do you plan to handle a case where we run out of KV cache space in GPU memory before the expiration date?

This is an issue that needs to be addressed. In my current approach, I've set a maximum number of blocks that can be allocated for pinned caching. (I haven’t yet implemented handling for duplicate blocks in my draft, but counting unique blocks won’t be a problem.)

If pinned caching requests exceed this maximum block limit, a RuntimeError will be raised with an appropriate message. However, I don't want the engine to stop in this situation. There may be more effective alternatives to throwing a RuntimeError. Could you offer some advice on this?

@Yard1
Collaborator

Yard1 commented Sep 10, 2024

In that case, it would probably make more sense to fail the request and not the engine.

@robertgshaw2-redhat
Collaborator

I think this is an interesting idea, but fleshing out these corner cases related to what happens when resources are contended for is key to making this robust

@AaronFriel

Would you support pinned caching in the OpenAI chat completion API in the same manner as Anthropic does?

I'm not sure how you would support this for text completion APIs but I am curious.

And relevant to this issue is ensuring that the cache is precise to the token, so I imagine we'd want to set the KV cache block size to 1 token as suggested in #8306. My question there applies here: how substantial is the overhead on setting the KV cache block size to one token, does that introduce inefficiencies?

@llsj14
Contributor Author

llsj14 commented Sep 11, 2024

In that case, it would probably make more sense to fail the request and not the engine.

@Yard1
Yeah, I agree. Would it be better to raise a runtime error, or is there a more effective way to fail the request within the Scheduler?

I think this is an interesting idea, but fleshing out these corner cases related to what happens when resources are contended for is key to making this robust

@robertgshaw2-neuralmagic
Yes, that’s an important aspect, and I’d appreciate more feedback on this. I initially thought that setting a maximum number of blocks for pinned caching and defining strict rules (e.g., TTL, reuse counts) would mitigate heavy contention. However, I also want to maintain flexibility with vLLM's functionality, and managing API calls outside of vLLM is another option to consider.

Pinned caching ensures specific blocks are retained until certain conditions are met, even when an LRU policy would otherwise have evicted them. For this reason, contention in such situations seems unavoidable, but utilizing CPU memory could help alleviate it. I also want to expand the functionality to make use of CPU memory, but that would be part of future plans.

@llsj14
Contributor Author

llsj14 commented Sep 11, 2024

@AaronFriel

Would you support pinned caching in the OpenAI chat completion API in the same manner as Anthropic does?
I'm not sure how you would support this for text completion APIs but I am curious.

It was a bit challenging to revise OpenAI's completion API to function like Anthropic's API, because Anthropic uses cache control on certain parts of prompts, which might require additional parsing and adherence to prompt-specific rules.

I’ve added a pinned caching option to the completion API. This means the full input prompt undergoes a prefill operation and is then pinned in the cache. For example:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "San Francisco is a",
        "max_tokens": 1,
        "pinned_caching": true
    }'
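
For reference, the same request can be built from Python with only the standard library. This is a sketch: the `pinned_caching` field comes from the draft in this thread and is not assumed to exist in any released vLLM version, and `build_pinned_completion_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_pinned_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    max_tokens: int = 1,
) -> urllib.request.Request:
    """Build the POST request from the curl example above."""
    body = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        # Field proposed in this RFC's draft; an assumption, not a released option.
        "pinned_caching": True,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_pinned_completion_request("San Francisco is a")
# To send it against a running server: urllib.request.urlopen(req)
```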

Additionally, I’m considering adding APIs for managing pinned cached sequences, such as adding, expiring, and retrieving information, beyond just the completion API revision. I haven’t added the APIs yet and am currently looking for better ideas on how to proceed.

And relevant to this issue is ensuring that the cache is precise to the token, so I imagine we'd want to set the KV cache block size to 1 token as suggested in #8306. My question there applies here: how substantial is the overhead on setting the KV cache block size to one token, does that introduce inefficiencies?

Yes, I encountered the same issue. I haven’t measured it directly, but setting the block size to 1 token could introduce significant overhead. For example, I had issues with the eviction logic in prefix caching mode (#7209). I suspect that if the block size is reduced to 1 and the number of blocks increases, the logic for controlling and scheduling these blocks would incur even higher overhead.

As a potential solution, I considered reusing only immutable blocks (those filled with token IDs), which seemed more efficient. In the current prefix caching setup, mutable blocks (those not fully populated with token IDs) cannot be cached or reused.
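
To make the granularity trade-off concrete, here is a small sketch of how a prompt splits into full (immutable) blocks and a partially filled tail. The function name `split_cacheable` and the block size of 16 are assumptions for illustration only.

```python
from typing import List, Tuple


def split_cacheable(token_ids: List[int], block_size: int) -> Tuple[List[List[int]], List[int]]:
    """Split a prompt into full blocks (reusable) and a partial tail (not reusable)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n_full = len(token_ids) // block_size
    full_blocks = [
        token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)
    ]
    tail = token_ids[n_full * block_size:]
    return full_blocks, tail


# A 37-token prompt with 16-token blocks: 2 full blocks, 5 tokens left in the tail.
full_blocks, tail = split_cacheable(list(range(37)), block_size=16)
```

With a block size of 1 every token would land in a full block, so the cache would be precise to the token, but the same 37-token prompt would need 37 blocks to track instead of 3.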

@Yard1
Collaborator

Yard1 commented Sep 11, 2024

In that case, it would probably make more sense to fail the request and not the engine.

@Yard1 Yeah, I agree. Would it be better to raise a runtime error, or is there a more effective way to fail the request within the Scheduler?

I would suggest doing something similar to what's already being done for requests that are over the input length limit. Raising exceptions will just kill the engine.

@llsj14
Contributor Author

llsj14 commented Sep 12, 2024

I would suggest doing something similar to what's already being done for requests that are over the input length limit. Raising exceptions will just kill the engine.

Yes, I would include that validation logic instead of raising runtime errors.
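
A rough sketch of what such validation might look like: a cap on unique pinned blocks that returns a rejection for the request instead of raising. `PinDecision`, `PinnedBlockBudget`, and `try_pin` are hypothetical names, and block identity is represented by plain integer hashes for simplicity.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class PinDecision:
    accepted: bool
    reason: Optional[str] = None  # set only when the request is rejected


class PinnedBlockBudget:
    """Hypothetical tracker that caps the unique blocks held by pinned sequences."""

    def __init__(self, max_pinned_blocks: int) -> None:
        self.max_pinned_blocks = max_pinned_blocks
        self._pinned: Set[int] = set()

    def try_pin(self, block_hashes: Iterable[int]) -> PinDecision:
        # Blocks that are already pinned are shared, so only new ones count.
        new_blocks = set(block_hashes) - self._pinned
        needed = len(self._pinned) + len(new_blocks)
        if needed > self.max_pinned_blocks:
            # Reject this request only; nothing is raised, so the engine keeps running.
            return PinDecision(
                False,
                f"pinned block limit exceeded: {needed} > {self.max_pinned_blocks}",
            )
        self._pinned |= new_blocks
        return PinDecision(True)
```

The caller could turn a rejected `PinDecision` into an error response for that one request. Releasing pins is left out here; with shared blocks it would need per-block reference counts.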

@llsj14
Contributor Author

llsj14 commented Sep 12, 2024

I added a pinned caching option to the sampling parameters and used an existing API. However, there is an issue regarding whether to add APIs for adding, expiring, and retrieving information about pinned cached sequences.

What do you think about adding explicit pinned caching APIs (add, manually delete, and retrieve pinned caching info) to the OpenAI server instead of including these options in the existing Completion API?

@AaronFriel

AaronFriel commented Sep 18, 2024

@llsj14 The first thing that comes to mind is that explicit APIs add a new dimension to metering and cost analysis, but also to security. If it's efficient to check whether a prefix is cached, an attacker could brute-force the cached prompts of other users by checking prefixes. And ironically, applying an LLM to guess extensions to the prefix makes this process much more efficient than guessing one token at a time out of 100k-200k.

So I think for this RFC, for automatic prefix caching it must be possible to partition the cache by a key, e.g.: customer ID.

@llsj14
Contributor Author

llsj14 commented Sep 18, 2024

@AaronFriel I agree that a user (customer) ID is needed to identify which user a pinned cached prompt belongs to and to ensure that users only receive information that belongs to them. Explicit APIs would make it much easier to handle this issue compared to reusing the existing completion API.

So I think for this RFC, for automatic prefix caching it must be possible to partition the cache by a key, e.g.: customer ID.

Partitioning the usage of KV memory by users sounds a bit complex for developing the feature at this moment, but I need to consider it further from a security perspective. It’s an interesting idea.
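
One way to picture the partitioning idea without splitting KV memory itself is to mix a partition key into the hash that identifies each cached block, so identical prefixes from different customers never match. This is only a sketch under that assumption; `block_hash` and its parameters are hypothetical and do not describe how vLLM computes block hashes.

```python
import hashlib
from typing import Optional, Sequence


def block_hash(parent_hash: Optional[str], token_ids: Sequence[int],
               partition_key: str = "") -> str:
    """Hash a block from its parent block, its tokens, and a partition key."""
    h = hashlib.sha256()
    # Hash the key first so it has a fixed length and cannot collide with other fields.
    h.update(hashlib.sha256(partition_key.encode("utf-8")).digest())
    h.update((parent_hash or "").encode("ascii"))
    h.update(b"\x00")
    h.update(",".join(str(t) for t in token_ids).encode("ascii"))
    return h.hexdigest()


# Same tokens, different customers: the hashes differ, so there is no cross-tenant hit.
hash_a = block_hash(None, [1, 2, 3], partition_key="customer-a")
hash_b = block_hash(None, [1, 2, 3], partition_key="customer-b")
```

Because each block's hash also depends on its parent's hash, the separation carries through the whole prefix chain, and probing for another customer's prefix would always miss.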


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@llsj14
Contributor Author

llsj14 commented Jan 1, 2025

@akai-shuuichi Do you have any suggestions for handling resource contention?
I read your comment on a different feature request issue. I think moving KV blocks to other media is a good idea, but we would still need a good rule for handling contention in GPU memory, both for corner cases and for performance.

@github-actions github-actions bot added unstale and removed stale labels Jan 2, 2025

4 participants