Manage page allocations through a PageAllocation object, rather than straight in InferenceExecRequest #607
Comments
We currently manage pages directly in `InferenceExecRequest`: see `shark-ai/shortfin/python/shortfin_apps/llm/components/messages.py`, lines 55 to 81 at 0e74c39.

The new methods would correspond to those call sites; the creation of a cache allocation should happen in `lock_initial_cache_pages`.
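A minimal sketch of what a `PageAllocation` object might look like, assuming a simple acquire/publish/release lifecycle. The class and method names here (`PagePool`, `acquire`, `publish`, `release`) are illustrative assumptions, not the actual shortfin API:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PagePool:
    """Toy stand-in for shortfin's page pool (hypothetical sketch)."""
    free_pages: List[int] = field(default_factory=lambda: list(range(16)))

    def acquire(self, count: int) -> List[int]:
        if count > len(self.free_pages):
            raise RuntimeError("page pool exhausted")
        taken, self.free_pages = self.free_pages[:count], self.free_pages[count:]
        return taken

    def release(self, pages: List[int]) -> None:
        self.free_pages.extend(pages)


class PageAllocation:
    """Owns the cache pages for one inference request.

    Instead of the request mutating a raw page list, all page state
    lives behind this object, so release logic exists in one place.
    """

    def __init__(self, pool: PagePool, pages: List[int]):
        self._pool = pool
        self.pages = pages
        self._released = False

    def publish(self, up_to: int) -> None:
        # Signal that pages[:up_to] hold fully written KV data and may
        # be shared with other requests (e.g. by a prefix trie).
        # Left as a no-op in this sketch.
        pass

    def release(self) -> None:
        # Idempotent: pages return to the pool exactly once, even if
        # the request's cleanup path runs more than once.
        if not self._released:
            self._pool.release(self.pages)
            self._released = True
```

With this shape, `lock_initial_cache_pages` would construct the `PageAllocation` and store it on the request, and request completion would call `release()` unconditionally.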
Creates space for #593 (prefix sharing). Coming next: #607, which should be the last thing I do before I can check in my blocktrie implementation.

Summary of changes:
- Copied over Stella's cache.py and renamed it to page_pool.py
- Each inference request now notifies the cache when its pages are done being written to
Implementing in #608
To manage the lifecycle of page allocations for an inference request, it may be important to use an interface that encapsulates:
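One way to make that lifecycle hard to get wrong is a context-manager-style allocation that guarantees pages return to the pool when the request finishes, even on failure. This is a self-contained sketch under that assumption; none of these names come from the shortfin codebase:

```python
import contextlib
from typing import Iterator, List


class PagePool:
    """Toy pool tracking free page indices (illustrative only)."""

    def __init__(self, total: int = 8) -> None:
        self.free: List[int] = list(range(total))

    @contextlib.contextmanager
    def allocation(self, count: int) -> Iterator[List[int]]:
        # Acquire pages for the lifetime of one inference request and
        # guarantee they go back to the pool even if the request raises.
        if count > len(self.free):
            raise RuntimeError("page pool exhausted")
        pages, self.free = self.free[:count], self.free[count:]
        try:
            yield pages
        finally:
            self.free.extend(pages)


pool = PagePool()
with pool.allocation(3) as pages:
    assert len(pool.free) == 5   # pages held by the request
assert len(pool.free) == 8       # pages returned on completion
```

The `with` block scopes the allocation to the request; a non-context-manager design would instead need an explicit, idempotent `release()` called from every exit path.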