
Manage page allocations through a PageAllocation object, rather than straight in InferenceExecRequest #607

Open
renxida opened this issue Nov 25, 2024 · 3 comments · May be fixed by #608

Comments


renxida commented Nov 25, 2024

To manage the lifecycle of page allocations for an inference request, it may help to introduce an interface that encapsulates:

  • a list of cached pages
  • a list of newly allocated pages
  • boundaries between cached and allocated pages
  • three operations:
    • publish (make pages available in cache)
    • release (make pages eligible for eviction)
    • get_page_list (get the full list of pages for use in a vmfb kernel invocation)
from abc import ABC, abstractmethod
from typing import List

class PageAllocation(ABC):
    """
    Abstract base class for page allocations in the cache.
    Subclasses only need to implement the core allocation methods.
    """
    @abstractmethod
    def get_page_list(self) -> List[PageInfo]:
        """Returns the list of pages that were allocated."""
        pass

    @abstractmethod
    def publish_pages(self) -> None:
        """
        Makes pages available to other requests after writing is complete.
        Associates tokens with pages and marks them as ready for reading.
        """
        pass

    @abstractmethod
    def release_pages(self) -> None:
        """
        Releases the allocation's reference to pages.
        Pages become eligible for eviction when their reference count reaches zero.
        """
        pass
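As a sanity check of the interface, here is a minimal, hypothetical reference-counting implementation. `PageInfo` is a stand-in dataclass (the real type lives in the page pool), and `RefCountedAllocation` is an illustrative name, not part of the proposal; it just shows how the cached/fresh boundary and the publish/release semantics could fit together.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageInfo:
    """Stand-in for the pool's page metadata; only the index matters here."""
    index: int


class PageAllocation(ABC):
    @abstractmethod
    def get_page_list(self) -> List[PageInfo]: ...

    @abstractmethod
    def publish_pages(self) -> None: ...

    @abstractmethod
    def release_pages(self) -> None: ...


class RefCountedAllocation(PageAllocation):
    """Hypothetical allocation tracking cached vs. newly allocated pages,
    refcounting them so the pool can evict when a count hits zero."""

    def __init__(
        self,
        cached: List[PageInfo],
        fresh: List[PageInfo],
        refcounts: Dict[int, int],
    ):
        self._cached = cached        # pages reused from the cache
        self._fresh = fresh          # pages newly allocated for this request
        self._refcounts = refcounts  # shared with the (imaginary) pool
        for p in cached + fresh:
            refcounts[p.index] = refcounts.get(p.index, 0) + 1
        self._published = False

    def get_page_list(self) -> List[PageInfo]:
        # Cached pages come first; the boundary is len(self._cached).
        return self._cached + self._fresh

    def publish_pages(self) -> None:
        # A real pool would associate token spans with pages and mark them
        # readable by other requests; here we just record the transition.
        self._published = True

    def release_pages(self) -> None:
        # Drop this allocation's reference; pages with a zero count become
        # eligible for eviction.
        for p in self.get_page_list():
            self._refcounts[p.index] -= 1
```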

renxida commented Nov 25, 2024

We currently have this in InferenceExecRequest:

def cache_page_indices(self, max_len: int) -> list[int]:
    if not self.locked_pages:
        return []
    indices = [p.index for p in self.locked_pages]
    if len(indices) > max_len:
        return indices[0:max_len]
    return indices

def free_cache_pages(self):
    cache = self._cache
    if cache:
        pages = self.locked_pages
        self._cache = None
        self.locked_pages = None
        cache.release_pages(pages)

def lock_initial_cache_pages(
    self, cache: AttnPageCache, pages: list[AttnPageEntry]
):
    assert not self._cache
    self._cache = cache
    self.locked_pages = pages

def lock_new_cache_pages(self, cache: AttnPageCache, pages: list[AttnPageEntry]):
    assert self._cache is cache
    self.locked_pages.extend(pages)


renxida commented Nov 25, 2024

The new get_page_list and release_pages methods would correspond to cache_page_indices and free_cache_pages, respectively.

A PageAllocation should be created in lock_initial_cache_pages. lock_new_cache_pages should acquire another PageAllocation and then release the original one; thanks to caching, the newly acquired pages would overlap maximally with the existing pages.
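The acquire-then-release swap described above can be sketched with toy stand-ins. `FakeCache`, `FakeAllocation`, the `acquire` helper, and `extend_allocation` are all hypothetical names for illustration; the point is that because the new allocation reuses the old allocation's pages, extending costs no copies.

```python
from typing import List, Optional


class FakeAllocation:
    """Toy allocation: just a page-index list plus a released flag."""

    def __init__(self, pages: List[int]):
        self.pages = pages
        self.released = False

    def release_pages(self) -> None:
        self.released = True


class FakeCache:
    """Toy stand-in for AttnPageCache. acquire() returns an allocation
    covering num_pages, reusing an existing allocation's pages first."""

    def __init__(self):
        self._next_index = 0

    def acquire(
        self, num_pages: int, existing: Optional[FakeAllocation] = None
    ) -> FakeAllocation:
        # Reuse pages from the existing allocation where possible (the
        # "maximal overlap" described above), then allocate the rest.
        reused = list(existing.pages[:num_pages]) if existing else []
        fresh = []
        for _ in range(num_pages - len(reused)):
            fresh.append(self._next_index)
            self._next_index += 1
        return FakeAllocation(reused + fresh)


def extend_allocation(
    cache: FakeCache, current: FakeAllocation, num_pages: int
) -> FakeAllocation:
    """Acquire a larger allocation, then release the old one. The new
    allocation overlaps the old pages, so no page data moves."""
    new_alloc = cache.acquire(num_pages, existing=current)
    current.release_pages()
    return new_alloc
```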

renxida added a commit that referenced this issue Nov 26, 2024
Creates space for #593 (prefix-sharing)

Coming next: #607, which should be the last thing I do before I can
check in my blocktrie implementation.

Summary of changes:
- copied over stella's cache.py and renamed it to page_pool.py
- each inference request now notifies the cache when its pages are done being written to

renxida commented Nov 26, 2024

Implementing on #608
