
Manage page allocations through a PageAllocation object, rather than straight in InferenceExecRequest #607

Open
renxida opened this issue Nov 25, 2024 · 3 comments · May be fixed by #608

Comments


renxida commented Nov 25, 2024

To manage the lifecycle of page allocations for an inference request, it may help to introduce an interface that encapsulates:

  • a list of cached pages
  • a list of newly allocated pages
  • boundaries between cached and allocated pages
  • three operations:
    • publish (make pages available in cache)
    • release (make pages eligible for eviction)
    • get_page_list (get the full list of pages for use in a vmfb kernel invocation)
from abc import ABC, abstractmethod
from typing import List

class PageAllocation(ABC):
    """
    Abstract base class for page allocations in the cache.
    Subclasses only need to implement the core allocation methods.
    """
    @abstractmethod
    def get_page_list(self) -> List[PageInfo]:
        """Returns the list of pages that were allocated."""
        pass

    @abstractmethod
    def publish_pages(self) -> None:
        """
        Makes pages available to other requests after writing is complete.
        Associates tokens with pages and marks them as ready for reading.
        """
        pass

    @abstractmethod
    def release_pages(self) -> None:
        """
        Releases the allocation's reference to pages.
        Pages become eligible for eviction when their reference count reaches zero.
        """
        pass
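As a sanity check of the interface, here is a minimal, hypothetical reference-counting implementation. `PageInfo` is a stand-in dataclass (the real type lives in the page pool), and `RefCountedAllocation` is an illustrative name, not part of the proposal; it just shows how the cached/fresh boundary and the publish/release semantics could fit together.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageInfo:
    """Stand-in for the pool's page metadata; only the index matters here."""
    index: int


class PageAllocation(ABC):
    @abstractmethod
    def get_page_list(self) -> List[PageInfo]: ...

    @abstractmethod
    def publish_pages(self) -> None: ...

    @abstractmethod
    def release_pages(self) -> None: ...


class RefCountedAllocation(PageAllocation):
    """Hypothetical allocation tracking cached vs. newly allocated pages,
    refcounting them so the pool can evict when a count hits zero."""

    def __init__(
        self,
        cached: List[PageInfo],
        fresh: List[PageInfo],
        refcounts: Dict[int, int],
    ):
        self._cached = cached        # pages reused from the cache
        self._fresh = fresh          # pages newly allocated for this request
        self._refcounts = refcounts  # shared with the (imaginary) pool
        for p in cached + fresh:
            refcounts[p.index] = refcounts.get(p.index, 0) + 1
        self._published = False

    def get_page_list(self) -> List[PageInfo]:
        # Cached pages come first; the boundary is len(self._cached).
        return self._cached + self._fresh

    def publish_pages(self) -> None:
        # A real pool would associate token spans with pages and mark them
        # readable by other requests; here we just record the transition.
        self._published = True

    def release_pages(self) -> None:
        # Drop this allocation's reference; pages with a zero count become
        # eligible for eviction.
        for p in self.get_page_list():
            self._refcounts[p.index] -= 1
```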

renxida commented Nov 25, 2024

We currently have this in InferenceExecRequest:

def cache_page_indices(self, max_len: int) -> list[int]:
    if not self.locked_pages:
        return []
    indices = [p.index for p in self.locked_pages]
    if len(indices) > max_len:
        return indices[0:max_len]
    return indices

def free_cache_pages(self):
    cache = self._cache
    if cache:
        pages = self.locked_pages
        self._cache = None
        self.locked_pages = None
        cache.release_pages(pages)

def lock_initial_cache_pages(
    self, cache: AttnPageCache, pages: list[AttnPageEntry]
):
    assert not self._cache
    self._cache = cache
    self.locked_pages = pages

def lock_new_cache_pages(self, cache: AttnPageCache, pages: list[AttnPageEntry]):
    assert self._cache is cache
    self.locked_pages.extend(pages)


renxida commented Nov 25, 2024

The new get_page_list and release_pages methods would correspond to cache_page_indices and free_cache_pages, respectively.

A PageAllocation should be created in lock_initial_cache_pages. lock_new_cache_pages should acquire another PageAllocation and then release the original one; thanks to caching, the newly acquired pages would overlap maximally with the existing pages.
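The acquire-then-release swap described above can be sketched with toy stand-ins. `FakeCache`, `FakeAllocation`, the `acquire` helper, and `extend_allocation` are all hypothetical names for illustration; the point is that because the new allocation reuses the old allocation's pages, extending costs no copies.

```python
from typing import List, Optional


class FakeAllocation:
    """Toy allocation: just a page-index list plus a released flag."""

    def __init__(self, pages: List[int]):
        self.pages = pages
        self.released = False

    def release_pages(self) -> None:
        self.released = True


class FakeCache:
    """Toy stand-in for AttnPageCache. acquire() returns an allocation
    covering num_pages, reusing an existing allocation's pages first."""

    def __init__(self):
        self._next_index = 0

    def acquire(
        self, num_pages: int, existing: Optional[FakeAllocation] = None
    ) -> FakeAllocation:
        # Reuse pages from the existing allocation where possible (the
        # "maximal overlap" described above), then allocate the rest.
        reused = list(existing.pages[:num_pages]) if existing else []
        fresh = []
        for _ in range(num_pages - len(reused)):
            fresh.append(self._next_index)
            self._next_index += 1
        return FakeAllocation(reused + fresh)


def extend_allocation(
    cache: FakeCache, current: FakeAllocation, num_pages: int
) -> FakeAllocation:
    """Acquire a larger allocation, then release the old one. The new
    allocation overlaps the old pages, so no page data moves."""
    new_alloc = cache.acquire(num_pages, existing=current)
    current.release_pages()
    return new_alloc
```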

renxida added a commit that referenced this issue Nov 26, 2024
Creates space for #593 (prefix-sharing)

Coming next: #607, which should be the last thing I do before I can
check in my blocktrie implementation.

Summary of changes:
- copied over stella's cache.py and renamed it to page_pool.py
- each inference request now notifies the cache when its pages are done being written to

renxida commented Nov 26, 2024

Implementing on #608
