
feat: ORCA Format KV Cache Utilization in Inference Response Header #7839

Open · wants to merge 1 commit into base: r24.10
Conversation


@BenjaminBraunDev BenjaminBraunDev commented Nov 27, 2024

What does the PR do?

This PR adds code to HTTPAPIServer::HandleGenerate inside src/http_server.cc to report both kv_cache_utilization and max_token_capacity metrics, composed from the existing Prometheus metrics in the TensorRT-LLM backend's nv_trt_llm_kv_cache_block_metrics metric family, in the inference response header.

This is accomplished by parsing the serialized Prometheus metrics text provided to the Triton Server frontend by the Triton Core libraries into a structured vector of metrics for a specific metric family.
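
For illustration, a minimal sketch of the kind of structured representation the serialized Prometheus text could be parsed into. The names here (PromMetric, labels, value) are assumptions for readability, not necessarily the PR's exact declarations:

```cpp
#include <map>
#include <string>
#include <vector>

// One parsed sample line from the requested metric family, e.g. a single
// "nv_trt_llm_kv_cache_block_metrics{...} <value>" line from the
// serialized Prometheus text.
struct PromMetric {
  std::map<std::string, std::string> labels;  // label name -> label value
  double value = 0.0;                         // the sample's numeric value
};

// The extractor would return one entry per sample in the family.
using PromMetricFamily = std::vector<PromMetric>;
```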

Checklist

  • I have read the Contribution guidelines and signed the Contributor License
    Agreement
  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • I ran pre-commit locally (pre-commit install, pre-commit run --all)
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Where should the reviewer start?

Changes are contained to 2 files:

  • src/http_server.cc
  • src/http_server.h (the former's header file)

The changes start in HTTPAPIServer::HandleGenerate, where the environment variable is checked and the header is written. Two other functions are added below it: HTTPAPIServer::MetricFamilyExtractor, which parses serialized Prometheus metrics into a vector of PromMetric (each holding a map of its metric labels), and HTTPAPIServer::ExtractKVMetrics, which pulls the values from the structured metrics and forms a header in the ORCA format specified by ORCA_HEADER_METRIC_TYPE. If there is no backend, no metrics are found for the header, or the format type is invalid, the header is simply not written.
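
To make that flow concrete, here is a hedged sketch of the extraction step: given the parsed metric family, derive the two reported values and serialize them for the header. The label keys ("kv_cache_block_type", "used", "max", "tokens_per"), the derivations, and the exact ORCA wire formats below are assumptions for illustration, not the PR's actual strings:

```cpp
#include <map>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

struct PromMetric {  // as sketched earlier: label map plus sample value
  std::map<std::string, std::string> labels;
  double value = 0.0;
};

// Returns the ORCA-formatted header value, or nullopt if the required
// metrics are missing or the requested format is unrecognized, in which
// case the caller writes no header at all.
std::optional<std::string>
ExtractKVMetricsSketch(
    const std::vector<PromMetric>& family, const std::string& format)
{
  double used = -1.0, max_blocks = -1.0, tokens_per_block = -1.0;
  for (const auto& m : family) {
    auto it = m.labels.find("kv_cache_block_type");  // assumed label key
    if (it == m.labels.end()) {
      continue;
    }
    if (it->second == "used") {
      used = m.value;
    } else if (it->second == "max") {
      max_blocks = m.value;
    } else if (it->second == "tokens_per") {
      tokens_per_block = m.value;
    }
  }
  if (used < 0.0 || max_blocks <= 0.0 || tokens_per_block < 0.0) {
    return std::nullopt;  // metric family absent (e.g. non-TRT-LLM backend)
  }

  // Assumed derivations for the two reported values.
  const double kv_cache_utilization = used / max_blocks;
  const double max_token_capacity = max_blocks * tokens_per_block;

  std::ostringstream out;
  if (format == "json") {
    out << "JSON {\"named_metrics\": {\"kv_cache_utilization\": "
        << kv_cache_utilization << ", \"max_token_capacity\": "
        << max_token_capacity << "}}";
  } else if (format == "http") {
    out << "named_metrics.kv_cache_utilization=" << kv_cache_utilization
        << ", named_metrics.max_token_capacity=" << max_token_capacity;
  } else {
    return std::nullopt;  // invalid ORCA_HEADER_METRIC_TYPE
  }
  return out.str();
}
```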

Test plan:

The feature is gated behind a feature flag in the form of the ORCA_HEADER_METRIC_TYPE environment variable. If unset, the feature is effectively disabled. Beyond that, the changes have been manually tested to not cause issues if either the queried metrics are not present (such as when TensorRT-LLM is not the backend) or the ORCA header metric type is invalid. In either case, nothing is parsed and no header is written. All code changes are wrapped in an #ifdef and are only compiled if metrics are enabled during the Triton Server build.
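
As a sketch of the gate itself (assuming the usual TRITON_ENABLE_METRICS build guard; the helper name below is hypothetical, not the PR's code):

```cpp
#include <cstdlib>
#include <string>

#ifdef TRITON_ENABLE_METRICS
// Returns true and fills `format` only when ORCA_HEADER_METRIC_TYPE is set,
// e.g. to "json" or "http"; otherwise the response is left untouched.
bool
OrcaHeaderRequested(std::string& format)
{
  const char* type = std::getenv("ORCA_HEADER_METRIC_TYPE");
  if (type == nullptr || *type == '\0') {
    return false;  // unset: feature disabled
  }
  format = type;
  return true;
}
#endif  // TRITON_ENABLE_METRICS
```

In practice the gate means existing deployments see no behavior change unless the variable is exported before launching tritonserver.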

Caveats:

This change only implements the KV-cache utilization metrics, but the functions it adds allow other metrics to be added easily.

Background

This doc captures the overall requirements for model servers to integrate with llm instance gateway. More details in the Feature Request below.

Related Issues:

Screenshots

Response header before changes (or if ORCA_HEADER_METRIC_TYPE environment variable is unset):
orca_before

Response header with ORCA_HEADER_METRIC_TYPE="json":
orca_json

Response header with ORCA_HEADER_METRIC_TYPE="http":
orca_http

@BenjaminBraunDev BenjaminBraunDev changed the title ORCA Format KV Cache Utilization in Inference Response Header feat: ORCA Format KV Cache Utilization in Inference Response Header Dec 10, 2024
… for use in HandleGenerate to add kv_utilization and max_token_capacity to the inference request response header.
@BenjaminBraunDev BenjaminBraunDev marked this pull request as ready for review December 11, 2024 02:21