feat: ORCA Format KV Cache Utilization in Inference Response Header #7839
What does the PR do?
This PR adds code to HTTPAPIServer::HandleGenerate inside src/http_server.cc to add both kv_cache_utilization and max_token_capacity metrics, composed from the existing prometheus metrics in the TensorRT-LLM backend's nv_trt_llm_kv_cache_block_metrics metric family. This is accomplished by parsing the serialized prometheus metrics text object, provided to the Triton Server frontend by the Triton Core libraries, into a structured vector of metrics for a specific metric family.
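For reviewers unfamiliar with the prometheus text exposition format, here is a minimal sketch of the parsing idea. PromMetric is a hypothetical stand-in for the type this PR adds, and the real MetricFamilyExtractor is a member of HTTPAPIServer with a different signature; this is an illustration of the approach, not the PR's code.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the PromMetric type added by this PR:
// one parsed sample line, holding its label map and value.
struct PromMetric {
  std::map<std::string, std::string> labels;
  double value = 0.0;
};

std::vector<PromMetric>
MetricFamilyExtractor(const std::string& serialized, const std::string& family)
{
  std::vector<PromMetric> metrics;
  std::istringstream input(serialized);
  std::string line;
  while (std::getline(input, line)) {
    // Skip blank lines and "# HELP" / "# TYPE" comments, keeping only
    // sample lines that belong to the requested metric family.
    if (line.empty() || line[0] == '#' ||
        line.compare(0, family.size(), family) != 0) {
      continue;
    }
    PromMetric metric;
    const size_t open = line.find('{');
    const size_t close = line.find('}');
    if (open != std::string::npos && close != std::string::npos) {
      // Split the {key="value",...} block into the metric's label map.
      std::istringstream labels(line.substr(open + 1, close - open - 1));
      std::string pair;
      while (std::getline(labels, pair, ',')) {
        const size_t eq = pair.find('=');
        if (eq == std::string::npos) {
          continue;
        }
        std::string val = pair.substr(eq + 1);
        val.erase(std::remove(val.begin(), val.end(), '"'), val.end());
        metric.labels[pair.substr(0, eq)] = val;
      }
    }
    // The sample value follows the final space on the line.
    metric.value = std::stod(line.substr(line.rfind(' ') + 1));
    metrics.push_back(metric);
  }
  return metrics;
}
```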
Checklist
Agreement
- PR title follows the <commit_type>: <Title> format
- Ran pre-commit locally (pre-commit install, pre-commit run --all)

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Where should the reviewer start?
Changes are contained to 2 files:

src/http_server.cc
src/http_server.h (the former's header file)

The changes start in HTTPAPIServer::HandleGenerate, where the environment variable is checked and the header is written. There are 2 other functions below it: HTTPAPIServer::MetricFamilyExtractor, which parses serialized prometheus metrics into a vector of PromMetric objects (each carrying a map of its metric labels), and HTTPAPIServer::ExtractKVMetrics, which pulls the values from the structured metrics and forms a header in the ORCA format specified by ORCA_HEADER_METRIC_TYPE. If there is no backend, no metrics are found for the header, or the format type is invalid, the header is simply not written. A sketch of the extraction step follows.
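The sketch below reuses the PromMetric stand-in from above. The kv_cache_block_type label values ("used", "max", "tokens_per") follow the TensorRT-LLM metric family; the derivation of max_token_capacity and the exact ORCA serialization strings are my assumptions for illustration, not verbatim from the PR.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch only: derive the two metrics from the parsed family and render
// them in the format selected by ORCA_HEADER_METRIC_TYPE. An empty return
// means "write no header", matching the fall-through behavior described
// above.
std::string
ExtractKVMetrics(
    const std::vector<PromMetric>& metrics, const std::string& orca_type)
{
  double used = 0.0, max = 0.0, tokens_per_block = 0.0;
  for (const auto& m : metrics) {
    const auto it = m.labels.find("kv_cache_block_type");
    if (it == m.labels.end()) {
      continue;
    }
    if (it->second == "used") {
      used = m.value;
    } else if (it->second == "max") {
      max = m.value;
    } else if (it->second == "tokens_per") {
      tokens_per_block = m.value;
    }
  }
  // No usable samples (e.g., a non-TensorRT-LLM backend): skip the header.
  if (max <= 0.0) {
    return "";
  }

  const double utilization = used / max;
  const uint64_t max_token_capacity =
      static_cast<uint64_t>(max * tokens_per_block);

  // Assumed serializations for the two supported format types.
  if (orca_type == "json") {
    return "JSON {\"kv_cache_utilization\":" + std::to_string(utilization) +
           ",\"max_token_capacity\":" + std::to_string(max_token_capacity) +
           "}";
  }
  if (orca_type == "http") {
    return "kv_cache_utilization=" + std::to_string(utilization) +
           ", max_token_capacity=" + std::to_string(max_token_capacity);
  }
  return "";  // invalid format type: skip the header
}
```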
Test plan:
The feature is gated behind a feature flag in the form of the ORCA_HEADER_METRIC_TYPE environment variable. If unset, the feature is effectively disabled. Beyond that, the changes have been manually tested to cause no issues if either the queried metrics are not present (such as when TensorRT-LLM is not being used as the backend) or the ORCA header metric type is invalid. In either case, nothing is parsed and no header is written. All code changes are wrapped in an #ifdef and are only included if metrics are enabled during the Triton Server build.
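As a sketch of this double gate (compile-time and run-time): TRITON_ENABLE_METRICS is the define Triton Server builds use for metrics support, though the PR text only says "an #ifdef", and the helper name here is hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Returns the requested ORCA format, or "" when the feature is off.
// The body is compiled only when metrics are enabled in the build.
std::string
OrcaMetricFormat()
{
#ifdef TRITON_ENABLE_METRICS  // assumed build define; PR says "an #ifdef"
  const char* type = std::getenv("ORCA_HEADER_METRIC_TYPE");
  if (type != nullptr) {
    return std::string(type);  // e.g. "json" or "http"
  }
#endif
  return "";  // env var unset (or metrics disabled at build time)
}
```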
Caveats:
This change only implements the kv-cache utilization metrics, but the functions it adds allow other metrics to be added easily.
Background
This doc captures the overall requirements for model servers to integrate with the LLM instance gateway. More details are in the Feature Request below.
Related Issues:
Screenshots
Response header before changes (or if the ORCA_HEADER_METRIC_TYPE environment variable is unset):
[screenshot omitted]
Response header with ORCA_HEADER_METRIC_TYPE="json":
[screenshot omitted]
Response header with ORCA_HEADER_METRIC_TYPE="http":
[screenshot omitted]