feat: ORCA Format KV Cache Utilization in Inference Response Header #7839
What does the PR do?
This PR adds code to HTTPAPIServer::HandleGenerate inside src/http_server.cc to add both kv_cache_utilization and max_token_capacity metrics, composed from the existing prometheus metrics in the TensorRT-LLM backend's nv_trt_llm_kv_cache_block_metrics metric family. This is accomplished by parsing the serialized prometheus metrics text object, provided to the Triton Server frontend by the Triton Core libraries, into a structured vector of metrics for a specific metric family.
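For reviewers unfamiliar with the prometheus text exposition format, here is a minimal sketch of the parsing idea. PromMetric is a hypothetical stand-in for the type this PR adds, and the real MetricFamilyExtractor is a member of HTTPAPIServer with a different signature; this is an illustration of the approach, not the PR's code.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the PromMetric type added by this PR:
// one parsed sample line, holding its label map and value.
struct PromMetric {
  std::map<std::string, std::string> labels;
  double value = 0.0;
};

std::vector<PromMetric>
MetricFamilyExtractor(const std::string& serialized, const std::string& family)
{
  std::vector<PromMetric> metrics;
  std::istringstream input(serialized);
  std::string line;
  while (std::getline(input, line)) {
    // Skip blank lines and "# HELP" / "# TYPE" comments, keeping only
    // sample lines that belong to the requested metric family.
    if (line.empty() || line[0] == '#' ||
        line.compare(0, family.size(), family) != 0) {
      continue;
    }
    PromMetric metric;
    const size_t open = line.find('{');
    const size_t close = line.find('}');
    if (open != std::string::npos && close != std::string::npos) {
      // Split the {key="value",...} block into the metric's label map.
      std::istringstream labels(line.substr(open + 1, close - open - 1));
      std::string pair;
      while (std::getline(labels, pair, ',')) {
        const size_t eq = pair.find('=');
        if (eq == std::string::npos) {
          continue;
        }
        std::string val = pair.substr(eq + 1);
        val.erase(std::remove(val.begin(), val.end(), '"'), val.end());
        metric.labels[pair.substr(0, eq)] = val;
      }
    }
    // The sample value follows the final space on the line.
    metric.value = std::stod(line.substr(line.rfind(' ') + 1));
    metrics.push_back(metric);
  }
  return metrics;
}
```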
Checklist
Agreement
- PR title follows the <commit_type>: <Title> format
- Ran pre-commit locally (pre-commit install, pre-commit run --all)

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Where should the reviewer start?
Changes are contained to 2 files:

src/http_server.cc
src/http_server.h (the former's header file)

The changes start in HTTPAPIServer::HandleGenerate, where the environment variable is checked and the header is written. There are 2 other functions below it: HTTPAPIServer::MetricFamilyExtractor, which parses serialized prometheus metrics into a vector of PromMetric objects (each carrying a map of its metric labels), and HTTPAPIServer::ExtractKVMetrics, which pulls the values from the structured metrics and forms a header in the ORCA format specified by ORCA_HEADER_METRIC_TYPE. If there is no backend, no metrics are found for the header, or the format type is invalid, the header is simply not written. A sketch of the extraction step follows.
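The sketch below reuses the PromMetric stand-in from above. The kv_cache_block_type label values ("used", "max", "tokens_per") follow the TensorRT-LLM metric family; the derivation of max_token_capacity and the exact ORCA serialization strings are my assumptions for illustration, not verbatim from the PR.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch only: derive the two metrics from the parsed family and render
// them in the format selected by ORCA_HEADER_METRIC_TYPE. An empty return
// means "write no header", matching the fall-through behavior described
// above.
std::string
ExtractKVMetrics(
    const std::vector<PromMetric>& metrics, const std::string& orca_type)
{
  double used = 0.0, max = 0.0, tokens_per_block = 0.0;
  for (const auto& m : metrics) {
    const auto it = m.labels.find("kv_cache_block_type");
    if (it == m.labels.end()) {
      continue;
    }
    if (it->second == "used") {
      used = m.value;
    } else if (it->second == "max") {
      max = m.value;
    } else if (it->second == "tokens_per") {
      tokens_per_block = m.value;
    }
  }
  // No usable samples (e.g., a non-TensorRT-LLM backend): skip the header.
  if (max <= 0.0) {
    return "";
  }

  const double utilization = used / max;
  const uint64_t max_token_capacity =
      static_cast<uint64_t>(max * tokens_per_block);

  // Assumed serializations for the two supported format types.
  if (orca_type == "json") {
    return "JSON {\"kv_cache_utilization\":" + std::to_string(utilization) +
           ",\"max_token_capacity\":" + std::to_string(max_token_capacity) +
           "}";
  }
  if (orca_type == "http") {
    return "kv_cache_utilization=" + std::to_string(utilization) +
           ", max_token_capacity=" + std::to_string(max_token_capacity);
  }
  return "";  // invalid format type: skip the header
}
```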
Test plan:
The feature is gated behind a feature flag in the form of the ORCA_HEADER_METRIC_TYPE environment variable. If unset, the feature is effectively disabled. Beyond that, the changes have been manually tested to cause no issues if either the queried metrics are not present (such as when TensorRT-LLM is not being used as the backend) or the ORCA header metric type is invalid. In either case, nothing is parsed and no header is written. All code changes are wrapped in an #ifdef and are only included if metrics are enabled during the Triton Server build.
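As a sketch of this double gate (compile-time and run-time): TRITON_ENABLE_METRICS is the define Triton Server builds use for metrics support, though the PR text only says "an #ifdef", and the helper name here is hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Returns the requested ORCA format, or "" when the feature is off.
// The body is compiled only when metrics are enabled in the build.
std::string
OrcaMetricFormat()
{
#ifdef TRITON_ENABLE_METRICS  // assumed build define; PR says "an #ifdef"
  const char* type = std::getenv("ORCA_HEADER_METRIC_TYPE");
  if (type != nullptr) {
    return std::string(type);  // e.g. "json" or "http"
  }
#endif
  return "";  // env var unset (or metrics disabled at build time)
}
```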
Caveats:
This change only implements the kv-cache utilization metrics, but the functions it adds allow other metrics to be added easily.
Background
This doc captures the overall requirements for model servers to integrate with the LLM instance gateway. More details are in the Feature Request below.
Related Issues:
Screenshots
Response header before changes (or if the ORCA_HEADER_METRIC_TYPE environment variable is unset):
[screenshot omitted]
Response header with ORCA_HEADER_METRIC_TYPE="json":
[screenshot omitted]
Response header with ORCA_HEADER_METRIC_TYPE="http":
[screenshot omitted]