Feature Context
As part of a larger initiative to provide advanced model server integration with load balancers such as the LLM Instance Gateway, which demonstrates dramatic performance improvements, this feature would provision Triton Server with the tools to easily include composite metrics in the response headers of its inference/generate endpoints, communicating its state to the instance gateway for effective load balancing.
This doc captures the overall requirements for model servers to integrate with the LLM Instance Gateway. Triton with the TensorRT-LLM backend already exposes metrics well suited to efficient load balancing, such as the nv_trt_llm_kv_cache_block_metrics metric family, from which we can derive a composite KV-cache utilization metric.
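For illustration, the derivation could look like the following; the label names and sample values below are assumptions for the sketch, not the backend's documented output:

```
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="m"} 400
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="m"} 1000
```

From these two gauges, kv_cache_utilization = used / max = 400 / 1000 = 0.4.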
Server Integration
The goal is to provide a way to capture specific metrics and report them in the ORCA format through the Triton Server frontend. To do this, we will parse specific metric families from the serialized metrics that the Triton Core libraries provide and that Triton Server exposes at its metrics endpoint.
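As a minimal sketch of that parsing step, assuming the serialized metrics arrive as a Prometheus text exposition buffer (the function names and the label set shown are illustrative, not Triton's actual API):

```cpp
#include <iostream>
#include <map>
#include <optional>
#include <sstream>
#include <string>

// Extract the value of `label` from a Prometheus label set such as
// {kv_cache_block_type="used",model="m"}.
std::optional<std::string> LabelValue(const std::string& line,
                                      const std::string& label) {
  const std::string key = label + "=\"";
  auto start = line.find(key);
  if (start == std::string::npos) return std::nullopt;
  start += key.size();
  auto end = line.find('"', start);
  if (end == std::string::npos) return std::nullopt;
  return line.substr(start, end - start);
}

// Collect one metric family from the serialized metrics text, keyed by the
// kv_cache_block_type label ("used", "max", ...).
std::map<std::string, double> ParseKVCacheBlockMetrics(
    const std::string& serialized) {
  std::map<std::string, double> values;
  std::istringstream stream(serialized);
  std::string line;
  while (std::getline(stream, line)) {
    // Skip HELP/TYPE comments and every other metric family.
    if (line.rfind("nv_trt_llm_kv_cache_block_metrics{", 0) != 0) continue;
    auto type = LabelValue(line, "kv_cache_block_type");
    auto value_pos = line.rfind(' ');
    if (!type || value_pos == std::string::npos) continue;
    values[*type] = std::stod(line.substr(value_pos + 1));
  }
  return values;
}

int main() {
  const std::string serialized =
      "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"used\",model=\"m\"} 400\n"
      "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"max\",model=\"m\"} 1000\n";
  auto blocks = ParseKVCacheBlockMetrics(serialized);
  if (blocks.count("used") && blocks.count("max") && blocks["max"] > 0) {
    std::cout << "kv_cache_utilization=" << blocks["used"] / blocks["max"]
              << std::endl;  // prints 0.4
  }
  return 0;
}
```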
This feature will be controlled by a runtime environment variable, ORCA_METRIC_FORMAT. If it is not set in the environment where Triton Server is deployed, the feature is simply disabled. If set, ORCA_METRIC_FORMAT defines the format of the endpoint-load-metrics response header contents, as per the inline per-request formats. We plan to first implement the "http" and "json" formats, and later the binary protobuf format.
Therefore, currently the valid values for ORCA_METRIC_FORMAT are:
(unset)
"http"
"json"
Any other value will behave as if the variable were unset, while also logging an error. Example header values for the supported formats are shown below.
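For illustration, the resulting response header might look as follows; the exact ORCA field and prefix spellings are assumptions based on the per-request report formats, and kv_cache_utilization is a hypothetical named metric:

```
# ORCA_METRIC_FORMAT="http" (native HTTP format)
endpoint-load-metrics: TEXT named_metrics.kv_cache_utilization=0.4

# ORCA_METRIC_FORMAT="json"
endpoint-load-metrics: JSON {"named_metrics": {"kv_cache_utilization": 0.4}}
```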
Triton Server will report the captured metrics in the selected format in the response header. Should the desired metrics be unavailable, the server will simply not include this header in its response. Additionally, this feature will be entirely wrapped in an #ifdef and will only be compiled if metrics are enabled on the server.
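A minimal sketch of the compile-time and runtime gating, assuming Triton's existing TRITON_ENABLE_METRICS build flag; the enum and helper names are illustrative:

```cpp
#ifdef TRITON_ENABLE_METRICS
#include <cstdlib>
#include <iostream>
#include <string>

enum class OrcaFormat { kDisabled, kHttp, kJson };

// Resolve the reporting format from the environment; unset or unrecognized
// values disable the feature (unrecognized values also log an error).
OrcaFormat GetOrcaFormat() {
  const char* fmt = std::getenv("ORCA_METRIC_FORMAT");
  if (fmt == nullptr) {
    return OrcaFormat::kDisabled;  // unset: feature off by default
  }
  const std::string value(fmt);
  if (value == "http") return OrcaFormat::kHttp;
  if (value == "json") return OrcaFormat::kJson;
  std::cerr << "error: unsupported ORCA_METRIC_FORMAT \"" << value << "\""
            << std::endl;
  return OrcaFormat::kDisabled;
}
#endif  // TRITON_ENABLE_METRICS
```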
cc @yinggeh @krishung5 @jbkyang-nvi