[Feature]: ORCA format reporting for KV-Cache metrics in Inference Response Header #7865

Open
BenjaminBraunDev opened this issue Dec 10, 2024 · 0 comments


Feature Context

As part of a larger initiative to provide advanced model server integration with load balancers such as LLM Instance Gateway, which demonstrates dramatic performance improvements, this feature would give Triton Server the tools to include composite metrics in its inference generate response header, communicating its state to the instance gateway for effective load balancing.

This doc captures the overall requirements for model servers to integrate with LLM Instance Gateway. Fortunately, Triton with the TensorRT-LLM backend already exposes many features and metrics that enable efficient load balancing, such as the nv_trt_llm_kv_cache_block_metrics metric, from which we can derive a composite KV-cache utilization metric.

Server Integration

The goal is to provide a way to capture specific metrics and report them in the ORCA format through the Triton Server frontend. To do this, we will parse specific metric families from the serialized metrics that the Triton Core libraries provide and that Triton Server exposes at its metrics endpoint.
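To make the parsing step concrete, here is a minimal Python sketch (the real change would live in the C++ frontend). It assumes the serialized metrics use the Prometheus text exposition format and that nv_trt_llm_kv_cache_block_metrics carries a kv_cache_block_type label with "used" and "max" samples. The sample values, the helper names parse_family and kv_cache_utilization, and the choice of used / max as the composite utilization are illustrative assumptions, not the final design.

```python
import re

# Assumed sample of the text served at the metrics endpoint; values are made up.
SAMPLE = '''\
# HELP nv_trt_llm_kv_cache_block_metrics TRT LLM KV cache block metrics
# TYPE nv_trt_llm_kv_cache_block_metrics gauge
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 120
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 480
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
'''

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_family(text, family):
    """Return a list of (labels, value) pairs for one metric family."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match and match.group('name') == family:
            labels = dict(LABEL.findall(match.group('labels') or ''))
            samples.append((labels, float(match.group('value'))))
    return samples


def kv_cache_utilization(text):
    """Hypothetical composite metric: used blocks divided by max blocks."""
    by_type = {
        labels.get('kv_cache_block_type'): value
        for labels, value in parse_family(text, 'nv_trt_llm_kv_cache_block_metrics')
    }
    used, maximum = by_type.get('used'), by_type.get('max')
    if used is None or not maximum:
        return None  # metric unavailable, so the caller should omit the header
    return used / maximum


print(kv_cache_utilization(SAMPLE))
```

This sketch ignores the model and version labels, so it would need to group samples per model on a server that hosts more than one TensorRT-LLM model.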

This feature will be controlled by a runtime environment variable, ORCA_METRIC_FORMAT. If it is not set in the environment where Triton Server is deployed, the feature is disabled. If set, ORCA_METRIC_FORMAT defines the format of the contents of the endpoint-load-metrics response header, as per the inline per-request formats. We plan to implement the "http" and "json" formats first and the binary protobuf format later.

Therefore, currently the valid values for ORCA_METRIC_FORMAT are:

  • (unset)
  • "http"
  • "json"

Any other value will behave as if the variable were unset, and will also log an error.
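The selection rules above can be sketched in Python as follows. The function name resolve_orca_format is hypothetical, and treating the value as case-sensitive (and an empty string as invalid) is an assumption that the proposal does not settle.

```python
import logging
import os

VALID_FORMATS = ("http", "json")


def resolve_orca_format(environ=None):
    """Return "http", "json", or None when the feature should stay disabled."""
    env = os.environ if environ is None else environ
    raw = env.get("ORCA_METRIC_FORMAT")
    if raw is None:
        return None  # unset: feature disabled, nothing logged
    if raw in VALID_FORMATS:
        return raw
    # Any other value: behave as if unset, but report the problem.
    logging.error("Unsupported ORCA_METRIC_FORMAT value %r; ORCA reporting disabled", raw)
    return None
```

Passing a plain dict instead of reading os.environ directly keeps the rule easy to exercise, for example resolve_orca_format({"ORCA_METRIC_FORMAT": "json"}).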

Triton Server will report the captured metrics in the selected format in the response header. If the desired metrics are unavailable, the server will omit this header from its response. Additionally, this feature will be entirely wrapped in an #ifdef and will only be compiled if metrics are enabled on the server.
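A rough Python sketch of how the header value might be built for the two initial formats. The "TEXT" and "JSON" prefixes, the named_metrics grouping, and the kv_cache_utilization key reflect one reading of the ORCA inline per-request formats and should be treated as assumptions; build_orca_header is a hypothetical helper, and the compile-time #ifdef guard has no equivalent here.

```python
import json

HEADER_NAME = "endpoint-load-metrics"


def build_orca_header(metrics, fmt):
    """Return the header value, or None when the header should be omitted."""
    if not metrics or fmt is None:
        return None  # metrics unavailable or feature disabled: no header
    if fmt == "json":
        # Assumed shape: custom metrics grouped under "named_metrics".
        return "JSON " + json.dumps({"named_metrics": metrics}, sort_keys=True)
    if fmt == "http":
        # Assumed shape: comma-separated key=value pairs.
        pairs = ", ".join(
            f"named_metrics.{key}={value}" for key, value in sorted(metrics.items())
        )
        return "TEXT " + pairs
    return None


value = build_orca_header({"kv_cache_utilization": 0.25}, "http")
if value is not None:
    print(f"{HEADER_NAME}: {value}")
```

Returning None rather than an empty string lets the caller skip the header entirely, which matches the behavior described for unavailable metrics.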

cc @yinggeh @krishung5 @jbkyang-nvi
