Feature Context
As part of a larger initiative to provide advanced model server integration with load balancers such as the LLM Instance Gateway, which demonstrates dramatic performance improvements, this feature would provision Triton Server with the tools to easily include composite metrics in the response headers of its inference/generate endpoints, communicating its state to the instance gateway for effective load balancing.
This doc captures the overall requirements for model servers to integrate with the LLM Instance Gateway. Triton with the TensorRT-LLM backend already exposes metrics well suited to efficient load balancing, such as the nv_trt_llm_kv_cache_block_metrics metric family, from which we can derive a composite KV-cache utilization metric.
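For illustration, the derivation could look like the following; the label names and sample values below are assumptions for the sketch, not the backend's documented output:

```
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="m"} 400
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="m"} 1000
```

From these two gauges, kv_cache_utilization = used / max = 400 / 1000 = 0.4.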
Server Integration
The goal is to provide a way to capture specific metrics and report them in the ORCA format through the Triton Server frontend. To do this, we will parse specific metric families from the serialized metrics that the Triton Core libraries provide and that Triton Server exposes at its metrics endpoint.
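As a minimal sketch of that parsing step, assuming the serialized metrics arrive as a Prometheus text exposition buffer (the function names and the label set shown are illustrative, not Triton's actual API):

```cpp
#include <iostream>
#include <map>
#include <optional>
#include <sstream>
#include <string>

// Extract the value of `label` from a Prometheus label set such as
// {kv_cache_block_type="used",model="m"}.
std::optional<std::string> LabelValue(const std::string& line,
                                      const std::string& label) {
  const std::string key = label + "=\"";
  auto start = line.find(key);
  if (start == std::string::npos) return std::nullopt;
  start += key.size();
  auto end = line.find('"', start);
  if (end == std::string::npos) return std::nullopt;
  return line.substr(start, end - start);
}

// Collect one metric family from the serialized metrics text, keyed by the
// kv_cache_block_type label ("used", "max", ...).
std::map<std::string, double> ParseKVCacheBlockMetrics(
    const std::string& serialized) {
  std::map<std::string, double> values;
  std::istringstream stream(serialized);
  std::string line;
  while (std::getline(stream, line)) {
    // Skip HELP/TYPE comments and every other metric family.
    if (line.rfind("nv_trt_llm_kv_cache_block_metrics{", 0) != 0) continue;
    auto type = LabelValue(line, "kv_cache_block_type");
    auto value_pos = line.rfind(' ');
    if (!type || value_pos == std::string::npos) continue;
    values[*type] = std::stod(line.substr(value_pos + 1));
  }
  return values;
}

int main() {
  const std::string serialized =
      "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"used\",model=\"m\"} 400\n"
      "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"max\",model=\"m\"} 1000\n";
  auto blocks = ParseKVCacheBlockMetrics(serialized);
  if (blocks.count("used") && blocks.count("max") && blocks["max"] > 0) {
    std::cout << "kv_cache_utilization=" << blocks["used"] / blocks["max"]
              << std::endl;  // prints 0.4
  }
  return 0;
}
```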
This feature will be controlled by a runtime environment variable, ORCA_METRIC_FORMAT. If it is not set in the environment where Triton Server is deployed, the feature is simply disabled. If set, ORCA_METRIC_FORMAT defines the format of the endpoint-load-metrics response header contents, as per the inline per-request formats. We plan to first implement the "http" and "json" formats, and later the binary protobuf format.
Therefore, currently the valid values for ORCA_METRIC_FORMAT are:
(unset)
"http"
"json"
Any other value will behave as if the variable were unset, while also logging an error. Example header values for the supported formats are shown below.
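For illustration, the resulting response header might look as follows; the exact ORCA field and prefix spellings are assumptions based on the per-request report formats, and kv_cache_utilization is a hypothetical named metric:

```
# ORCA_METRIC_FORMAT="http" (native HTTP format)
endpoint-load-metrics: TEXT named_metrics.kv_cache_utilization=0.4

# ORCA_METRIC_FORMAT="json"
endpoint-load-metrics: JSON {"named_metrics": {"kv_cache_utilization": 0.4}}
```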
Triton Server will report the captured metrics in the selected format in the response header. Should the desired metrics be unavailable, the server will simply not include this header in its response. Additionally, this feature will be entirely wrapped in an #ifdef and will only be compiled if metrics are enabled on the server.
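A minimal sketch of the compile-time and runtime gating, assuming Triton's existing TRITON_ENABLE_METRICS build flag; the enum and helper names are illustrative:

```cpp
#ifdef TRITON_ENABLE_METRICS
#include <cstdlib>
#include <iostream>
#include <string>

enum class OrcaFormat { kDisabled, kHttp, kJson };

// Resolve the reporting format from the environment; unset or unrecognized
// values disable the feature (unrecognized values also log an error).
OrcaFormat GetOrcaFormat() {
  const char* fmt = std::getenv("ORCA_METRIC_FORMAT");
  if (fmt == nullptr) {
    return OrcaFormat::kDisabled;  // unset: feature off by default
  }
  const std::string value(fmt);
  if (value == "http") return OrcaFormat::kHttp;
  if (value == "json") return OrcaFormat::kJson;
  std::cerr << "error: unsupported ORCA_METRIC_FORMAT \"" << value << "\""
            << std::endl;
  return OrcaFormat::kDisabled;
}
#endif  // TRITON_ENABLE_METRICS
```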
cc @yinggeh @krishung5 @jbkyang-nvi