
What's the query to calculate Triton model latency per request? Is it nv_inference_request_duration_us / nv_inference_exec_count + nv_inference_queue_duration_us? #7692

Open
jayakommuru opened this issue Oct 11, 2024 · 1 comment

jayakommuru commented Oct 11, 2024

We are benchmarking Triton with different backends, but we are unable to work out the metric query that calculates the latency of each request (let's assume each request has a batch size of b).

  1. Is request latency = rate(nv_inference_request_duration_us[1m]) / rate(nv_inference_exec_count[1m]) + nv_inference_queue_duration_us? (See the sketch after this list.)
  2. Does nv_inference_request_duration_us include the queuing duration as well? The documentation says it is cumulative; can anyone confirm?
  3. Are the compute_input and compute_output durations also included in nv_inference_request_duration_us?
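For reference, here is the candidate query from question (1) written out as a PromQL expression. This is only a sketch of what the question proposes, not a confirmed formula; whether queue time needs to be added on top of nv_inference_request_duration_us is exactly what this issue asks. Note that all three Triton metrics are cumulative counters, so the queue-duration term presumably also needs to be rate-averaged over the same window rather than used raw, otherwise the units don't line up:

```
# Candidate average per-request latency in microseconds over a 1m window.
# Assumption: dividing by nv_inference_exec_count is the intended normalization;
# whether the queue term double-counts time already included in
# nv_inference_request_duration_us is the open question in this issue.
(
  rate(nv_inference_request_duration_us[1m])
  / rate(nv_inference_exec_count[1m])
)
+
(
  rate(nv_inference_queue_duration_us[1m])
  / rate(nv_inference_exec_count[1m])
)
```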
jayakommuru (Author) commented

@oandreeva-nv can you help with this?
