Add latency metrics #1870
After #1662 (initial metrics support) and #1756 (refactoring the chat endpoint), it will become practical to include the latency metrics that are important in production (courtesy of @Yard1). A natural place to add them would be the LLM engine or the chat completion API, whichever is less intrusive.
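For illustration, here is a minimal sketch of what engine-level latency metrics could look like, assuming a Prometheus-style `/metrics` endpoint. The metric names and the `RequestLatencyTracker` helper are hypothetical, not vLLM's actual implementation; they just show time-to-first-token and inter-token latency being recorded as histograms:

```python
# Illustrative sketch only -- metric names and the integration point are
# hypothetical, not vLLM's actual implementation.
import time

from prometheus_client import Histogram

# Time from request arrival to the first generated token (TTFT).
TTFT_SECONDS = Histogram(
    "llm_time_to_first_token_seconds",
    "Time from request arrival to first output token.",
)
# Time between consecutive output tokens (inter-token latency).
ITL_SECONDS = Histogram(
    "llm_inter_token_latency_seconds",
    "Time between consecutive output tokens.",
)


class RequestLatencyTracker:
    """Records per-request latency as tokens stream out of the engine."""

    def __init__(self) -> None:
        self.arrival = time.monotonic()
        self.last_token: float | None = None

    def on_token(self) -> None:
        now = time.monotonic()
        if self.last_token is None:
            # First token: observe time-to-first-token.
            TTFT_SECONDS.observe(now - self.arrival)
        else:
            # Subsequent tokens: observe inter-token latency.
            ITL_SECONDS.observe(now - self.last_token)
        self.last_token = now
```

Histograms (rather than gauges) let a Prometheus scrape recover percentiles such as p50/p99 latency across requests.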
I would suggest placing them in the engine - it will be more generic that way.
I am working on a PR for this.
First draft: #2316
#2764 looks to add a request-level histogram of token throughput.
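As a sketch of what such a request-level throughput histogram might look like (the metric name, helper function, and bucket boundaries are illustrative, not taken from #2764):

```python
# Hypothetical sketch of a request-level throughput histogram, in the
# spirit of what #2764 describes; names and buckets are illustrative.
import time

from prometheus_client import Histogram

REQUEST_TOKENS_PER_SECOND = Histogram(
    "llm_request_tokens_per_second",
    "Output tokens per second, measured per finished request.",
    buckets=(1, 5, 10, 25, 50, 100, 250, 500),
)


def record_request_throughput(num_output_tokens: int, start: float) -> None:
    """Observe one finished request's generation throughput."""
    elapsed = time.monotonic() - start
    if elapsed > 0:
        REQUEST_TOKENS_PER_SECOND.observe(num_output_tokens / elapsed)
```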
Hi @Yard1 @robertgshaw2-neuralmagic, when I do a
You need to make a request in order for the metrics to be populated.
I am making a request with a curl command and then monitoring the /metrics endpoint, but I can't see metrics like the ones in this screenshot. I think I may need to add something in api_server.py to point to metrics.py, but I'm unsure what.
Are you curling either
Yes.
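For anyone reproducing this check, a hedged end-to-end example: it assumes a vLLM OpenAI-compatible server on localhost:8000 and uses a placeholder model name. It issues one completion request so the server has something to measure, then scrapes /metrics:

```python
# Quick check that metrics populate only after a request has been served.
# Assumes a vLLM OpenAI-compatible server on localhost:8000; the model
# name below is a placeholder.
import requests

BASE = "http://localhost:8000"

# 1. Issue a completion request so the server has something to measure.
resp = requests.post(
    f"{BASE}/v1/completions",
    json={"model": "my-model", "prompt": "Hello", "max_tokens": 16},
)
resp.raise_for_status()

# 2. Scrape the Prometheus endpoint and print any vllm metric lines.
metrics = requests.get(f"{BASE}/metrics").text
print("\n".join(line for line in metrics.splitlines() if "vllm" in line))
```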
This debugging isn't really relevant to this thread, so I'm going to move further discussion to #2850, where it is.
Hello @hmellor, it seems that the discussion has moved elsewhere. Can this issue be closed?