Log time in queue per request #4949
Conversation
pkg/scheduler/scheduler.go (outdated)

r.queueSpan.Finish()
level.Info(s.log).Log("msg", "querier request dequeued", "queryID", r.queryID,
It could be interesting to add the querier ID too, so that we can see if they dequeue fairly across all of them.
Q: Wondering if the original plan is to bring this queue information as a part of logging in metrics.go?
@kavirajk adding this information as part of the logging in metrics.go makes sense; however, we already have this information as part of the scheduler request. To add it to the metrics, we would need to somehow pass it downstream to the logql metrics. I wasn't able to find a simple way to do that, but let me know if I missed something 🙂
I really like this idea! Is it possible to get the tenant ID here as well?
@slim-bean good idea. Added in commit 2421761.
From what I understood by looking at the enqueueRequest method, a query can be multi-tenant. Therefore, I used the method tenant.TenantIDsFromOrgID with the scheduled request's orgID (r.userID).
However, the distinction between tenant and orgID is not clear to me in this context. What is the difference between a tenant and an org?
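For illustration, a minimal sketch of what the dequeue log line could look like once the queue time, querier ID, and tenant IDs are all included. This is not the PR's actual code: the helper, its parameters, and the import paths are assumptions based on the discussion above.

```go
import (
	"strings"
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/grafana/dskit/tenant" // assumed import path for TenantIDsFromOrgID
)

// logDequeuedRequest is a hypothetical helper the scheduler could call when a
// request is handed to a querier.
func logDequeuedRequest(logger log.Logger, queryID uint64, orgID, querierID string, enqueueTime time.Time) {
	// A single request can be multi-tenant, so resolve all tenant IDs from the orgID.
	tenantIDs, err := tenant.TenantIDsFromOrgID(orgID)
	if err != nil {
		tenantIDs = []string{orgID} // fall back to the raw orgID if it cannot be parsed
	}
	level.Info(logger).Log(
		"msg", "querier request dequeued",
		"queryID", queryID,
		"tenantIDs", strings.Join(tenantIDs, ","),
		"querier", querierID,
		"queueTime", time.Since(enqueueTime),
	)
}
```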
(force-pushed from a5e55af to 2421761)
When a query is requested, it is split into smaller queries based on the shard values. All of these smaller queries are then handled by the query-scheduler and distributed to the querier workers. This means that this PR will increase the number of log lines emitted for each query requested. As a result, and taking into account the constant query requests from …
This will definitely introduce a lot of logging, multiple entries per query. I have a thought: I wonder if we should include a configurable threshold, say, log only if a request exceeds a certain time in the queue. And if we did that, it would be nice if it were part of the runtime config so we could change it on the fly if we wanted: https://github.com/grafana/loki/blob/main/pkg/runtime/config.go. This would also allow it to be configured per tenant at runtime. The parameter could take a value of -1 for disabled, 0 for log all, and a duration for 'log slower than'. What do we think?
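For illustration, a rough sketch of how the suggested 'log slower than' threshold could look; the config field, its name, and the helper below are hypothetical, not existing Loki options.

```go
import "time"

// Hypothetical per-tenant runtime setting: -1 disables queue-time logging,
// 0 logs every request, and any positive duration logs only requests that
// were queued longer than that.
type QueueLogConfig struct {
	LogQueueTimeThreshold time.Duration `yaml:"log_queue_time_threshold"`
}

// shouldLogQueueTime applies the -1 / 0 / "slower than" semantics proposed above.
func shouldLogQueueTime(threshold, queueTime time.Duration) bool {
	switch {
	case threshold < 0:
		return false
	case threshold == 0:
		return true
	default:
		return queueTime >= threshold
	}
}
```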
@ssncferreira could you include an example of what the log line looks like currently?
I'm also wondering now if another approach would be to log this info in our 'metrics.go' line, which could then show an aggregated result in the query frontend metrics.go.
As of commit 2421761, for the test query:
the log lines from this PR are shown as:
This simple test query generated 3 subsets of queries:
Each of these subsets generates smaller queries across 16 shards, meaning this simple test query generated 48 log lines (3 × 16). These tests were done on the dev cluster and can be seen here.
@slim-bean yes, I'm starting to think that this is the best approach, and simpler for then analyzing the logs. And I think I prefer this approach to the configurable threshold. Nevertheless, I still need to investigate further how to achieve this... how to pass the enqueue time that is already calculated on the scheduler to the logql stats metrics 🕵️‍♀️
@ssncferreira I know that we currently send metrics in headers which can be printed by logcli. Could we take a similar approach where each sub-query could include this enqueue time in metadata in the request it sends back to the query frontend, and let the query frontend publish the metric for each sub-query once the query has finished?
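Purely as a sketch of that idea (none of these names are from the Loki codebase): each sub-query response could carry its queue time back in a header, and the query frontend would aggregate them once the query finishes.

```go
import (
	"strconv"
	"time"

	"github.com/weaveworks/common/httpgrpc" // assumed import path for the httpgrpc types
)

// Hypothetical header carrying a sub-query's queue time, in nanoseconds.
const subQueryQueueTimeHeader = "X-Subquery-Queue-Time"

// totalQueueTime sums the queue times reported by all sub-query responses so the
// frontend could publish one aggregated value per query.
func totalQueueTime(responses []*httpgrpc.HTTPResponse) time.Duration {
	var total time.Duration
	for _, resp := range responses {
		for _, h := range resp.Headers {
			if h.Key != subQueryQueueTimeHeader || len(h.Values) == 0 {
				continue
			}
			if ns, err := strconv.ParseInt(h.Values[0], 10, 64); err == nil {
				total += time.Duration(ns)
			}
		}
	}
	return total
}
```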
one nit, but LGTM
(force-pushed from 731e6fb to 40cf768)
(force-pushed from f0f0a84 to 3b11ac8)
Commit 3b11ac8 presents a different solution that introduces the queue time as part of the query statistics. The new metrics.go log line is now presented as:
Additionally, the query stats field in the JSON response is now presented with an enqueue time field:
Example in dev loki-bigtable:
This is looking great. A few suggestions:
- Can we rename enqueue_time to queue_time? The former would be the moment it enters the queue, whereas the latter more correctly refers to how long the query was queued.
- Can we store the queue time in nanoseconds (int64(<time.Duration>))? This will give us more granular data, especially for queries which are queued for less than a second (which is expected); a small conversion sketch follows below.
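A small sketch of the nanosecond round-trip from the second suggestion; the values and variable names are illustrative.

```go
// time.Duration is an int64 nanosecond count, so the conversion is lossless
// even for sub-second queue times.
queueTime := 250 * time.Millisecond
ns := int64(queueTime)        // 250000000 ns, what would be stored or sent
restored := time.Duration(ns) // reads back as 250ms on the other side
```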
@@ -168,7 +171,8 @@ func (i *Ingester) Merge(m Ingester) {
 func (r *Result) Merge(m Result) {
 	r.Querier.Merge(m.Querier)
 	r.Ingester.Merge(m.Ingester)
-	r.ComputeSummary(time.Duration(int64((r.Summary.ExecTime + m.Summary.ExecTime) * float64(time.Second))))
+	r.ComputeSummary(time.Duration(int64((r.Summary.ExecTime+m.Summary.ExecTime)*float64(time.Second))),
Not required for this PR, but I wonder why we store execTime as seconds instead of nanoseconds 🤔
Yes, I agree that this could also be stored as nanoseconds. This way it would also stay consistent with the queueTime. I can address this in a separate PR 👍
Here is the associated PR: #5034
(force-pushed from f63cc6e to eacd8d1)
* Rename enqueue_time to queue_time
* Store queue time in nanoseconds (int64)
* Use CanonicalMIMEHeaderKey for setting the httpgrpc header
(force-pushed from eacd8d1 to e544557)
LGTM, nice work.
@ssncferreira awesome work, this looks great!
What this PR does / why we need it:
Add query enqueue time to the result statistics and metrics.go.

Incoming requests are handled by the query-frontend, which pushes each request into an internal queue. The queriers pull the requests from this queue and execute them. The time each request stays in the queue is observed by the metric cortex_query_scheduler_queue_duration_seconds.

The idea of this PR is to add this metric value to the query's result statistics as well as to the metrics.go log line. This is accomplished by setting a new header, X-Query-Enqueue-Time, on the query HTTP gRPC request at the scheduler. A new HTTP middleware function, ExtractQueryMetricsMiddleware, is responsible for extracting this header at the queriers; the value is then used to populate the new EnqueueTime field of the result statistics and the metrics.go log line.
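A condensed, illustrative sketch of the header round-trip described above. Only the X-Query-Enqueue-Time header name and the use of CanonicalMIMEHeaderKey come from this PR; the helper functions, their signatures, and the exact wiring are assumptions, not the PR's actual code.

```go
import (
	"net/http"
	"net/textproto"
	"strconv"
	"time"

	"github.com/weaveworks/common/httpgrpc" // assumed import path for the httpgrpc types
)

const queryEnqueueTimeHeader = "X-Query-Enqueue-Time"

// Scheduler side: attach the measured queue time (in nanoseconds) to the
// request forwarded to the querier.
func injectEnqueueTime(req *httpgrpc.HTTPRequest, queueTime time.Duration) {
	req.Headers = append(req.Headers, &httpgrpc.Header{
		Key:    textproto.CanonicalMIMEHeaderKey(queryEnqueueTimeHeader),
		Values: []string{strconv.FormatInt(int64(queueTime), 10)},
	})
}

// Querier side: a middleware in this spirit reads the header back so the value
// can be recorded in the query statistics and the metrics.go log line.
func extractEnqueueTime(r *http.Request) (time.Duration, error) {
	v := r.Header.Get(queryEnqueueTimeHeader)
	if v == "" {
		return 0, nil
	}
	ns, err := strconv.ParseInt(v, 10, 64)
	return time.Duration(ns), err
}
```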
Which issue(s) this PR fixes:
Fixes #4774
Special notes for your reviewer:
Checklist
* CHANGELOG.md entry added about the changes