Log time in queue per request #4949
Conversation
pkg/scheduler/scheduler.go (outdated)

r.queueSpan.Finish()
level.Info(s.log).Log("msg", "querier request dequeued", "queryID", r.queryID,
It could be interesting to add the querier ID too, so that we can see if they dequeue fairly across all of them.
Q: Wondering if the original plan is to bring this queue information as a part of logging in metrics.go?
@kavirajk adding this information as part of the logging in metrics.go makes sense; however, we already have this information as part of the scheduler request. To add it to the metrics, we would need to somehow pass it downstream to the logql metrics. I wasn't able to find a simple way to do that, but let me know if I missed something 🙂
I really like this idea! Is it possible to get the tenant ID here as well?
@slim-bean good idea. Added in commit 2421761.
From what I understood by looking at the enqueueRequest method, a query can be multi-tenant. Therefore, I used the method tenant.TenantIDsFromOrgID with the scheduled request's orgID (r.userID).
However, the distinction between tenant and orgID is not clear to me in this context. What is the difference between a tenant and an org?
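For illustration, a minimal sketch of what the dequeue log line could look like once the queue time, querier ID, and tenant IDs are all included. This is not the PR's actual code: the helper, its parameters, and the import paths are assumptions based on the discussion above.

```go
import (
	"strings"
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/grafana/dskit/tenant" // assumed import path for TenantIDsFromOrgID
)

// logDequeuedRequest is a hypothetical helper the scheduler could call when a
// request is handed to a querier.
func logDequeuedRequest(logger log.Logger, queryID uint64, orgID, querierID string, enqueueTime time.Time) {
	// A single request can be multi-tenant, so resolve all tenant IDs from the orgID.
	tenantIDs, err := tenant.TenantIDsFromOrgID(orgID)
	if err != nil {
		tenantIDs = []string{orgID} // fall back to the raw orgID if it cannot be parsed
	}
	level.Info(logger).Log(
		"msg", "querier request dequeued",
		"queryID", queryID,
		"tenantIDs", strings.Join(tenantIDs, ","),
		"querier", querierID,
		"queueTime", time.Since(enqueueTime),
	)
}
```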
(force-pushed from a5e55af to 2421761)
When a query is requested, it is split into smaller queries based on the shard values. All of these smaller queries are then handled by the query-scheduler and distributed to the querier workers. This means that this PR will increase the number of log lines emitted for each query requested. As a result, and taking into account the constant query requests from …
This will definitely introduce a lot of logging, multiple entries per query. I have a thought: I wonder if we should include a configurable threshold, say, log only if a request exceeds a certain time in the queue. And if we did that, it would be nice if it were part of the runtime config so we could change it on the fly if we wanted: https://github.com/grafana/loki/blob/main/pkg/runtime/config.go. This would also allow it to be configured per tenant at runtime. The parameter could take a value of -1 for disabled, 0 for log all, and a duration for 'log slower than'. What do we think?
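For illustration, a rough sketch of how the suggested 'log slower than' threshold could look; the config field, its name, and the helper below are hypothetical, not existing Loki options.

```go
import "time"

// Hypothetical per-tenant runtime setting: -1 disables queue-time logging,
// 0 logs every request, and any positive duration logs only requests that
// were queued longer than that.
type QueueLogConfig struct {
	LogQueueTimeThreshold time.Duration `yaml:"log_queue_time_threshold"`
}

// shouldLogQueueTime applies the -1 / 0 / "slower than" semantics proposed above.
func shouldLogQueueTime(threshold, queueTime time.Duration) bool {
	switch {
	case threshold < 0:
		return false
	case threshold == 0:
		return true
	default:
		return queueTime >= threshold
	}
}
```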
@ssncferreira could you include an example of what the log line looks like currently?
I'm also wondering now if another approach would be to log this info in our 'metrics.go' line, which could then show an aggregated result in the query frontend metrics.go.
As of commit 2421761, for the test query:
the log lines from this PR are shown as:
This simple test query generated 3 subsets of queries:
Each of these subsets generates smaller queries across 16 shards, meaning this simple test query generated 48 log lines (3 × 16). These tests were done on the dev cluster and can be seen here.
@slim-bean yes, I'm starting to think that this is the best approach, and simpler for then analyzing the logs. And I think I prefer this approach to the configurable threshold. Nevertheless, I still need to investigate further how to achieve this... how to pass the enqueue time that is already calculated on the scheduler to the logql stats metrics 🕵️‍♀️
@ssncferreira I know that we currently send metrics in headers which can be printed by logcli. Could we take a similar approach where each sub-query could include this enqueue time in metadata in the request it sends back to the query frontend, and let the query frontend publish the metric for each sub-query once the query has finished?
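Purely as a sketch of that idea (none of these names are from the Loki codebase): each sub-query response could carry its queue time back in a header, and the query frontend would aggregate them once the query finishes.

```go
import (
	"strconv"
	"time"

	"github.com/weaveworks/common/httpgrpc" // assumed import path for the httpgrpc types
)

// Hypothetical header carrying a sub-query's queue time, in nanoseconds.
const subQueryQueueTimeHeader = "X-Subquery-Queue-Time"

// totalQueueTime sums the queue times reported by all sub-query responses so the
// frontend could publish one aggregated value per query.
func totalQueueTime(responses []*httpgrpc.HTTPResponse) time.Duration {
	var total time.Duration
	for _, resp := range responses {
		for _, h := range resp.Headers {
			if h.Key != subQueryQueueTimeHeader || len(h.Values) == 0 {
				continue
			}
			if ns, err := strconv.ParseInt(h.Values[0], 10, 64); err == nil {
				total += time.Duration(ns)
			}
		}
	}
	return total
}
```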
one nit, but LGTM
(force-pushed from 731e6fb to 40cf768)
(force-pushed from f0f0a84 to 3b11ac8)
Commit 3b11ac8 presents a different solution that introduces the queue time as part of the query statistics. The new metrics.go log line is now presented as:
Additionally, the query stats field in the JSON response is now presented with an enqueue time field:
Example in dev loki-bigtable:
This is looking great. A few suggestions:
- Can we rename enqueue_time to queue_time? The former would be the moment it enters the queue, whereas the latter more correctly refers to how long the query was queued.
- Can we store the queue time in nanoseconds (int64(<time.Duration>))? This will give us more granular data, especially for queries which are queued for less than a second (which is expected); a small conversion sketch follows below.
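A small sketch of the nanosecond round-trip from the second suggestion; the values and variable names are illustrative.

```go
// time.Duration is an int64 nanosecond count, so the conversion is lossless
// even for sub-second queue times.
queueTime := 250 * time.Millisecond
ns := int64(queueTime)        // 250000000 ns, what would be stored or sent
restored := time.Duration(ns) // reads back as 250ms on the other side
```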
@@ -168,7 +171,8 @@ func (i *Ingester) Merge(m Ingester) {
 func (r *Result) Merge(m Result) {
 	r.Querier.Merge(m.Querier)
 	r.Ingester.Merge(m.Ingester)
-	r.ComputeSummary(time.Duration(int64((r.Summary.ExecTime + m.Summary.ExecTime) * float64(time.Second))))
+	r.ComputeSummary(time.Duration(int64((r.Summary.ExecTime+m.Summary.ExecTime)*float64(time.Second))),
Not required for this PR, but I wonder why we store execTime as seconds instead of nanoseconds 🤔
Yes, I agree that this could also be stored as nanoseconds. This way it would also stay consistent with the queueTime. I can address this in a separate PR 👍
Here is the associated PR: #5034
(force-pushed from f63cc6e to eacd8d1)
* Rename enqueue_time to queue_time
* Store queue time in nanoseconds (int64)
* Use CanonicalMIMEHeaderKey for setting the httpgrpc header
(force-pushed from eacd8d1 to e544557)
LGTM, nice work.
@ssncferreira awesome work, this looks great!
What this PR does / why we need it:
Add query enqueue time to the result statistics and metrics.go.

Incoming requests are handled by the query-frontend, which pushes each request into an internal queue. The queriers pull the requests from this queue and execute them. The time each request stays in the queue is observed by the metric cortex_query_scheduler_queue_duration_seconds.

The idea of this PR is to add this metric value to the query's result statistics as well as to the metrics.go log line. This is accomplished by setting a new header, X-Query-Enqueue-Time, on the query HTTP gRPC request at the scheduler. A new HTTP middleware function, ExtractQueryMetricsMiddleware, is responsible for extracting this header at the queriers; the value is then used to populate the new EnqueueTime field of the result statistics and the metrics.go log line.
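A condensed, illustrative sketch of the header round-trip described above. Only the X-Query-Enqueue-Time header name and the use of CanonicalMIMEHeaderKey come from this PR; the helper functions, their signatures, and the exact wiring are assumptions, not the PR's actual code.

```go
import (
	"net/http"
	"net/textproto"
	"strconv"
	"time"

	"github.com/weaveworks/common/httpgrpc" // assumed import path for the httpgrpc types
)

const queryEnqueueTimeHeader = "X-Query-Enqueue-Time"

// Scheduler side: attach the measured queue time (in nanoseconds) to the
// request forwarded to the querier.
func injectEnqueueTime(req *httpgrpc.HTTPRequest, queueTime time.Duration) {
	req.Headers = append(req.Headers, &httpgrpc.Header{
		Key:    textproto.CanonicalMIMEHeaderKey(queryEnqueueTimeHeader),
		Values: []string{strconv.FormatInt(int64(queueTime), 10)},
	})
}

// Querier side: a middleware in this spirit reads the header back so the value
// can be recorded in the query statistics and the metrics.go log line.
func extractEnqueueTime(r *http.Request) (time.Duration, error) {
	v := r.Header.Get(queryEnqueueTimeHeader)
	if v == "" {
		return 0, nil
	}
	ns, err := strconv.ParseInt(v, 10, 64)
	return time.Duration(ns), err
}
```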
Which issue(s) this PR fixes:
Fixes #4774
Special notes for your reviewer:
Checklist
* CHANGELOG.md entry added about the changes