[performance] metrics query: range vector support streaming agg when no overlap #7380
Conversation
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
- logql -2.7%
+ loki 0%
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0.1%
+ iter 0%
+ storage 0%
+ chunkenc 0%
- logql -2.3%
+ loki 0%
Took a quick look and it looks super nice. Thanks for this PR @liguozhong.
Will take a deeper look today and wait for a few more eyes before merging.
Co-authored-by: Kaviraj Kanagaraj <[email protected]>
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
- logql -2.3%
+ loki 0%
Hey @liguozhong, thanks for your contribution. Looks great.
Also, I would like to see these things before merging it:
- Tests that compare the results from both approaches. Let's say, we can create some mock data and run some queries against this data using streamingAggregator and batchRangeVectorIterator to compare the accuracy of the results (see the test sketch after this comment).
- We need to mention these changes in https://github.com/grafana/loki/blob/main/CHANGELOG.md
- It would be awesome if you could add more comments to the methods that you added, with a description of how these methods work. For example, loki/pkg/logql/range_vector.go, line 146 in bc56d7e:
// load the next sample range window.
and loki/pkg/logql/range_vector.go, lines 280 to 284 in bc56d7e:
// extrapolatedRate function is taken from prometheus code promql/functions.go:59
// extrapolatedRate is a utility function for rate/increase/delta.
// It calculates the rate (allowing for counter resets if isCounter is true),
// extrapolates if the first/last sample is close to the boundary, and returns
// the result as either per-second (if isRate is true) or overall.
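To illustrate the shape such a comparison test could take, here is a minimal self-contained Go sketch. It does not call the PR's actual streamingAggregator or batchRangeVectorIterator (their signatures are not shown in this thread, and the package name is hypothetical); instead it checks a toy streaming sum against a toy batch sum over the same mock samples.

```go
package logql_test

import (
	"math"
	"math/rand"
	"testing"
)

// batchSumOverTime buffers the whole window first and then aggregates,
// mirroring the batch-style (r.window) range-vector evaluation.
func batchSumOverTime(samples []float64) float64 {
	window := make([]float64, 0, len(samples)) // the window buffer
	window = append(window, samples...)
	sum := 0.0
	for _, v := range window {
		sum += v
	}
	return sum
}

// streamingSumOverTime folds each sample into a running value,
// never materialising the window.
func streamingSumOverTime(samples []float64) float64 {
	sum := 0.0
	for _, v := range samples {
		sum += v
	}
	return sum
}

// TestStreamingMatchesBatch checks both approaches agree on mock data.
func TestStreamingMatchesBatch(t *testing.T) {
	rng := rand.New(rand.NewSource(42))
	samples := make([]float64, 1000)
	for i := range samples {
		samples[i] = rng.Float64() * 100
	}
	got, want := streamingSumOverTime(samples), batchSumOverTime(samples)
	if math.Abs(got-want) > 1e-9 {
		t.Fatalf("streaming=%f batch=%f", got, want)
	}
}
```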
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
- logql -2.3%
+ loki 0%
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
- logql -2.3%
+ loki 0%
Done, thank you for taking the time to help me review this PR.
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
- logql -2.3%
+ loki 0%
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
- logql -2.2%
+ loki 0%
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
- logql -2.2%
+ loki 0%
level=info ts=2022-11-14T11:26:56.866609673Z caller=metrics.go:143 component=frontend org_id=1662_qaawmopdln latency=slow query="avg_over_time({log_type="service_metrics", module="api_server", operation="InvokeFunction"} |= total_bytes=22GB duration=1m12.497572269s
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
- logql -2.2%
+ loki 0%
@liguozhong sorry for the delay. Reviewing it.
Thanks, I am very patient. Our team's plan for the second half of the year is to optimize metrics queries.
LGTM. thanks @liguozhong
metrics query: range vector support streaming agg when no overlap (grafana#7380)
What this PR does / why we need it:
We are trying to speed up metrics queries; we have 102 recording rules for BI analysis that run as LogQL metrics queries, for example:
avg_over_time({log_type="service_metrics", module="eerouter",operation="MonitorContainer"}[59s] | json | memoryLimitInMB>=4096 | unwrap containersRetiring | __error__="") by (accountID, memoryLimitInMB)
However, LogQL often times out, which forces us to keep reducing the range from [60s] to [30s] or even [5s].
### We hope that the BI granularity can be at least 60s.
Therefore, our team focused on analyzing the range-function code in Loki, and we found that the current implementation collects all the data points within the 60s window (r.window) and then performs the calculation.
This is a bit like batch computation in Hadoop or Hive. Our team has some prior experience with Storm, so we modified it a little:
we turned the xxx_over_time range aggregation into a streaming calculation, which removes the huge batch cache (r.window). That cache is very bad at our current data volume, which can reach 1 million points per 60s.
The streaming calculation does not need r.window; it folds each sample into the running result as it arrives (see the sketch below).
Streaming mode cannot work correctly when windows overlap, so in the overlap case this PR still uses Loki's existing batch aggregation mode.
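As a rough illustration of the idea (not the PR's actual implementation; the type names, the per-sample fold, and the no-overlap check below are assumptions made for this sketch), the following Go snippet contrasts a buffered window with a streaming fold and shows the condition under which streaming is safe: windows do not overlap when the query step is at least as large as the selection range.

```go
package main

import (
	"fmt"
	"time"
)

type sample struct {
	ts time.Time
	v  float64
}

// batchCountOverTime keeps every point of the window in memory (like r.window)
// and aggregates only once the window is fully loaded.
func batchCountOverTime(window []sample) float64 {
	return float64(len(window))
}

// streamingCountOverTime folds each sample into a running count as it is read,
// so the window never has to be materialised.
type streamingCountOverTime struct{ count float64 }

func (s *streamingCountOverTime) add(_ sample)    { s.count++ }
func (s *streamingCountOverTime) result() float64 { return s.count }

// canStream reports whether windows are non-overlapping: each sample then
// belongs to exactly one window, so it can be consumed once and discarded.
func canStream(step, selRange time.Duration) bool {
	return step >= selRange
}

func main() {
	window := []sample{{time.Now(), 1}, {time.Now(), 2}, {time.Now(), 3}}

	agg := &streamingCountOverTime{}
	for _, s := range window {
		agg.add(s)
	}
	fmt.Println(batchCountOverTime(window), agg.result())  // 3 3: both approaches agree
	fmt.Println(canStream(60*time.Second, 59*time.Second)) // true: [59s] range with a 60s step
}
```

With a 59s range and a step of 60s or more, every sample falls into exactly one window, so it can be folded in once and dropped instead of being held in a window buffer.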
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Benchmark report: less memory reduces GC risk, and it is faster (a reproduction sketch follows).
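A before/after comparison like this can be reproduced with Go's built-in benchmarking. The sketch below is hypothetical (the benchmark name and package do not come from the PR) and reuses toy aggregations, but running it with go test -bench=SumOverTime -benchmem reports allocations per operation alongside speed, which is where the GC-pressure difference becomes visible.

```go
package logql_test

import "testing"

// BenchmarkSumOverTime compares a buffered window against a streaming fold.
// Illustrative only; the PR's real benchmarks may be named differently.
func BenchmarkSumOverTime(b *testing.B) {
	points := make([]float64, 1_000_000) // ~1 million points per 60s window

	b.Run("batch", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			window := make([]float64, 0, len(points)) // allocates the whole window
			window = append(window, points...)
			sum := 0.0
			for _, v := range window {
				sum += v
			}
			_ = sum
		}
	})

	b.Run("streaming", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sum := 0.0 // no window buffer at all
			for _, v := range points {
				sum += v
			}
			_ = sum
		}
	})
}
```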
BI analysis example
range = [30s], duration = 31s. Very slow.
error log
msg="GET /loki/api/v1/query_range?direction=BACKWARD&end=1665472454000000000&limit=593&query=sum(count_over_time({log_type="service_metrics",module="api_server",operation="InvokeFunction"}|json|runtime="custom-container"|registryType="acree"|acclType="Default"|__error__=""[59s]))&shards=0_of_16&start=1665471552000000000&step=2.000000 (504) 1m12.284221346s Response: \"Request timed out, decrease the duration of the request or add more label matchers (prefer exact match over regex match) to reduce the amount of data processed.\\n\" ws: false; X-Query-Queue-Time: 58.69µs; X-Scope-Orgid: test; uber-trace-id: 3d3c62f404360341:0c0f61a3108670b4:670a8d2a689862e1:0;
Checklist
- CONTRIBUTING.md guide
- CHANGELOG.md updated
- docs/sources/upgrading/_index.md