
[performance] metrics query: range vector support streaming agg when no overlap #7380

Conversation

@liguozhong (Contributor) commented on Oct 10, 2022

What this PR does / why we need it:

We are trying to speed up metrics queries. We have 102 recording rules for BI analysis that run as LogQL metrics queries.

avg_over_time({log_type="service_metrics", module="eerouter",operation="MonitorContainer"}[59s] | json | memoryLimitInMB>=4096 | unwrap containersRetiring | __error__="") by (accountID, memoryLimitInMB)

But LogQL often times out, which forces us to keep reducing the range from [60s] to [30s] or even [5s]. We hope the BI granularity can be at least 60s.

Therefore, our team analyzed the range-function code in Loki and found that the current implementation first collects all data points within the 60s window into a buffer (r.window) and only then runs the calculation.

func (r *batchRangeVectorIterator) Next() bool {
	....
	// load every sample for the next window into r.window
	r.load(rangeStart, rangeEnd)
	return true
}

func (r *batchRangeVectorIterator) load(start, end int64) {
	for lbs, sample, hasNext := r.iter.Peek(); hasNext; lbs, sample, hasNext = r.iter.Peek() {
		....
		// each sample in the window is buffered as a promql.Point
		p := promql.Point{
			T: sample.Timestamp,
			V: sample.Value,
		}
		series.Points = append(series.Points, p)
	}
}

func (r *batchRangeVectorIterator) At() (int64, promql.Vector) {
	...
	// the aggregation runs over the whole buffered window at once
	for _, series := range r.window {
		r.at = append(r.at, promql.Sample{
			Point: promql.Point{
				V: r.agg(series.Points),
				T: ts,
			},
			Metric: series.Metric,
		})
	}
	return ts, r.at
}

func countOverTime(samples []promql.Point) float64 {
	return float64(len(samples))
}

This is a bit like batch computation in Hadoop or Hive. Our team has some prior experience with Storm, so we modified it a little.

We turn the range aggregation (xxx_over_time) into a streaming calculation, which removes the need for the huge batch cache (r.window). That cache is very bad for our current data volume: our data may reach 1 million points per 60s.

The streaming calculation does not need r.window; it folds each sample into the aggregate as it arrives.
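As a rough illustration (my own back-of-the-envelope estimate, not a measurement from the PR): a promql.Point holds an int64 timestamp and a float64 value, i.e. 16 bytes, so buffering 1 million points in r.window is on the order of 16 MB of buffered data per evaluation step, while a streaming accumulator such as CountOverTime below keeps only a single float64 per series.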

type CountOverTime struct {
	count float64
}

// agg folds one sample into the running aggregate; nothing is buffered.
func (a *CountOverTime) agg(sample promql.Point) {
	a.count++
}

// at returns the aggregated value for the current window.
func (a *CountOverTime) at() float64 {
	return a.count
}

Streaming mode cannot work well when windows overlap, so in the overlapping case this PR still uses Loki's existing batch aggregation mode.
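For illustration, here is a minimal, self-contained sketch (my reading of the idea, not the PR's actual code) of the condition that separates the two modes: when the query step is at least as large as the selector range, consecutive windows never share samples, so each sample can safely be folded into a running aggregate exactly once.

package main

import (
	"fmt"
	"time"
)

// canStreamAgg is a hypothetical helper: with windows of width selRange
// evaluated every step, two consecutive windows only share samples when
// step < selRange. If they never overlap, each sample belongs to exactly
// one window and can be aggregated immediately, with no r.window buffer.
func canStreamAgg(step, selRange time.Duration) bool {
	return step >= selRange
}

func main() {
	fmt.Println(canStreamAgg(60*time.Second, 59*time.Second)) // true: no overlap, streaming agg
	fmt.Println(canStreamAgg(2*time.Second, 59*time.Second))  // false: overlap, keep batch agg
}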

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Benchmark_RangeVectorIteratorCompare
Benchmark_RangeVectorIteratorCompare/streaming_agg
Benchmark_RangeVectorIteratorCompare/streaming_agg-12         	  133740	      8583 ns/op	    3967 B/op	      65 allocs/op
Benchmark_RangeVectorIteratorCompare/batch_agg
Benchmark_RangeVectorIteratorCompare/batch_agg-12             	  100306	     11885 ns/op	   34702 B/op	      35 allocs/op
PASS

Benchmark report: less memory reduces GC risk, and the streaming mode is faster.

StreamingAgg
Benchmark_RangeVectorIterator-12    	  132342	      8214 ns/op	    3968 B/op	      66 allocs/op

BatchAgg
Benchmark_RangeVectorIterator-12    	   95154	     12521 ns/op	   34748 B/op	      36 allocs/op
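For readers who want to reproduce a comparison like this, here is a rough sketch of how such a benchmark can be structured with the standard testing package (hypothetical setup, not the PR's actual benchmark code; it assumes the logql package with the CountOverTime and countOverTime shown above):

package logql

import (
	"testing"

	"github.com/prometheus/prometheus/promql"
)

// syntheticPoints builds a flat synthetic window of n samples.
func syntheticPoints(n int) []promql.Point {
	pts := make([]promql.Point, n)
	for i := range pts {
		pts[i] = promql.Point{T: int64(i), V: 1}
	}
	return pts
}

func Benchmark_RangeVectorIteratorCompare(b *testing.B) {
	pts := syntheticPoints(1000)

	b.Run("streaming_agg", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			agg := &CountOverTime{} // fold samples one by one, no window buffer
			for _, p := range pts {
				agg.agg(p)
			}
			_ = agg.at()
		}
	})

	b.Run("batch_agg", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			window := append([]promql.Point(nil), pts...) // buffer the whole window first
			_ = countOverTime(window)
		}
	})
}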

BI analysis example

[screenshot] range = [30s], duration = 31s: very slow.

[screenshot] error log:
msg="GET /loki/api/v1/query_range?direction=BACKWARD&end=1665472454000000000&limit=593&query=sum(count_over_time({log_type="service_metrics",module="api_server",operation="InvokeFunction"}|json|runtime="custom-container"|registryType="acree"|acclType="Default"|__error__=""[59s]))&shards=0_of_16&start=1665471552000000000&step=2.000000 (504) 1m12.284221346s Response: \"Request timed out, decrease the duration of the request or add more label matchers (prefer exact match over regex match) to reduce the amount of data processed.\\n\" ws: false; X-Query-Queue-Time: 58.69µs; X-Scope-Orgid: test; uber-trace-id: 3d3c62f404360341:0c0f61a3108670b4:670a8d2a689862e1:0;

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@liguozhong liguozhong requested a review from a team as a code owner October 10, 2022 12:02
@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-2.7%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-2.3%
+               loki	0%

@kavirajk (Contributor) left a comment


Took a quick look, and it looks super nice. Thanks for this PR @liguozhong.

I will take a deeper look today and wait for a few more eyes before merging.

3 review threads on pkg/logql/range_vector_test.go (outdated, resolved)
liguozhong and others added 3 commits October 11, 2022 18:48
@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-2.3%
+               loki	0%

@vlad-diachenko (Contributor) left a comment


Hey @liguozhong, thanks for your contribution. Looks great.
Also, I would like to see these things before merging it:

  1. Tests that compare the results from both approaches. Let's say, we can create some mock data and run some queries against this data using streamingAggregator and batchRangeVectorIterator to compare the accuracy of the results.
  2. We need to mention these changes in https://github.com/grafana/loki/blob/main/CHANGELOG.md
  3. It would be awesome if you could add more comments to the methods that you added, with a description of how these methods work. For example,
    // load the next sample range window.
    describes this method only in general. It would be helpful to see something like this:
    // extrapolatedRate function is taken from prometheus code promql/functions.go:59
    // extrapolatedRate is a utility function for rate/increase/delta.
    // It calculates the rate (allowing for counter resets if isCounter is true),
    // extrapolates if the first/last sample is close to the boundary, and returns
    // the result as either per-second (if isRate is true) or overall.

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-2.3%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0.1%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-2.3%
+               loki	0%

@liguozhong (Contributor, Author)

(Quoting @vlad-diachenko's review request above: comparison tests for both approaches, a CHANGELOG.md entry, and more detailed comments on the new methods.)

Done, thank you for taking the time to help me review this PR.
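For context, a minimal sketch of the kind of comparison test the reviewer requested might look like this (hypothetical, not the PR's actual test; it assumes the logql package and the CountOverTime / countOverTime pair from the PR description):

package logql

import (
	"testing"

	"github.com/prometheus/prometheus/promql"
)

func Test_StreamingAggMatchesBatchAgg(t *testing.T) {
	// Mock data: one non-overlapping window of samples.
	points := []promql.Point{
		{T: 1, V: 2.5},
		{T: 2, V: 4.0},
		{T: 3, V: 1.5},
	}

	// Batch mode: aggregate over the fully buffered window.
	batch := countOverTime(points)

	// Streaming mode: fold one sample at a time, keeping no buffer.
	streaming := &CountOverTime{}
	for _, p := range points {
		streaming.agg(p)
	}

	// Both modes must produce identical results on the same input.
	if got := streaming.at(); got != batch {
		t.Fatalf("streaming agg = %v, batch agg = %v", got, batch)
	}
}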

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-2.3%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-2.2%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-2.2%
+               loki	0%

@liguozhong (Contributor, Author)

level=info ts=2022-11-14T11:26:56.866609673Z caller=metrics.go:143 component=frontend org_id=1662_qaawmopdln latency=slow query="avg_over_time({log_type="service_metrics", module="api_server", operation="InvokeFunction"} |= \"runtime\":\"custom-container\" | regexp \"registryType\":\"(?P<registryType>.*?)\"(,\|\\{\|\\[\|\\}) | regexp \"acclType\":\"(?P<acclType>.*?)\"(,\|\\{\|\\[\|\\}) | regexp \"latency\":(?P<latency>.*?)(,\|\\{\|\\[\|\\}) | unwrap latency | error="" [59s]) by (registryType, acclType)" query_type=metric range_type=range length=0s start_delta=16h57m16.866596501s end_delta=16h57m16.866596725s step=1s duration=1m12.497572269s status=200 limit=5000 returned_lines=0 throughput=304MB total_bytes=22GB total_entries=3 queue_time=15.283388s subqueries=2 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=162018 cache_chunk_bytes_fetched=0 cache_index_req=4497 cache_index_hit=4497 cache_result_req=0 cache_result_hit=0

total_bytes=22GB duration=1m12.497572269s

Hi @kavirajk, if you have time, please help continue reviewing this PR. At present we have more of these slow queries and we expect to speed them up.

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-2.2%
+               loki	0%

@vlad-diachenko (Contributor)

@liguozhong sorry for the delay. reviewing it )

@liguozhong (Contributor, Author)

@liguozhong sorry for the delay. reviewing it )

Thanks, I am very patient. Our team's plan for the second half of the year is to optimize metrics queries.
We will spend a long time on LogQL performance optimization, mainly to make our recording rules execute more smoothly. The recording rules are replacing the original BI reports in our Hadoop setup; this approach is very flexible, but the disadvantage is that it is slower. Accelerating metrics queries will make Loki more powerful.

@vlad-diachenko (Contributor) left a comment


LGTM. thanks @liguozhong

@vlad-diachenko vlad-diachenko merged commit 6d05ade into grafana:main Nov 29, 2022
@liguozhong (Contributor, Author)

[screenshot] memory: 20 GB -> 5 GB.

@liguozhong (Contributor, Author)

[screenshot] fewer code:400 queries.

Abuelodelanada pushed a commit to canonical/loki that referenced this pull request on Dec 1, 2022:
metrics query: range vector support streaming agg when no overlap (grafana#7380)