-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Top-K query optimization for ORDER BY <EXPR> [ASC|DESC] LIMIT n
#3516
Comments
Min heap is a reasonable data structure to solve the issue. |
The TopK operator can also be parallelized the same way as a multi-stage grouping Like run |
I think there are two items here
Currently a plan looks like the following
I think we should be translating it to:
Already with the current implementation we should be able to rewrite it to:
|
I agree this would help performance. However, it will become irrelevant if/when we implement a real TopK operator |
I imagine a similar optimization is still useful with a TopK operator. The example TopK operator in the tests still works only a single partition and can't run in parallel. I think you'll still need a "nested" TopK operator similar to the above plan to achieve the parallelization. So at the moment there is a TopK operator, we can change it to utilize this. Of course the TopK operator will use way less memory - but in terms of performance for simple queries that can be executed in memory, the rewrite should probably already have a big benefit. |
@Dandandan -- I had not thought of the "parallel sorts with limit" as a good intermediate state -- you are exactly right and that makes perfect sense |
After playing with it a bit more, I think this issue boils down to pushing a limit to |
So it seems we mostly have the benefits of a TopK operator now by pushing down the limit to individual operations. There are a couple of followups possible (will create some tickets for them and close this one):
|
is there an issue created for this follow-up? I didn't find it, so I create #6000. Feel free to close it if duplicated. Also, feel free to correct me if my understanding and expectation in #6000 are wrong. |
I believe this is completed by the work and we should track any upcoming issues somewhere else. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As explained by @alamb in the user defined plan:
In the ClickBench benchmarks (https://github.com/ClickHouse/ClickBench) we can see many queries that might benefit from this optimization, but we can see they are converted to a "naive" plan. See for example query 9:
Describe the solution you'd like
Implement this feature.
It basically means pushing the limit information down to
SortPreservingMergeStream
and keeping only the top N elements in the min_heap datastructure.Describe alternatives you've considered
n/a
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: